In the last few days I’ve been hearing that the COVID-19 death numbers in the US are not reliable because hospitals are inflating them. On the surface, this seems like a real possibility. The allegation is that if someone dies of pneumonia but has also tested positive for COVID-19, the death is classified as COVID-19, even if pneumonia might be the more accurate cause. The evidence offered for this is that the pneumonia numbers for March were lower than normal for the month. This is concerning, because we know that the recorded infection counts are not very consistent and are likely just a fraction of the people truly infected. This, of course, is largely due to limited testing and oversampling of the symptomatic. The death data, by contrast, has told the best story of the severity of the outbreak in different countries. So I decided to evaluate this myself to see if this is a realistic claim.
Unfortunately, there aren’t really clean, accessible datasets classifying deaths by county in the US across the year that a person who is doing COVID-19 research at night for fun can realistically engage with. So I had to come up with something that was close. First, I have a dataset from the Census that shows estimated populations per year as well as the numbers of deaths (and live births, and a few other things). This is done by county, which is what I was looking for. I blended this dataset into my COVID-19 death data by county and then was close.
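The blending step described above amounts to a join on a county key. Here is a minimal sketch, assuming both datasets are keyed by county FIPS code; the field names and the second county are hypothetical placeholders, and the New York County values are just the rounded figures quoted in this post, not the actual files.

```python
# Census-style rows keyed by county FIPS code.
# ASSUMPTION: field names are hypothetical; 12000 stands in for "over 12K".
census = {
    "36061": {"county": "New York County, NY", "deaths_2018": 12000},
    "99999": {"county": "Placeholder County", "deaths_2018": 500},  # made-up row
}

# COVID-19 deaths by county FIPS code (5820 is the figure quoted in this post).
covid_deaths = {"36061": 5820}

# Left join: keep every Census county, default to 0 COVID-19 deaths if absent.
blended = []
for fips, row in census.items():
    merged = dict(row, fips=fips, covid_deaths=covid_deaths.get(fips, 0))
    blended.append(merged)

for row in blended:
    print(row["county"], row["deaths_2018"], row["covid_deaths"])
```

Keeping every Census county (rather than only the matched ones) makes it obvious which counties have no reported COVID-19 deaths.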
What is missing from this is the breakdown of the causes of these deaths. This isn’t easily found by county in one dataset, so I assumed that the leading causes of death each year are going to be fairly consistent across the US. There may be differences in some categories (shooting deaths, farm machinery accidents, etc.), but in general the top 3-4 should be consistent. I figured I’d start by evaluating the leading causes of death in the most highly COVID-impacted county in the world, New York County (NYC).
From the above you can see that NY County had just under 10K deaths in 2016 (about 830 deaths per month). My Census dataset confirms this number, but shows that 2018 saw over 12K deaths. I’m not sure what explains the difference, so I’m going with the 2016 numbers for my base number of deaths. Thinking about which of these causes might be comorbidities with COVID-19, I see CLRD (Chronic Lower Respiratory Disease) as a very likely candidate, and 2016 saw 304 deaths from it (just over 25 per month). Stroke seems unlikely, as the symptoms of that condition are very different from COVID-19. Heart disease is obviously the big hitter, with 2,902 deaths in 2016 (about 240 per month). Heart disease deaths might be conflated with COVID-19, but it doesn’t seem very likely. Cancer at 2,526 deaths per year (210 per month) seems very unlikely to be classified as COVID-19… cancer patients often have their own hospitals, doctors, and wards, and their illness is generally well understood as cancer for long periods of time. I’m planning to rule the cancer deaths out completely. Pneumonia is 7th on this list (not shown above) with 263 deaths in 2016 (about 22 per month). So let’s say that in 2016 there were 25+240+22 deaths per month that could potentially have been comorbidities with COVID-19 during the 2020 outbreak. This comes to 287 deaths, which is 35% of the number of deaths per month in NY County. I will then assume that the exaggeration in the COVID-19 death count is limited to 35% of a county’s normal deaths over the same period. We’ll evaluate the raw number of deaths in each county that 35% of the total would equal and then compare it to the reported 2020 COVID-19 deaths. Make sense?
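For anyone who wants to check the arithmetic, here is a minimal sketch of the monthly comorbidity share using the 2016 counts quoted above. The variable names are my own, and the unrounded sum comes out slightly higher than the rounded 287.

```python
# 2016 annual death counts for New York County, as quoted in this post.
annual_deaths = {"clrd": 304, "heart_disease": 2902, "pneumonia": 263}

# "About 830 deaths per month" from just under 10K deaths in 2016.
total_monthly_deaths = 830

# Convert each annual count to a monthly average.
monthly = {cause: count / 12 for cause, count in annual_deaths.items()}

# Deaths per month that could plausibly be relabeled as COVID-19.
candidate_monthly = sum(monthly.values())

# Share of all monthly deaths that the candidate causes represent.
share = candidate_monthly / total_monthly_deaths

print(f"candidate deaths per month: {candidate_monthly:.1f}")
print(f"share of monthly deaths: {share:.1%}")
```

Whether the rounded or unrounded monthly figures are used, the share rounds to 35%.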
Above we see the most conservative case comparing COVID-19 reported deaths by county with the potential comorbidity numbers. The assumptions for this uber-conservative chart are that 1) these reported COVID-19 deaths occurred over 2 full months, so we’ll calculate 2 months’ worth of comorbidity deaths, 2) 100% of potential comorbidity deaths are positive for COVID-19 and are classified as COVID-19 deaths, and 3) 100% of heart disease deaths in these counties are classified as COVID-19. What do we see? In the hardest hit county, there were 8x as many COVID-19 deaths reported in this timeframe as the sum of the potential comorbidities. So if the New York County numbers are getting inflated by questionable death accounting practices, it is only by about 700 deaths out of a total of 5820. By the way, this 5820 is about half the total deaths that New York County should expect for a normal year! See the other counties here where the COVID-19 deaths under these rigid assumptions are at least half the number of the sum of all the other potential comorbidities, including heart disease. What this is telling us is that COVID-19 has already replaced heart disease as the leading cause of death in these counties for the whole year of 2020.
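The conservative case can be sketched as a single multiplication. Note one assumption of mine: the figure of about 700 is reproduced if the monthly base is roughly 1,000 deaths (the Census figure of over 12K per year); the post does not state which base the chart used, and the function name is hypothetical.

```python
def potential_inflation(monthly_deaths, comorbidity_share, months, positive_fraction):
    """Upper bound on deaths that could have been relabeled as COVID-19."""
    return monthly_deaths * comorbidity_share * months * positive_fraction

# ASSUMPTION: ~1,000 deaths per month (from "over 12K" per year); inferred, not stated.
upper_bound = potential_inflation(
    monthly_deaths=1000,
    comorbidity_share=0.35,   # share derived from the 2016 cause-of-death counts
    months=2,                 # assumption 1: deaths spread over 2 full months
    positive_fraction=1.0,    # assumptions 2 and 3: every such death is counted
)

reported_covid_deaths = 5820  # New York County figure quoted in this post
ratio = reported_covid_deaths / upper_bound

print(f"upper bound on inflation: {upper_bound:.0f}")
print(f"reported deaths are {ratio:.1f}x that bound")
```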
In the above table, I’ve eased the assumptions a bit to be just slightly conservative. First, I change the 2 months to 1.5, which is much more accurate for most counties. Second, and most importantly, I only assume that 30% of deaths from comorbidities during this timeframe are positive for COVID-19 (the actual number in New York state right now is still under 1% of the population, but we’ll assume that this number is 30x higher in the most susceptible populations). Now see the differences! New York County has the potential to inflate its COVID-19 deaths by only 143 on an overall number of 5820. This translates to an error of about 2.5%. And most likely that’s high too.
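As a rough check, the eased assumptions can be applied as scaling factors to the conservative bound of about 700. This is a sketch with my own variable names; rescaling gives a number in the same range as the 143 in the table but not identical to it, which is expected since the table's exact inputs are not shown here.

```python
conservative_upper_bound = 700   # approximate figure from the conservative case

months_factor = 1.5 / 2          # 1.5 months instead of 2
positive_fraction = 0.30         # 30% of comorbidity deaths assumed COVID-positive

# Rescaled bound under the eased assumptions.
eased_upper_bound = conservative_upper_bound * months_factor * positive_fraction

# Error implied by the figures quoted in the post for New York County.
table_inflation = 143
reported_covid_deaths = 5820
error_pct = table_inflation / reported_covid_deaths * 100

print(f"rescaled bound: {eased_upper_bound:.1f}")
print(f"implied error: {error_pct:.2f}%")
```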
Conclusion: There is no concern over miscalculation of COVID-19 deaths due to inflation of the numbers by counting comorbidities. I presume this concept made its way to the mainstream media news due to a desire to tell a good story or share comforting stats. Or maybe it was just a political effort. Regardless, this is why I really dislike listening to the news discuss stats about this outbreak. Across the political spectrum, they’re failing to properly report numbers and statistics. Here are my suggestions if any professional news people are reading this:
- Focus more on the data and understanding what its limitations are and less on flashy graphics. I suspect this error may have something to do with the lack of seasoned data scientists and statisticians on the news team combined with a preponderance of less-experienced, recent grads with whiz-bang Tableau visualization skills.
- Stop reporting hard numbers of Cases and Deaths if you’re comparing regions. 200 deaths in California is going to be a much less severe situation than 100 deaths in Orleans Parish, Louisiana.
- The Death to Cases ratio is garbage. Everyone wants this number because we want to compare it to the flu. We’ve been getting the flu and measuring the number of cases for hundreds of years. We have a good statistical sample and can estimate the number of cases well. We have NO idea how many people have actually been infected by COVID-19 yet. The best numbers we have are from Iceland because they randomly sampled the whole population. Their death to cases ratio, by the way, is 0.4%. This is a bit higher than typical flu numbers, but don’t rejoice quite yet, because Iceland’s testing and quarantine strategy seems to be keeping their death numbers low.
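To make the second suggestion concrete, here is a minimal sketch that converts the raw counts from that example into deaths per 100,000 residents. The populations are rounded approximations I am supplying as assumptions (roughly 39.5 million for California and 390,000 for Orleans Parish); they are not from the datasets used in this post.

```python
# ASSUMPTION: rounded population figures, supplied for illustration only.
populations = {"California": 39_500_000, "Orleans Parish, LA": 390_000}

# Hypothetical death counts from the example in the list above.
deaths = {"California": 200, "Orleans Parish, LA": 100}

# Deaths per 100,000 residents makes regions of different size comparable.
rates = {
    region: deaths[region] / populations[region] * 100_000
    for region in deaths
}

for region, rate in rates.items():
    print(f"{region}: {rate:.2f} deaths per 100,000")
```

On a per-capita basis the smaller region is hit far harder even though its raw count is half as large.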
Another excellent post. Having some unbiased data is really appreciated. This has been a question on my mind, not as the result of news media, but from comparing the current surge in hospitalizations and deaths to an average year. A few thoughts:
1. I have been hearing reports that heart disease deaths may be increasing in NYC this year and the heart muscle may be susceptible to attack by COVID-19 based on the presence of the ACE2 receptors which are apparently the target of the virus’s spike protein. So, it may be that the death rate is being underreported.
2. Hopefully social distancing will keep the death total down—within or below the range of the yearly flu—but this may well be distorted because of course a flu year doesn’t require a national shutdown. Once the data is in it would be interesting to see if flu deaths are down. Might provide another dataset with better background data to evaluate the effectiveness of social distancing.
3. I wonder if the death rate from other causes for the rest of the year and the next couple of years is affected. COVID-19, with its apparent targeting of the older segment of the population and people with underlying health conditions, may be accelerating deaths of people who would have died from other causes in the following couple of years. This could show up in the data.