Largest numbers of Active Cases and Deaths, Normalized by Population – Non-USA
Not much has changed in a few weeks in the chart above. Italy and Spain continue to have large numbers of active cases, but Switzerland has less than 1/10 of the total deaths of Spain or Italy (1/3 of their deaths when normalized by population). The UK has a growing number of deaths, but when normalized, the UK numbers are in the Switzerland camp, not the Italy camp.
Iceland continues to recover. Their cumulative flow diagram below shows that they’re managing to maintain a consistent number of active cases with still extremely low numbers of hospitalizations and deaths.
Iceland: Cumulative Flow Diagram of Confirmed Cases, Recoveries, and Deaths
Russia is also experiencing rapid growth in cases. Lots of concern is being expressed by Vladimir Putin. This is very different from the articles from a few weeks ago, when Russia was apparently able to keep their growing case numbers under wraps. See the exponential case growth in the chart below.
Map representing numbers of COVID-19 cases (color) and Case Growth rates (diameter)
Map representing numbers of COVID-19 cases (color) and Death rates (diameter)
In the above two maps, you first can see where the cases are growing fastest and then second where the death rate is increasing fastest. Obviously, case growth is occurring across the US. This is unsurprising, especially since there is more testing happening now. What we don’t really know from our data is whether these cases are symptomatic, whether they’re hospitalized, etc. The second map shows us that the deaths continue to happen in the same cluster areas, NYC/Mass/NJ/Conn, DC, New Orleans, Detroit, Chicago, Denver, Las Vegas, and Seattle. The majority of these are occurring in the NYC cluster.
State Data Table for 4/12/20
State data from 4/12
Latitudes of Cases / Deaths for US States
Cases and Deaths per 1000 by Latitude Ranges in the US
Just like with the rest of the world, the US also seems to be following the Latitude effect. Most of the cases/deaths to date have occurred between 40 and 45 degrees North. I’m currently evaluating if the fastest case/death rate growth also follows this latitude trend or not.
The above image shows the ‘S-shaped’ sigmoid curve that we’ve been hoping to see. It may indicate that the first phase of the outbreak is slowing. As you can see, the number of cases in Arizona is decelerating. Again, I hesitate to make any pronouncements, but I’ve been watching this trend for about a week. It could be that the state has been under-reporting data or some other factor could emerge. Or it might be that the infections have peaked for now and are slowing.
Across the US
Nothing much is changing for the heaviest-hit states. The slope of case rates and death rates continues to increase for these states. Vermont was showing signs of flattening out last week and has now dropped out of the top 10. https://todnewman.com/?p=416
In the last few days I’ve been hearing that the death numbers in the US were not reliable due to inflation of the numbers in the hospital. On the surface, this seems like a real possibility. The allegation is that if someone dies of Pneumonia but also has tested positive for COVID-19, the death is classified as COVID-19, even if Pneumonia might be more accurate. Evidence for this points to the Pneumonia numbers for March being lower than normal for the month. So this is concerning, because we know that the numbers of infections that are recorded are not very consistent and likely are just a fraction of the people truly infected. This, of course, is largely due to limited testing and oversampling of the symptomatic. But the death data has told the best story of the severity of the outbreak in different countries. So I decided to evaluate this myself to see if this is a realistic claim.
Unfortunately, there aren’t really clean, accessible datasets classifying deaths by county in the US across the year that a person who is doing COVID-19 research at night for fun can realistically engage with. So I had to come up with something that was close. First, I have a dataset from the Census that shows estimated populations per year as well as the numbers of deaths (and live births, and a few other things). This is done by county, which is what I was looking for. I blended this dataset into my COVID-19 death data by county and then was close.
What is missing from this is the breakdown of the causes of these deaths. This isn’t easily found by county in one dataset, so I assumed that the leading causes of deaths each year is going to be fairly consistent across the US. There may be differences in some categories (shooting deaths, farm machinery accidents, etc.), but in general the top 3-4 should be consistent. I figured I’d start by evaluating the leading causes of death in the most highly COVID-impacted county in the world, New York County (NYC).
2016 New York County Deaths by Leading Cause (https://apps.health.ny.gov/public/tabvis/PHIG_Public/lcd/reports/#county)
From the above you can see that NY County had just under 10K deaths in 2016 (about 830 deaths per month). My Census dataset confirms this number, but shows that 2018 saw over 12K deaths. Not sure what the difference was. So I’m going with the 2016 numbers for my base number of deaths. Thinking about which of these causes might be comorbidities with COVID-19, I see CLRD (Chronic Lower Respiratory Disease) as a very likely candidate, and 2016 saw 304 deaths (just over 25 per month). Stroke seems unlikely, as the symptoms of that condition are very different from COVID-19. Heart disease is obviously the big hitter, with 2,902 deaths in 2016 (about 240 per month). Maybe heart disease deaths might be conflated with COVID-19, but it doesn’t seem very likely. Cancer at 2,526 deaths per year (210 per month) seems very unlikely to be classified as COVID-19… cancer patients often have their own hospitals, doctors, and wards, and generally are well understood as cancer for long periods of time. I’m planning to rule the cancer deaths completely out. Pneumonia is 7th on this list (not shown above) with 263 deaths in 2016 (about 22 per month). So let’s say that in 2016 there were 25+240+22 deaths per month that could have potentially been comorbidities with COVID-19 during the 2020 outbreak. This translates to 287 deaths per month, which is about 35% of the roughly 830 deaths per month in NY County. I will then assume that the exaggeration in the COVID-19 death count will be limited to 35% higher than the true count. We’ll evaluate the raw number of deaths in each county that 35% of the total would equal and then compare to the reported 2020 COVID-19 deaths. Make sense?
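The monthly arithmetic above can be checked with a few lines of Python. This is only a sketch using the figures quoted in the text; the variable names are mine. Because it divides the unrounded annual numbers by 12, the monthly total comes out a couple of deaths higher than the rounded 25+240+22, but the share of all deaths lands in the same place.

```python
# Annual 2016 deaths in New York County for the candidate comorbidities,
# as quoted in the text above.
annual_deaths = {"CLRD": 304, "heart_disease": 2902, "pneumonia": 263}

# Roughly 830 deaths per month from all causes (just under 10K per year).
monthly_all_causes = 830

# Convert each annual figure to a monthly one.
monthly = {cause: n / 12 for cause, n in annual_deaths.items()}

# Sum of candidate comorbidity deaths per month and their share of all deaths.
monthly_candidates = sum(monthly.values())
share = monthly_candidates / monthly_all_causes

for cause, n in monthly.items():
    print(f"{cause}: {n:.1f} per month")
print(f"candidates: {monthly_candidates:.0f} per month ({share:.0%} of all deaths)")
```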
Most Conservative Estimate of Potential comorbidities of COVID-19. Assumptions: 1) COVID-19 total deaths happen over 2 months, so calculating 2 months of comorbidities 2) 100% of potential comorbidities are positive for COVID-19. 3) 100% of heart disease deaths would get classified as COVID-19.
Above we see the most conservative case comparing COVID-19 reported deaths by county with the potential comorbidity numbers. The assumptions for this uber-conservative chart are that 1) these reported COVID-19 deaths occurred over 2 full months (because of this assumption, we’ll calculate 2 months’ worth of comorbidities), 2) 100% of potential comorbidity deaths are positive for COVID-19 and are classified as COVID-19 deaths, and 3) 100% of heart disease deaths in these counties are classified as COVID-19. What do we see? In the hardest hit county, there were 8x as many COVID-19 deaths reported in this timeframe as the sum of the potential comorbidities. So if the New York County numbers are getting inflated by questionable death accounting practices, it is only by about 700 deaths out of a total of 5820. By the way, this 5820 is about half the total deaths that New York County should expect for a normal year! See the other counties here where the COVID-19 deaths under these rigid assumptions are at least half the number of the sum of all the other potential comorbidities, including heart disease. What this is telling us is that COVID-19 has already replaced heart disease as the leading cause of death in these counties for the whole year of 2020.
Slightly Conservative Estimate of Potential comorbidities of COVID-19. Assumptions: 1) COVID-19 total deaths happen over 1.5 months, so calculating 1.5 months of comorbidities 2) 30% of potential comorbidities are positive for COVID-19. 3) 100% of heart disease deaths would get classified as a COVID-19 comorbidity if they are positive for COVID-19.
In the above table, I’ve eased the assumptions a bit to be just slightly conservative. First, I change the 2 months to 1.5, which is much more accurate for most counties. Second, and most importantly, I only assume that 30% of deaths from comorbidities during this timeframe are positive for COVID-19 (the actual number in New York state right now is still under 1% of the population, but we’ll assume that this number is 30x higher in the most susceptible populations). Now see the differences! New York County has the potential to only inflate their COVID-19 deaths by 143 on an overall number of 5820. This translates to an error of about 2.5%. And most likely that’s high too.
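To put the two scenarios side by side, here is a small sketch that expresses each upper bound as a share of the reported New York County deaths. The inputs (700, 143, and 5820) are the values quoted in the text; the function and variable names are just for illustration.

```python
def inflation_share(potential_misclassified: float, reported: float) -> float:
    """Fraction of reported COVID-19 deaths that could be misclassified."""
    return potential_misclassified / reported

# Reported COVID-19 deaths for New York County, from the text.
reported_nyc = 5820

# Upper bounds on misclassified deaths under the two sets of assumptions.
scenarios = {"most conservative": 700, "slightly conservative": 143}

shares = {name: inflation_share(n, reported_nyc) for name, n in scenarios.items()}
for name, s in shares.items():
    print(f"{name}: {s:.1%} of reported deaths")
```

Even under the harshest assumptions, the potential inflation stays a small fraction of the reported total.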
Conclusion: There is no concern over miscalculation of COVID-19 deaths due to inflation of the numbers by counting comorbidities. I presume this concept made its way to the mainstream media news due to a desire to tell a good story or share comforting stats. Or maybe it was just a political effort. Regardless, this is why I really dislike listening to the news discuss stats about this outbreak. Across the political spectrum, they’re failing to properly report numbers and statistics. Here are my suggestions if any professional news people are reading this:
Focus more on the data and understanding what its limitations are and less on flashy graphics. I suspect this error may have something to do with the lack of seasoned data scientists and statisticians on the news team combined with a preponderance of less-experienced, recent grads with whiz-bang Tableau visualization skills.
Stop reporting hard numbers of Cases and Deaths if you’re comparing regions. 200 deaths in California is going to be a much less severe situation than 100 deaths in Orleans Parish, Louisiana.
The Death to Cases ratio is garbage. Everyone wants this number because we want to compare it to the flu. We’ve been getting the flu and measuring the number of cases for hundreds of years. We have a good statistical sample and can estimate the number of cases well. We have NO idea how many people have actually been infected by COVID-19 yet. The best numbers we have are from Iceland because they randomly sampled the whole population. Their death to cases ratio, by the way is .4%. This is a bit higher than typical flu numbers, but don’t rejoice quite yet, because Iceland’s testing and quarantine strategy seems to be keeping their death numbers low.
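As a quick illustration of why raw counts mislead when comparing regions, here is a sketch that normalizes the two hypothetical death counts above to deaths per 1000 residents. The population figures are rounded approximations that I am assuming for the example; they are not taken from my dataset.

```python
def per_1000(count: float, population: float) -> float:
    """Normalize a raw count to a rate per 1000 residents."""
    return count / population * 1000

# Approximate populations (assumed, rounded values for illustration).
california_pop = 39_500_000
orleans_parish_pop = 390_000

ca_rate = per_1000(200, california_pop)
orleans_rate = per_1000(100, orleans_parish_pop)

print(f"California: {ca_rate:.4f} deaths per 1000")
print(f"Orleans Parish: {orleans_rate:.4f} deaths per 1000")
```

Half the raw deaths in a population roughly one hundredth the size works out to a far higher per-capita rate.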
4/8 was another rough day for New York. It’s quite a different story between NYC and Los Angeles County, who has .176 cases per 1000 and .004 deaths per 1000. This is about 20x lower on cases than NYC and 50x lower on deaths. I’d really love to know why this huge difference exists. There aren’t a lot of clues in the news. I believe that NYC and California both issued shelter in place orders on the same day. Perhaps the virus was loose in NYC for weeks before any kind of reaction was taken by government, enabling a non-linear transmission effect to occur? If there’s any truth to the effects of latitude that I’ve been uncovering, that might come into play too. See chart below for US States’ Case and Death rates by latitude bands. The results in the US line up with the global results by latitude. Note that the US Population is nearly 2x as large in the 30-40 band, so the narrative can’t be that this effect is simply due to a large number of big cities in the range.
Mechanics of Building a Correlation Matrix: In case this explanation is interesting or informative to anyone puzzling over these results, to get the above correlation relationship between various features and the rate of growth of COVID-19 cases, I built a large dataset using data from Johns Hopkins (COVID-19 data), the WHO, the World Bank, and a handful of others. In this dataset, I have each country in the world captured as rows in the dataset. Each of the Features above (plus many more) is one of the columns that goes across all of the countries. This is the basic mechanics of putting together a large correlation matrix.
What does this tell us?: First off, the above table just simply lists selected features (‘Female Smoking Rate’, etc.) and their correlation using the Python Pandas correlation function. 1.0 is perfect correlation. As these are the correlations with the feature ‘Instantaneous Rate of Change’, you can see that the correlation of ‘inst_rate_of_change’ is 1.0. It is perfectly correlated with itself. I have eliminated many features with low correlation (meaning 0, not -1) just to make this more readable. This, of course, is because if correlation is close to zero, there’s likely little information about the target (Instantaneous Rate of Change of Confirmed Cases – i.e., today’s Case Growth Rate). However, if the number is between 0.2 and 0.8, I find from years of doing this that there’s enough dependence between the target and the feature to make the case that they are related in an interesting way. Statisticians like to say (probably too often), “Correlation does not Imply Causality” — which is true — but this does not mean that correlation is not valuable as the basis for hypothesis tests for causality. That’s what we’re trying to do here… find environmental factors that might be influencing the different Case Growth Rates across the world.
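For anyone who wants to try the mechanics, here is a minimal sketch of the approach with pandas. The column names and every value in the toy table are invented for illustration, and the 0.2 cutoff is just the rule of thumb mentioned above; the real dataset has one row per country and many more feature columns.

```python
import pandas as pd

# Toy data: one row per country, one column per feature.
# All values are invented for illustration only.
df = pd.DataFrame(
    {
        "inst_rate_of_change": [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],
        "female_smoking_rate": [5, 9, 12, 18, 20, 27],
        "tb_incidence": [300, 250, 220, 100, 90, 40],
        "unrelated_feature": [1, 3, 2, 2, 3, 1],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

target = "inst_rate_of_change"

# Pearson correlation of every column with the target column.
corr_with_target = df.corr()[target]

# Keep only features whose correlation is not close to zero.
selected = corr_with_target[corr_with_target.abs() >= 0.2].sort_values(ascending=False)
print(selected)
```

The target shows up with a correlation of 1.0 against itself, the near-zero feature is dropped, and the sign tells you whether a feature moves with or against the case growth rate.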
Is there Anything New Here? Yes, the correlations continue to change as the Case Growth Rates change across the world. By definition, I’m correlating these factors with the current day’s instantaneous slope so the correlations should continue to change. What we’ve been seeing lately is that as the slopes continue to increase across the world the Female Smoking Rate continues to increase in its correlation with the target. I think what this indicates is that the countries with the most severe slopes (Italy, New York, Spain) are probably being hit harder by women who smoke having a higher likelihood of contracting a measurable COVID-19 case. I use the word measurable intentionally here, because these rates are probably driven by countries who are only measuring cases where people have symptoms and require some sort of care. This makes this correlation probably more like a correlation with symptomatic case rates. A subtle point, maybe. One other factor that continues to increase is the negative correlation between case growth and rates of Tuberculosis in a country. This tells us that countries with lots of TB cases have slower COVID-19 case growth rates. This was mildly puzzling to me until 2 days ago when I learned of a study showing that a TB vaccine called BCG may have anti-COVID properties (I’m summarizing broadly. Here’s the link). So that’s pretty exciting to see… even this simplistic approach may have revealed something using Data Science that was not widely known.
Correlation with Rate of Deaths – 4/8/2020
Above is the correlation of the same factors as above with the Rate of Deaths from COVID-19. Note that some of the features that are highly correlated with the Rate of Contracting the Disease are less correlated with the Rate of Deaths from the Disease. This is probably not counter-intuitive. What might be counter-intuitive is that comorbidities like Diabetes rates in a country are negatively correlated with the COVID Death Rates. All I can decide is that it might take reframing the reference point. We’re aware that diabetes, high blood pressure, etc., are contributing strongly to the deaths of individuals who are infected with COVID-19. However, this study is about countries who have high rates of Diabetes, High Blood Pressure, or Air Pollution and the correlation of those factors with the Death Rate. Therefore, it is possible that a country with high rates of Diabetes, for instance, has fewer people who survive that disease long enough to be affected by COVID-19. Perhaps this is a sign that the advanced health care in some countries might be contributing to the numbers of deaths, largely because susceptible people are living longer in those countries? Or perhaps this is just measuring the fact that countries with high rates of diabetes or pollution have yet to be hit by COVID-19? Time will tell.
I’m posting this table most every day now so people can see the changes in the numbers from day to day. Only four states had over 100 deaths yesterday and most states are seeing their case and death rates taper off a bit. European countries are also seeing case growth slow.
Iceland Update
Cumulative Flow Diagram for Iceland, 4/7/20 data
The chart above is similar to the ones I posted yesterday for Germany and others. The recovery data is being reported again and looks very reasonable. Looking at this as a cumulative flow diagram, we can see that Iceland is maintaining a 14-16 day cycle time for clearing new cases. This is obviously being largely driven by their scientific sampling techniques where they’re getting people who are sick tested and into quarantine right away. In nearly every case in Iceland, the recovery is occurring after the 14 days of isolation/quarantine. Looking at the stats below (from Iceland’s COVID-19 portal) this shows a 2.4% hospitalization rate with only 1/3 of those hospitalized needing to go into the ICU. Interestingly, over half of their cases are diagnosed while in quarantine. There were reports from a few days ago that I haven’t seen the raw data on that indicated that only 1% of those tested were coming back positive and that 50% of the confirmed infections were asymptomatic. This doesn’t quite make sense based on the data below, so more study may be needed. Still, what Iceland represents is a society that understands how COVID-19 is truly spreading and who is able to take steps more quickly than any other country to respond to an infection.
Summary: When thinking about what the real infection rates, hospitalization rates, and death rates might be, Iceland provides one of the only scientific answers. Keep in mind that Iceland’s numbers might not reflect the numbers from other countries completely due to the fact that they’re at a high latitude where most countries are reporting lower numbers. But the fact that they have over three times the number of active cases per 1000 people that the US currently has (4.4 per 1000 compared to 1.2 per 1000) while having nearly no deaths is very interesting.
I’ve been posting the above data pretty much each day because it tells a pretty solid story of what is happening in the US. Trends toward lower case rates and death rates continue from about a week ago when these trends started to become noticeable. Now even New Jersey is showing a slowdown in their death rate (note that their acceleration over the last 3 days is negative right now). Many of the non-NY area states are showing death rates that are essentially linear. Hoping this is a long-term trend and not just a temporary one.
Recovery Data
A while back I showed what a Cumulative Flow diagram looked like in the manufacturing world. This diagram can also be generated using Confirmed Case data and Recovery data. For a while JHU dropped their recovery data when (I think) they figured out that some countries were gaming that data to make their societies look better (Iran, China, Vietnam…). But since Europe is entering into the recovery portion of their initial outbreak, I thought I’d show a couple of European countries that are doing well (and whose data is trustworthy).
Cumulative Flow Diagram – Switzerland
Switzerland has been a curious case for a while as it is sandwiched in between countries with high death rates but managed to keep its own death rates down. Here we see the raw numbers for their confirmed cases, recoveries, and deaths in one diagram. Things to note:
When viewed on the same scale as cases, deaths are very small. Keep this in mind should you be tempted to panic. The vast, vast majority of cases do NOT end in death.
The horizontal distance between the orange confirmed cases line and the green recoveries line at any point is the cycle time for the outbreak. So go on the y-axis to the 5000 case mark and go over to the right on the diagram until you hit the orange line. You can see that the 5000th case happened on 3/20/2020. Now continue on the same horizontal line to the green line. The 5000th recovery happened on 4/3/2020. This shows that the cycle time to “clear” 5000 cases is 14 days. You may remember that early on we were seeing (from China and Singapore data) a cycle time of about 22 days. I wonder if we’re seeing better cycle times in Switzerland due to advanced medical care?
The vertical distance between the lines is the number of active cases at any point. So do the same exercise that you did for the cycle time, except this time go to 4/3/2020 on the x-axis. Trace up to the orange line. On that day you can see there had been 20K confirmed cases in Switzerland. Now go back down to the green line. There were 5K Recoveries that had occurred by 4/3. That means there were 15K active COVID-19 cases in Switzerland on 4/3. In the manufacturing world, we call that the “Work in Progress” or WIP. This is an indicator of how much work (in this case, COVID-19 active cases) is in the pipeline.
I suspect we will see the cycle time shrink a bit over the duration of this outbreak as hospitals become more productive. I also expect that we’ll see the active cases at any time shrink down too, because the virus will run out of easy targets. Both of these are good things to watch for, because they’ll indicate that we humans have asserted a bit more control over the situation.
Note that the slope of the confirmed case line is decreasing. This is happening across Europe. Maybe this is a sign that the worst portion of their outbreak is behind them.
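The two readings described above (horizontal gap for cycle time, vertical gap for active cases) can be sketched in a few lines of Python. The data points are only the ones read off the Switzerland chart in the text, and the helper name is mine. Like the text, this sketch ignores deaths when computing active cases.

```python
from datetime import date

# Cumulative counts read off the Switzerland chart as described in the text.
confirmed = {date(2020, 3, 20): 5_000, date(2020, 4, 3): 20_000}
recovered = {date(2020, 4, 3): 5_000}

def first_date_reaching(series, threshold):
    """Earliest date on which the cumulative count reaches the threshold."""
    for day in sorted(series):
        if series[day] >= threshold:
            return day
    return None

# Horizontal gap between the lines: cycle time to clear the first 5000 cases.
cycle_time_days = (
    first_date_reaching(recovered, 5_000) - first_date_reaching(confirmed, 5_000)
).days

# Vertical gap on a given day: active cases (work in progress).
day = date(2020, 4, 3)
active_cases = confirmed[day] - recovered[day]

print(f"cycle time: {cycle_time_days} days, active cases: {active_cases}")
```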
Here’s one more CFD. This time for Germany. Note the same trends.
I have been tracking this data daily since I first started to notice that there was a big difference in deaths between the Latitude ranges 40-50 and 50-60. I have blogged about this a couple of times, trying to understand why this might be. The range from 50-60 N. Latitude has about 80% of the Confirmed Cases per 1000 people as the hard-hit 40-50 range but only about 38% of the deaths. Countries in this 50-60 range include some of the ones who have been notable in their low numbers of deaths, such as Germany, Great Britain, Netherlands, Denmark, and Belgium. Many of these countries already appear to be “flattening the curve” on the first outbreak wave at least.
One of the big insights on death statistics from Europe at least is that 95% of the deaths across Europe from COVID-19 are in the over 65 demographic. This number appears to be more like 80% in the US, which may be part of the reason that the US death numbers are much lower than Europe. However, the 95% seems to be consistent across all European countries, so that helps us understand the demographics of the outbreak a bit better. This is a challenge, because there’s no consistency to data collection about age demographics. Most places aren’t even capturing it. It’s also hard to determine the age demographics that exist in each country. The European Union has great statistics on demographics, but doesn’t collect the actual breakdown across ages. So one thing we have going for us is that the World Bank captured the percent of the population over the age of 65 for every country on Earth. Blending this data into my COVID-19 dataset allows me to evaluate the population over the age 65 in each of these latitude belts.
Percent of Population over 65 by 10 degrees of Latitude
Looking at the chart above raises some interesting questions. First off, we note right away that the latitude region with the greatest number of deaths (by far) does NOT have the largest number of over 65 people as we would have expected. The second interesting question is why the numbers are so much lower in the other latitude ranges (especially those above 60 degrees North and below 30 degrees North). My suspicion is that the reason is that those other latitudes have environments that are less suitable for health in old age. North of 60 degrees is really bitter cold. Maybe people move to Florida from there when they hit 65? That’s the most positive interpretation at least… Below 30 degrees has another issue (many other issues, really), namely malaria.
The COVID-19 Zone and the Malaria zone
I haven’t even opened up the latitude ranges to those south of the Equator yet, but you can see by the wide malaria zone that I don’t need to. Malaria has the greatest prevalence between 30 degrees North latitude and about 20 degrees South latitude. Here’s a great link from the Malaria Atlas Project that shows this well. I have roughly pictured it above with the wide blue bar. Notice that there’s no overlap between the malaria zone and the red COVID-19 zone? A few possibilities exist:
Malaria is taking susceptible people before COVID-19 can get them?
Something about the environment where malaria flourishes is not ideal for COVID-19?
Wait and see, maybe time will change the COVID-19 band?
Back to our discussion about the prevalence of deaths in the COVID-19 zone. We now know that the 40-50 and the 50-60 bands have roughly the same numbers of people over 65 per capita. So how do we explain the greater number of deaths in the 40-50 band? It must be that many more people over 65 are dying in this region.
Percent of over 65 population that have died so far from COVID-19 (4/6/2020) by latitude band
Looking at the chart above (calculated using the 95% of all deaths coming from the over 65 age group) you can see that when the over 65 deaths are divided by the number of over 65 folks in each region, the 40-50 band has already lost .05% of their over 65 population. The comparison to the other regions is stark.
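The calculation behind this chart can be sketched as follows. The function name and the example inputs are hypothetical (chosen only so the arithmetic is easy to follow); the 95% share of deaths in the over 65 group is the European figure discussed above.

```python
def pct_over65_lost(total_deaths, population, pct_over_65, over65_death_share=0.95):
    """Percent of the over-65 population lost, given total deaths in a band."""
    # Deaths attributed to the over-65 group (95% by default).
    over65_deaths = total_deaths * over65_death_share
    # Number of people over 65, from the percent-of-population figure.
    over65_population = population * pct_over_65 / 100
    return over65_deaths / over65_population * 100

# Hypothetical band: 10M people, 19% over 65, 1000 deaths so far.
example = pct_over65_lost(total_deaths=1_000, population=10_000_000, pct_over_65=19)
print(f"{example:.3f}% of the over-65 population")
```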
Why?
I’ve been thinking about why this might be happening for a few days. This is a great puzzle to me. Here are my thoughts:
This might be coincidence. Perhaps having both Italy and Spain in the band is skewing the numbers to the bad? Maybe. But what if there is an underlying cause or sets of factors that influenced the outcomes in Italy and Spain? It’s hard to separate the two yet.
This might be purely related to environmental factors in this region. There’s evidence that influenza virus transmission is heavily modulated by temperature and humidity. What if the COVID-19 virus’ transmission is far less effective in certain environmental conditions?
This might be related to the success of health care strategies in the 40-50 and also 50-60 bands. Many of these countries have single-payer, Government-sponsored care. Perhaps this has been effective at extending the lives of susceptible people? Also, the lack of malaria outbreaks (as well as other tropical diseases) in the 40-60 regions has probably also extended the lives of many.
There may be social reasons (diet, family size, elder housing, etc.) why these two regions are hardest hit (and why 40-50 is the hardest overall). This might explain why people over 65 are so much more likely to succumb to COVID-19 infections in the 40-50 region than in any other.
Here’s my favorite. I have zero evidence for this. But… what if this virus isn’t as novel as we think and 50+ years ago a similar virus ran through the 50-60 region, where it was effective due to a seasonal environmental differential. What if there was some small built-in immunity in older people in this region that is now protecting them from COVID-19?
State Data Table from 4/5/2020 – Death Slopes going linear for lots of States
It’s probably too early to project, because this could just be a temporary leveling-off. However, outside New York and Louisiana, most states’ death rates have been linear for 2-3 days (see dIROC deaths column). Zero is perfectly linear (the slope hasn’t changed in 3 days) but .001 is awfully close. Only New York is still an exponential curve at this moment in time. See some state charts below. And linear is good, because if the trend becomes linear, then the outbreak is very predictable and much more manageable.
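To make the "zero means linear" idea concrete, here is a rough sketch of a slope-change calculation. The exact formula and normalization behind the dIROC column are not spelled out here, so treat the function below as an assumption about the general idea (change in the daily slope over a 3-day window), with invented series as inputs.

```python
def slopes(cumulative):
    """Day-over-day change in a cumulative series (the daily rate)."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]

def d_iroc(cumulative, window=3):
    """Latest daily slope minus the slope from `window - 1` days earlier."""
    s = slopes(cumulative)
    return s[-1] - s[-window]

# Invented cumulative death counts for illustration only.
linear = [100, 110, 120, 130, 140]   # constant 10 deaths per day
growing = [100, 120, 160, 240, 400]  # daily deaths double each day

print(d_iroc(linear), d_iroc(growing))
```

A value of zero means the slope has not changed across the window, which is the linear case; a growing positive value is what an accelerating curve looks like.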