Quick Explanation of Methodology
The CDC Wonder Database allows one to search for total deaths by all types. The data is very detailed but it isn’t recent. In general the newest data in Wonder is 2 years old. Knowing that 2017 was a “high death” year due to large numbers of flu deaths and that 2018 was a bit below average, I decided to take these two years and average the deaths as my baseline to compare to 2020 data. The data from Wonder can be aggregated across regions (I chose States) as well as by demographics (I chose age in 10 year groupings).
The 2020 provisional death data put out by the CDC can also be grouped in similar ways (states and 10 year groupings). Plus, in addition to providing COVID-19, Influenza, and Pneumonia deaths, it also provides total death numbers for these groupings. This allows for an easy comparison. It is unclear how CDC arrives at these numbers, but they don’t seem to be extremely laggy and they line up more or less with the numbers from Johns Hopkins. Here’s a picture of the website where you can pull the data. As you can see, the claim from the CDC is that the data is as of 10/14.
Since the year is still not over, I’m doing a very simple scaling assuming that the death rate will continue at the current rate for the rest of the year. This isn’t a solid assumption, but I don’t think it matters much. Since we’re in October, 10 months along, I used a scaling factor of 1.2. Back in August (when the data was lagging a bit) I used a scaling factor of just under 2, accounting for 7 months of data.
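The baseline and scaling arithmetic described above can be sketched in a few lines. This is only a minimal illustration, assuming a simple 12/months scaling; the function names and the death counts are hypothetical and are not taken from the CDC data.

```python
def annualization_factor(months_of_data: float) -> float:
    """Scale factor that stretches a partial-year count to a full 12 months."""
    return 12.0 / months_of_data


def excess_death_percent(deaths_2017, deaths_2018, deaths_2020_ytd, months_of_data):
    """Projected 2020 deaths as a percentage of the 2017-18 average baseline."""
    baseline = (deaths_2017 + deaths_2018) / 2.0
    projected_2020 = deaths_2020_ytd * annualization_factor(months_of_data)
    return 100.0 * projected_2020 / baseline


# Hypothetical counts for one state/age group (illustrative, not real CDC data).
print(annualization_factor(10))  # 1.2, the October scaling factor
print(excess_death_percent(1050, 950, 900, 10))
```

A result above 100 means the projected 2020 total exceeds the 2017-18 average for that state/age group, and a result below 100 means it falls short.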
Changes from July/August Data
Just to cut quickly to the chase, I noticed a number of changes from my last post on excess deaths from August.
You can see the data yourself in the table below (sorted by 2020 excess death percentage). Yellow indicates a state/demographic pair that has low COVID/flu/pneumonia impact (around 15% or less) but still has high excess deaths.
I also showed an overall histogram of excess deaths in my last post. A histogram is a type of chart that counts the samples that fall into each bin. In this case, each sample is a state/demographic pair and the histogram is plotted over 80 bins that range from around 10% of 2017-18 deaths up to around 150% of 2017-18 deaths, so each bin represents roughly 2%. The peak of the histogram is where about 60 state/demo pairs fell into a bin at around 90%. If you treat this peak as the mean and the histogram as a rough bell curve (normal distribution), then using this method and the CDC’s 2020 death projection numbers, the overall excess death distribution for 2020 has shifted to the left since August, when the peak value was in the bin that represented 110% (go back and look… don’t take my word for it!). This also makes sense knowing that the high death rates from April through June have slowed.
Since I was curious, I wrote code to plot the histograms for each age demographic to see how they related to each other. It’s a bit messy, but you can see in the legend which colors correspond to which demographic. Key takeaways from this visualization are that 1) 35-44 has been hardest hit, followed by 25-34, at least on an excess death percentage basis, 2) 65-74 seems to be slightly below the 100% which would represent the 2017-18 average, and 3) 5-14 and 15-24 have fewer deaths than the 2017-18 average.
Highly Reported-on CDC Excess Death Pre-print (from 10/20) – take it with a grain of salt.
On October 20th a CDC scientist released a pre-print that the CDC published here. The assessment of the authors, based upon their simulation, is that there were 299K excess deaths in the US during 2020. Of course, this was immediately picked up by our fearless media. In many cases, they reported on the pre-print incorrectly because the statistics in the pre-print go a bit beyond those of a newspaper data scientist. Actually, the statistics in the pre-print are a bit muddy and don’t seem to line up in places, so I can’t blame the news folks much. I might write a longer report on this paper if I get time, but I’m not confident in their simulation’s assumptions on a typical year-to-year death growth rate, and they don’t account for deaths that didn’t occur because a sick person died of COVID first. Their overall numbers also don’t match the ones that CDC publishes in the provisional 2020 death numbers, which is problematic. I took a stab at replicating their model with a much simpler and more reasonable regression model than the one they selected, and their 299,000 number (compared to the 2015-2019 average) appears to represent expected growth in deaths, not excess deaths (see chart after conclusions). We’ll have to wait for the actual paper to come out with all the details, I guess. Of course, the Washington Post didn’t wait.
Conclusions
It is tough to make any solid projections based on ANY COVID-19 data. It is always possible that the CDC’s data is inaccurate (it usually is… these kinds of things are infamously hard to measure). And clearly 2020 is a unique year for deaths. It isn’t clear from the CDC’s data that COVID-19 has created significant excess deaths, however.
The really serious question is about the real excess deaths that haven’t slowed down in the younger demographics. This problem is not coming from deaths due to COVID but is likely related to the anxiety and stress created by COVID, by government actions that are aimed at reducing or eliminating COVID cases, by isolation, etc. Unfortunately there is a lot of evidence coming out that these governmental actions haven’t been exceptionally effective (a quick look at COVID case rates across various new government actions shows that they haven’t had very measurable impacts). The other takeaway is that excess deaths for ages younger than 15 have been much less than 2017-18 averages. The combination of being isolated from society (driving in cars less, less exposure to disease, etc.) and the lack of an effect on this group from COVID are likely the cause.
Backup: Tod’s “Simpler” 2020 excess death model
Below we can see the table sorted by the acceleration of the death rate. These are pretty much the only states that are seeing increases of the rate.
Since New York seems to be re-emerging here with above average increases in the Case Rate and Death Rate, here’s their time series plots below, first Case Rate and then Death Rate. The Instantaneous Rate of Change for cases (IROC-Confirmed) is around 1000 new cases per day. For deaths the IROC is about 20 new deaths per day. Both of these values are growing. You can visibly see the Case rate increasing (the cumulative case line is curving upward) but the Death rate increase is a bit too small still to visualize well (but you can see the polynomial fit starting to show the upward curve).
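For readers who want to see how an Instantaneous Rate of Change can be read off a cumulative curve, here is a minimal sketch. It assumes the IROC is the slope of a polynomial fitted to the cumulative counts, evaluated at the most recent day. The third-order degree, the function name `iroc`, and the synthetic data are assumptions for illustration, not the exact code behind the plots.

```python
import numpy as np


def iroc(days, cumulative, degree=3):
    """Slope of a polynomial fit to cumulative counts, evaluated at the last day."""
    coeffs = np.polyfit(days, cumulative, degree)
    slope_poly = np.polyder(np.poly1d(coeffs))  # derivative of the fitted curve
    return float(slope_poly(days[-1]))


# Synthetic cumulative case curve that bends upward (not real New York data).
days = np.arange(30, dtype=float)
cumulative = 5.0 * days**2 + 700.0 * days + 400000.0

print(round(iroc(days, cumulative)))  # approximately 990 new cases per day
```

An IROC that grows from one week to the next is what shows up visually as the cumulative line curving upward.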
Evaluation of Case Growth Over the Last Week
I notice a handful of interesting things in the data this week.
The Challenges of Understanding Case Growth Accurately
The confusing nature of the latest data from the state is something worthwhile to discuss because I’ve noted news outlets (tucson.com is terrible about this for instance) grabbing the latest U of A numbers, interviewing one U of A professor, and then writing a very scary but highly inaccurate article. It’s even worse now since the numbers are smaller and therefore plagued much more by statistical variation. So here are some thoughts about our current state of counting cases to help you understand what might be really happening.
See the confusion matrix above for the case referenced. Right now 2% infection is a high estimate for just about any community we might sample (Arizona State is indicating that 0.4% of their student population is infected right now). If the true number is lower, we get a case where nearly every positive result is false. If you take a moment to digest the diagram, you’ll note that the false negatives are very low (the upper right quadrant) while the false positives are about 1/2 of the total positives (lower left quadrant). This is why, when a disease is rare (like COVID is, despite all the headlines), sensitivity is relatively meaningless while specificity is critical. The Abbott Antigen test’s specificity of 98.5% sounds great, but for a rare event it really means that 1.5% of all the people who don’t have the disease (in our case 980 out of 1000, so about 15 people) will show up as positive. When we only expect a small number of true positive results (in our case, 2% of 1000, or 20), the false positives drown out the signal from the true positives. About 1/2 of the people who are told they have COVID in this example actually do not. Hopefully this helps make my case that the state should NOT be including Antigen test results with PCR test results (which, since they use DNA/RNA testing to evaluate the presence of the virus, have very close to 100% specificity).
Now if you target these Antigen tests in a more focused way, i.e., on a Sorority where you believe a population exists that has a much larger infection rate, then the test will be much more accurate at determining exactly who is infected. This is because there are fewer “well” people to inflate the false positive count. If the true positives are just twice the number of false positives, the test is now much more useful at evaluating who the sick people really are. BUT, if you deploy it across the broader community the way the U of A is, with thousands of tests per day, the false positives will overwhelm the true positives.
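The arithmetic behind this argument is easy to check. The sketch below fills in the four cells of a confusion matrix from prevalence, sensitivity, and specificity. The 98.5% specificity and the 2% and 0.4% prevalence figures come from the discussion above. The 97% sensitivity, the 30% "targeted group" prevalence, and the function name are assumptions for illustration only.

```python
def screening_outcomes(population, prevalence, sensitivity, specificity):
    """Expected confusion-matrix cells and the share of positives that are real."""
    infected = population * prevalence
    healthy = population - infected
    true_pos = infected * sensitivity
    false_neg = infected - true_pos
    false_pos = healthy * (1.0 - specificity)
    true_neg = healthy - false_pos
    ppv = true_pos / (true_pos + false_pos)  # positive predictive value
    return {"TP": true_pos, "FN": false_neg, "FP": false_pos, "TN": true_neg, "PPV": ppv}


# Prevalence values: 2% and 0.4% from the text, 30% is a hypothetical targeted group.
for prevalence in (0.02, 0.004, 0.30):
    result = screening_outcomes(1000, prevalence, sensitivity=0.97, specificity=0.985)
    print(f"prevalence {prevalence:.1%}: share of positives that are true = {result['PPV']:.0%}")
```

Under these assumptions, the share of positive results that are real drops quickly as prevalence falls, and rises sharply when the same test is aimed at a group with a high infection rate.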
We note that the Instantaneous Rate of Change (IROC) of the curve has now dropped to somewhere around 790. The trend is decreasing, however, as you can note about 4 days in a row where the rate appears to be approaching zero. We have three to four days of anomalous data from about 9/2 to 9/4, where the state appears to have been capturing University Antigen tests as confirmed cases. As the U of Arizona learned, at least, many of these Antigen positive results have turned out to be false positives when checked with a subsequent, more accurate PCR test. It appears from the data that 60-70 percent of the Antigen positive results are false positive. Since this realization, the state appears to only be counting the university cases if they’re confirmed with a PCR test. But not doing this for 3 days or so appears to have inflated our case numbers. Enough on that.
Zip Code Case Growth Update
This map doesn’t look much different than the previous week’s case increase map, except that there appears to be a bit higher numbers in Flagstaff (home of Northern Arizona University) and Prescott (home of Embry-Riddle University). But by far, the top two zip codes in case growth over the last week continue to be the homes of the University of Arizona and Arizona State. This is true even though the numbers of cases reported have dropped a bit due to only recording the cases confirmed with PCR.
Table of Zip Codes
The main thing to note here is that the top two are Tempe’s and Tucson’s University zip codes. Snowflake’s showing up as number three is a bit deceptive. They had 11 new cases this week, but they’ve only had 128 cases to date before this week. The 11 might be from one significant spreading event, or it could just be random noise. The 85009 zip code in Southwest Phoenix has been one that has had a handful of case spikes since Memorial day. The 200-ish new cases in that Zip code could be significant, especially since the Mexico-related infections from a month or two ago seem to have slowed significantly.
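To make the small-number caveat concrete, here is a minimal sketch of the weekly growth calculation, assuming growth is defined as new cases this week divided by cumulative cases before this week. The function name is hypothetical, and the second set of counts is invented purely for contrast.

```python
def weekly_growth_percent(new_cases, prior_cumulative):
    """New cases this week as a percentage of the cumulative count before this week."""
    return 100.0 * new_cases / prior_cumulative


# Snowflake figures from the text: 11 new cases on top of 128 prior cases.
print(round(weekly_growth_percent(11, 128), 1))  # 8.6

# Hypothetical larger zip code: the same 11 new cases barely register.
print(round(weekly_growth_percent(11, 4000), 2))
```

The same handful of cases produces a large percentage in a zip code with few prior cases, which is why a small-count zip code can rank high without much actually happening there.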
Conclusion
Data indicates that COVID-19 might be in the process of burning itself out in Arizona. For now at least… It will be interesting to see if the University cases lead to increased hospitalization numbers in their demographic about a week from now (so far, there hasn’t been any change). With this Zip Code approach above, we can also track if the University cases are spreading to adjacent or other Zip Codes.
1. We see two zip codes with growth far greater than any others. 85719 (U of Arizona) and 85281 (ASU) come in at 38% and 23% growth in cases over the last week. The next highest zip code is in Buckeye and comes in at 7.3% growth.
2. Flagstaff comes in around 4.3% growth. Perhaps they party less at NAU, or maybe there are fewer cases at altitude?
3. The below map only shows the top 30 zip codes. Most of these are under 5% growth.
4. Right now I’m doing this to see if the university cases spread outside the university areas. My hypothesis is that they will remain contained and the infection will burn itself out in those zip codes. I’ll be watching this and publishing results about every week. I’m also watching the hospital stats closely to see if the university case growth will result in increases in hospitalization.
UPDATE
Apparently, some of the numbers from the Antigen tests have been false positives. The U of A admitted this, and in doing so it became clear that people with positive Antigen tests are going to the university health center to take PCR tests to confirm. Initially, the state was counting all of the Antigen positive tests as positives overall, but that seems to have stopped. Recall my earlier discussions about specificity and false positives. Any time a test has a specificity of around 97 or 98% and the disease is infecting only about 2-3% of the population, you’re going to have about 1/2 false positives. See the university’s chart below. If my detective work is correct, all 109 Campus Health tests below were on people who had come up positive in previous days/weeks on the Antigen test. If true, then there’s about a 60% false positive rate (which makes sense based upon the possible specificity of the Antigen test and the rate of infection on campus). Will keep watching this, but it seems less concerning than before.
Here’s what the Arizona State University COVID page shows.
Takeaways from this info are as follows:
This above chart is a bit different because it shows the cumulative number of COVID-19 cases in the state for each age demographic divided by the total population in the state of that age group. This allows us to see how COVID-19 is really affecting the different age groups. A few things that are interesting…
1) The true rate of infection for 3 groups is pretty much the same. The 20-44 age group always has the most cases by raw number, but when you consider there are more of them than any other group, you can then see that they’re not excessively affected compared to other groups.
2) The 65+ group has close to 1/2 as many cases as the top groups. This makes sense because I’d imagine that many of them are being more careful due to the severity of the disease for that group.
3) The under 20 group is much less likely to get infected. This may partially be because a good number of people in this group aren’t economically active and schools have been closed. Or maybe their immune systems are better tuned to the disease and they never show symptoms. Remember these cases are confirmed by tests, so there may be many people who never show symptoms and never get tested who have been infected.
4) I’m very surprised at the lack of effectiveness of the state measures taken in late June. Pretty much every county in the state issued proclamations requiring face masks in public, and the economy was essentially closed again. Still, we see no impact on cases for basically 6 weeks, then all of a sudden all the age groups show a marked decrease (the red vertical line). I truly expected the state measures to show a dramatic effect in 2-3 weeks (since the cycle time of the disease is about 18-21 days). Very strange, but similar to what has been seen in other regions. Sweden (see below) had a sharp downturn in cases just like this and they had very few state measures taken. Makes me curious about what is really causing the rates to make such sudden changes.
5) Testing: The chart below tells the testing story. People have noted to me that media outlets are suggesting that falling case numbers have to do with the decreasing numbers of tests. The way I look at this data is this: First off, testing in Arizona is neither strategic nor random. People get tested because they feel sick or they work in jobs where there’s a high probability that they could be sick. This means the number of tests conducted has a high severity bias. So what this data might be telling us is that every day there are fewer people who feel sick and decide to go get tested, and that an even lower percentage of these people are actually confirmed positive with COVID-19. This seems to indicate to me that the decreasing case numbers are probably legitimate.
Results
First, doing the work to connect this data resulted in some interesting insights. Below I show the state demographics sorted by the Excess Deaths in 2020 and we see some surprising things.
What does the table reveal? First off we see that the demographics that have the highest number of excess deaths in 2020 compared to the 2017-18 average are the older demographics from DC and New Jersey. This makes sense due to the large numbers of deaths per capita in these states. We also note from this data that there are clear gaps in the CDC data because we’re not seeing New York at the top of the excess deaths list. Right now the CDC data for 2020 seems to only have about 1/3 of New York’s deaths captured. This is a big liability with using CDC data…
Another interesting thing to note is the rows with yellow highlighting. These are all demographics in states that have had very little COVID-19 death impact compared to the 2017-18 baseline. However, they still have a high Excess Death number. There are many reasons why this might be the case, but I’m suspicious because most of these state/age demographic groups are also at high risk from suicide. I wanted to check this by looking at 2020 suicide statistics, but apparently no one has this data. The most recent suicide statistics you can find are in 2018 CDC data.
Histogram of Excess Deaths
Now I want to evaluate what the distribution of excess deaths looks like across all demographic groups in all states. This will give us an overall sense of the probability of having excess deaths in 2020. I do this with a histogram. See diagram below.
This histogram shapes up to look a lot like a Gaussian Distribution with a mean around 110% and a standard deviation of roughly 15%. This means roughly 70% of our demographic groups in the country are projected to have excess deaths ranging from 95% of the 2017-18 baseline all the way up to 125% of the baseline. This indicates to me that yes, 2020 is a worse year for deaths. Based off the data in the table above, we can safely assume that in many regions this is due to COVID-19. The data shows that for some states and their older demographics, COVID-19 is projected to exceed the 30% of total deaths that heart disease consistently accounts for.
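The "roughly 70%" figure follows directly from the Gaussian assumption, and it can be checked with the standard library. This is a small sketch that takes the mean of 110% and standard deviation of 15% quoted above at face value; the helper names are hypothetical.

```python
import math


def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def fraction_between(lo, hi, mean, sd):
    """Share of a normal distribution that lies between lo and hi."""
    return normal_cdf(hi, mean, sd) - normal_cdf(lo, mean, sd)


# Share of state/age groups expected between 95% and 125% of the baseline.
print(round(fraction_between(95, 125, 110, 15), 3))  # 0.683

# Share expected above the 2017-18 baseline (more than 100%).
print(1.0 - normal_cdf(100, 110, 15))
```

The first number is the familiar one-standard-deviation band of a normal distribution, which is where the rounded 70% comes from.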
Notes:
Conclusion
I got the idea from a post on LinkedIn that compared Sweden’s deaths with those in the US, and it was really surprising given the constant media denigration of Sweden and their modified lockdown strategy. As the data shows above, despite not locking down their under-65 population, Sweden has to date had very few deaths under age 65, fewer than regions where lockdowns of the under-65 populations were intense (and in Arizona’s case, happened twice). This comparison also made sense due to population size similarity between the regions (Arizona is about 7.3M, Sweden about 10.2M, this part of NYC is about 8.4M). Another interesting datapoint on Sweden’s unique management of the COVID outbreak is pasted below. Around early July the case rate adjusted sharply and now new case growth is a very small number per day. It is interesting that the case growth slowed so quickly, especially in light of their strategy to not close schools, restaurants, etc.
Population Density
One of the persistent questions about this comparison was whether it had merit since NYC is much more dense than Sweden and Arizona (I assume that’s true, but haven’t looked at the numbers). So since NYC is more dense, it makes some intuitive sense to us that the density factor may account for a greater number of deaths. Does it?
Correlations of different societal and geographical factors with COVID-19 Cases and Deaths have been one large area of interest of mine through this outbreak. I have reported on this in this blog multiple times as the outbreak has spread. In the past, I observed that population density is slightly correlated with case count across the globe but is basically uncorrelated with deaths. Does this still hold today now that the virus has spread to new places?
Correlation of Various Factors with Normalized COVID-19 Death Count
Note that the factor most positively correlated with Deaths in a region is the number of Cases in the region normalized by the population. This is followed closely by the Instantaneous Rate of Change of Cases (the slope of Case Growth). You would expect this to be the case, but it’s a bit surprising to see that the number of cases in a region is only just over twice as correlated with deaths as the Body Mass Index mean for males! This would also indicate that there are regions where the BMI of the population has had more of an impact on deaths than the case count in the region. As evidence that high case count does not always lead to high deaths (and conversely that lower case counts can lead to high deaths), see the chart of Arizona counties, where we have results all over the board. The counties with the highest death rates are generally the ones with lowest population density and highest pre-existing morbidities. Some counties (cities) have very high case counts and low to moderate deaths. Other counties have low case counts and high deaths. It’s all over the map.
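For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of a correlation calculation. It assumes the plain Pearson correlation coefficient, which may not be the exact measure used for the chart above, and every number in it is made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-region values (not real data).
deaths_per_100k = [12.0, 30.0, 8.0, 55.0, 20.0, 41.0]
cases_per_1000 = [4.0, 11.0, 6.0, 15.0, 5.0, 13.0]
density_per_km2 = [900.0, 40.0, 300.0, 120.0, 15.0, 2500.0]

print("cases vs deaths:  ", round(pearson_r(cases_per_1000, deaths_per_100k), 2))
print("density vs deaths:", round(pearson_r(density_per_km2, deaths_per_100k), 2))
```

Values near +1 or -1 indicate a strong linear relationship and values near 0 indicate little or none, which is the scale on which the factors above are being ranked.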
Arizona COVID-19 Stats by County
Conclusion
I did the original assessment to compare what has happened in Sweden vs. other regions largely because of the negative media attention that Sweden has received for their COVID-19 lockdown strategy. As it turns out, for populations under 65 (the ones who were actually not on lockdown) there have been very few deaths (but lots of cases). This is surprising considering that in Arizona and NYC, government interventions such as lockdowns, closing businesses, and mandatory face masks have been credited with slowing the growth of the outbreak. There are many surprising things I’ve noticed through this time of COVID. I point out a few others in this post regarding the unintuitive role population density plays in COVID-19 deaths, as well as the observation that the correlation of COVID deaths with high COVID case counts is much smaller than we would have guessed (I would have suspected 90% or higher correlation).
Overall what does this show us? Our intuition is not necessarily to be trusted and should be assessed more critically using data rather than prior beliefs. The same applies to media reports, which tend to only show data in support of a pre-existing narrative.
Methodology
First, I’ll explain what a histogram is. Whenever you have data that falls into a certain range, say 0 to 10, you can take a count of the number of examples of that data that fall into bins within that range. The simplest way to bin this 0-10 range would be 0-1, 1-2, 2-3, and so on. This would give you 10 new ranges as your bins. Counting the number of examples in your data that fall between 0 and 1 gives you the number in the y-axis of the histogram (the bins become the x-axis). For many processes, we may see the histogram form that looks like a Gaussian (or bell-shaped) distribution with low numbers in the bins towards the edges and high numbers of counts around the mean (say 4-5 or 5-6). The histogram then gives us a sort of probability distribution if done correctly that can tell us a lot about the process we’re measuring.
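The binning just described can be written out in a few lines. This is a minimal sketch using the 0 to 10 range and ten bins from the explanation; the function name and sample values are made up for illustration.

```python
def histogram(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value exactly equal to hi is placed in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical sample clustered around the middle of the range.
sample = [0.5, 1.2, 3.3, 4.1, 4.7, 5.0, 5.5, 5.9, 6.4, 9.8]
print(histogram(sample, 0, 10, 10))  # [1, 1, 0, 1, 2, 3, 1, 0, 0, 1]
```

Each entry in the printed list is the height of one bar: the first is the count for 0-1, the second for 1-2, and so on.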
So below you’ll see a histogram where I have bins that each represent 2 Cases per 1000. This covers a range up to our highest COVID Cases per 1000 number (around 140). As you can see, the highest counts cluster in the bins toward the left side of the chart. This resulting histogram (the gray bars) looks like the discrete Poisson distribution and the shape of the distribution can be modeled as an exponential decay (the red line). This is pretty interesting because I’ve found that the slope of cumulative case growth is best modeled with a third order polynomial, but the exponential decay is a much steeper slope than a polynomial. I’m curious about what this might be indicating, but this is the same type of process as radioactive decay.
The formula for this exponential decay is y = a*exp(-b*x) + c, where a represents the original amount, b represents the rate of change (the negative sign in the exponent is what makes this a decay), x in this case represents the cases per thousand, and c is a constant. The b parameter is a measure of the steepness of the curve, so it is interesting to see how b changes over time.
You can see the values of a, b, and c in the upper right of the graph below. This is the most recent histogram. We can see that there is a steep decay down to the asymptote where we see counties with more than 60 cases per 1000 to be somewhat of a black swan event.
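For readers curious how a decay curve like this can be fitted, here is a minimal sketch. It assumes the common form y = a*exp(-b*x) + c with a positive decay rate b, and it simplifies the problem by treating c as known so that a straight-line fit on the logarithm of the counts is enough. The function name and the synthetic bin counts are hypothetical, and this is not necessarily the fitting routine used for the charts.

```python
import math


def fit_exponential_decay(xs, ys, c=0.0):
    """Fit y = a*exp(-b*x) + c by least squares on ln(y - c), with c taken as known."""
    points = [(x, math.log(y - c)) for x, y in zip(xs, ys) if y > c]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_z = sum(z for _, z in points) / n
    slope = (sum((x - mean_x) * (z - mean_z) for x, z in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_z - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic histogram: bin centers every 2 cases per 1000, counts decaying at b = 0.08.
bin_centers = [1.0 + 2.0 * i for i in range(30)]
counts = [600.0 * math.exp(-0.08 * x) for x in bin_centers]

a, b = fit_exponential_decay(bin_centers, counts)
print(round(a, 1), round(b, 3))
```

On real histogram counts, empty bins have to be skipped or handled some other way, and a nonlinear least-squares fit that also estimates c would be the more general approach.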
Now we’ll look at the histogram from 2 weeks earlier on 7/18. As you can see the b value is a bit higher, which makes the slope a bit steeper.
Here’s the histogram from 7/4, one month earlier than the top chart.
And the histogram from 6/4.
And finally 5/4
Conclusion
Overall, what I note in this data is that the probability of counties with large numbers of cases per 1000 is increasing over time. The trend in the steepness of the exponential decay curve that fits these Poisson distributions is that it seems to halve every month. This is also an exponential decay signal in itself. Interesting…
However, there does appear to be some fundamental limiting factor based on the total number of cases in the country. The exponential distribution has a finite variance, which limits surprising “black swan” events in the tails of the distribution. The fact that the counties with large numbers of normalized COVID-19 cases are rare and that this trend follows this distribution and is best fit with an exponential decay curve indicates that the system that generates COVID-19 cases in counties (a system which includes natural and geographical features, societal control features, and cultural elements) naturally limits the cases. At least this is what the data has shown so far.
Update – 9/6
The peak of the histogram has shifted to the right as more and more counties have experienced COVID case growth. However, the exponential fit of the slope (-b) from the peak downward is still in the same ballpark as it was a month ago. What does this indicate? I’m not completely sure, but it seems like the fundamental nature of the ecosystem (the world, the US, political systems, etc.) that generates “cases” remains consistent. Outlier counties in normalized case count are still very rare.