todnewman.com

July 20, 2020 (updated July 21, 2020)

COVID-19 California Update

I saw an interesting article in the LA Times that indicated that the suburban counties of the LA area were seeing case growth higher than LA county. This seemed interesting, because the last time I looked it was nowhere near the case. I investigated, and thought it might be good to write about, because it’s another example of the lack of data science knowledge in the people that are influencing our opinions with their reporting.

The actual Data from California

Here’s the standard US State data table I use. These are the top 14 counties in California sorted by the instantaneous rate of change of their Confirmed Case curve. This is a measure that allows us to find the slope of a curve on any given day. I prefer this approach to averaging the last 7 days of cases, which is what a lot of the media are doing. This post will point out why the averaging approach can be misleading.
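As a rough illustration of what “instantaneous rate of change” means here, the slope of a cumulative case curve can be approximated with a finite difference. The post doesn’t state exactly how the slope is computed, so the central-difference method, the function name `slope_at`, and the sample numbers below are all assumptions made for the sketch.

```python
def slope_at(cumulative, day):
    """Approximate the slope (new cases per day) of a cumulative curve at `day`.

    Uses a central difference where both neighbours exist and falls back to a
    one-sided difference at either end of the series. This is an assumed
    method, not necessarily the one used for the table.
    """
    if len(cumulative) < 2:
        raise ValueError("need at least two data points")
    lo = max(day - 1, 0)
    hi = min(day + 1, len(cumulative) - 1)
    return (cumulative[hi] - cumulative[lo]) / (hi - lo)


# Hypothetical cumulative confirmed cases for one county, one value per day.
cases = [100, 130, 170, 220, 280, 350, 430]

print(slope_at(cases, 3))  # slope in the middle of the series
print(slope_at(cases, 6))  # slope on the latest day (one-sided)
```

Ranking counties by this number (after dividing by population) gives a "current trend" ordering that doesn’t depend on how many days are averaged together.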

So What Do We See Here?

First, it seems like Riverside and San Bernardino Counties both have a higher Case rate of change than LA county. You can see that by their ranking. LA County still seems to have a higher rate of change than Orange county, but only by a little bit… We can also see that LA county has a pretty high cumulative number of cases per 1000 persons. This is measured since the outbreak started, whereas the rate of change just measures the current trend. So we would assume that San Bernardino and Riverside counties, which both have lower cumulative numbers than LA county, were probably much lower a while back, but are now catching up.
Cumulative deaths per 1000 are very low across California compared to back East. LA County has significantly more deaths per 1000 persons than Riverside, San Bernardino, or Orange County, but also significantly fewer than Chicago or New York City.
Imperial County is still the most impacted county in California. This is a border region near Mexicali, Mexico, where there is a very large outbreak right now. I have read that Imperial County hospitals were quickly overwhelmed with US Citizens who live in Mexico but preferred to be treated for COVID in the US. Many of these people have been transferred to LA area hospitals. Imperial County has the largest number of infections and deaths per 1000 people in the state and also has the highest case rate of change.

Why Does the Article Indicate that COVID is spreading Faster in Orange County?

The article indicates that the Inland Empire counties and Orange County all rescinded their mask orders and reopened earlier than LA County. The conclusion is that their case growth since then has been due to this. I think this is a fascinating natural experiment, but am not sure if there’s enough information to make this conclusion. Here’s the chart that the LA Times uses to make their case:

This looks pretty compelling at first. As with any chart, you need to read the fine print. They’ve been averaging the number of cases over the last 14 days and LA county has held pretty much between a 14 day average of 300 and 400 since 7/1. You can see that the trends for Orange County and the Inland Empire Counties have both increased faster than LA County. But taking a 14 day average of cases might present some problems. If I had 1000 cases each day last week and 200 cases each day this week, my 14 day average will be (7000 + 1400) / 14 = 600. There’s nothing wrong with that number, but clearly it doesn’t tell the story of my region. The story is that I had really high cases last week and something miraculous happened and my cases went way down this week. You can’t tell that story when looking at a 14 day average. Here’s a picture of what has happened in LA County and Orange County that kind of goes against the LA Times narrative:

Orange County Confirmed Cases from 3/18 to 7/17/20

LA County Confirmed Cases from 3/14 to 7/17/20

See what happened in the Orange County case curve? Starting about Memorial Day, they had explosive growth in Cases, which started flattening around 7/10. There’s a pretty impressive shift in case growth from that day until the present. About half of the 14 day window the LA Times uses to average is “explosive case growth” and the other half is the new rate that they’ve maintained for a week. Therefore, if they had chosen the more standard 7 day average window (or even better, my preferred approach of instantaneous rate of change) then their chart would have told a completely different story. Who knows why the LA Times chose 14 days? Perhaps it was to make a political point or possibly (my preference) they just didn’t have the experience in house with data science to question their numbers and understand how to present data.
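The 1000-then-200 example from earlier in this post is easy to reproduce. The sketch below uses only that hypothetical region (not real LA or Orange County data) and a helper named `trailing_average`, which is just an illustrative name, to show how much the window length changes the reported number.

```python
def trailing_average(daily, window):
    """Mean of the most recent `window` daily counts."""
    if window > len(daily):
        raise ValueError("window is longer than the series")
    recent = daily[-window:]
    return sum(recent) / window


# The hypothetical region from the example:
# 1000 cases a day for a week, then 200 cases a day for a week.
daily_cases = [1000] * 7 + [200] * 7

avg_14 = trailing_average(daily_cases, 14)  # (7000 + 1400) / 14
avg_7 = trailing_average(daily_cases, 7)    # 1400 / 7

print(f"14-day average: {avg_14:.0f}")
print(f" 7-day average: {avg_7:.0f}")
```

The 14-day window reports 600 cases a day for a region that is currently seeing 200 a day, which is the same kind of blending that happens when half of the window covers the explosive-growth period and half covers the flatter one.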

Conclusion

Even though I make this point on the LA Times data science approach, the fact does remain that the suburbs do seem to be catching up to LA county, and this is concerning. Like in AZ, where I have done quite a bit of research into the details of this recent surge in cases, there is probably a very complex story that can’t be simplified into a headline.

Backup – Inland Empire Counties charts

Note that we don’t see the same flattening in Riverside and San Bernardino that we do in OC. Also, the Death curves are below. When normalized by population, LA County and OC are about equal on the Instantaneous Rate of Change for Deaths. Riverside County is the highest for these four counties and San Bernardino has the lowest death rate of change (by a factor of 2).

Riverside County Cases from 3/24 to 7/17

San Bernardino County Cases from 3/26 to 7/17

July 14, 2020

COVID-19 Update: Top Case Rates in the World are All in Very Hot Places!

Table showing Regions ranked by rate of growth of cases per 1000 persons – 7/13/2020

Just a brief post to show who has the largest case rates. Arizona is in the lead. Note that everyone at the top of the list has very hot weather right now! Some, like AZ, Bahrain, Oman, Qatar, Nevada, Chile, and Utah have significant desert regions. Others like Panama, Florida, Louisiana, Texas, and South Carolina are hot and humid. This is probably not coincidence…

July 13, 2020 (updated July 14, 2020)

COVID-19 Update: Arizona Cases, Deaths, Hospitalizations by Age Grouping

I’ve shown raw numbers of cases by age demographic in the past. Here’s a different way to look at it. In all of the charts below I have normalized each age group’s numbers by the total numbers of members of that age group in the state. For instance, the 65+ line in the “Cases per 1000” chart represents the numbers of people in the 65+ category who have been confirmed as having COVID-19 divided by the number of thousands of 65+ people in the state. Therefore, you can see that as of yesterday, there have been just over 20 Confirmed Cases per 1000 persons over 65 in the state. This is the same as saying that 2% of all persons in Arizona over 65 have been formally diagnosed with COVID-19 since the start of the COVID era. This doesn’t mean that 2% have it right now, of course, as these numbers are cumulative.
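The normalization described above is a single division. In this sketch the case count and the population are hypothetical placeholders (the post reports the resulting rate, not the underlying numbers), chosen only so the result lands near the 20 per 1000 figure.

```python
def per_thousand(count, population):
    """Cumulative count per 1000 members of a group."""
    return count / population * 1000


# Hypothetical inputs, invented for illustration.
cases_65_plus = 26_000
population_65_plus = 1_300_000

rate = per_thousand(cases_65_plus, population_65_plus)
percent = rate / 10  # "per 1000" divided by 10 is "per 100", i.e. percent

print(f"{rate:.1f} cases per 1000 = {percent:.1f}% of the group")
```

Applying the same function to each age group with that group’s own population is what puts the curves on a comparable scale.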

Arizona COVID-19 Confirmed Cases per 1000 Age Group Members

Arizona COVID-19 Hospitalizations per 1000 Age Group Members

Arizona COVID-19 Deaths per 1000 Age Group Members

What you may notice in the top chart (“Cases”) is that three of the groups have been tracking together since about 6/11 (note that since I have to collect this data from the state dashboard by hand every day, I only have back to 6/11, the day I started this practice). These groups even seem to follow the same exponential curve. When not normalized by the age group, these three all look much different due to the difference in the populations (there are many more 20-44 year olds). This makes the case that the virus affects this range from 20-64 in a very similar way. You’ll also note, however, that the 65+ group and the <20 group are very different. This is interesting, so let’s reason about this.

Why are the 65+ and <20 groups different regarding Case rates?

I think there are two different stories here. For the 65+ group, one just has to look at the deaths chart to note that they are the group that is much more at risk than any other. This is widely known by all including people in this age group! My guess is that they are the age demographic that is being the most cautious about COVID. If true, then their efforts (distancing, masks, etc.) appear to be effective. The second group that is being affected less as a proportion of their population is the under 20’s. This grouping is a bit unfortunate, as it includes both children and adults (I wish they would provide case numbers for under 13 as well as 13-20). Regardless, it does seem like this grouping is far less likely to get infected with COVID-19. My suspicion is that most of the infections that do occur in this group are in the 17-20 age range, but I can’t prove that. I would also guess that the significantly lower incidence of infection in this group is due to a combination of the lower mobility that people under 16 have and a more optimal immune system response.

Case Growth Linearity

Although the case growth appears non-linear when viewed by raw counts, only the groups between 20 and 64 appear non-linear when normalized. The 65+ and under 20 groups both appear to be mostly linear. What this means is that the rate of growth (i.e., 40 cases per day) stays consistent and doesn’t go from 40/day one week to 50/day the next week and so forth. This is interesting and probably indicates some sort of resistance to infection, either through natural causes or practical effort.
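One simple way to put a number on “mostly linear” is to compare the average daily increment late in the series with the average early in the series. This is only a sketch of that idea with made-up series; the helper names and the half-versus-half split are assumptions, not the method used for the charts.

```python
def daily_increments(cumulative):
    """Day-over-day change in a cumulative series."""
    return [b - a for a, b in zip(cumulative, cumulative[1:])]


def mean(values):
    return sum(values) / len(values)


def growth_ratio(cumulative):
    """Mean daily increment in the second half divided by the first half.

    A ratio near 1 suggests roughly linear growth; a ratio well above 1
    suggests the curve is accelerating.
    """
    inc = daily_increments(cumulative)
    half = len(inc) // 2
    return mean(inc[half:]) / mean(inc[:half])


# Hypothetical cumulative series, not real Arizona data.
linear_group = [40 * day for day in range(15)]             # steady 40/day
accelerating_group = [2 * day * day for day in range(15)]  # increments keep growing

print(round(growth_ratio(linear_group), 2))
print(round(growth_ratio(accelerating_group), 2))
```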

Government Measures to Slow the Rates

There have been a number of measures taken by state and local governments during this period that have attempted to slow the case growth rates. Many of these started on or around 6/19/2020 when both Maricopa and Pima Counties unveiled new mask ordinances. My expectation was that I would see case rates decrease (i.e., “flattening the curve”) about 10 days after the measures were put in place. As you can see, there is very little indication of any effect yet on case growth (or hospitalization or deaths). There is a slight slope change in Case Growth over the last 2 to 3 days, but there’s a good chance that’s just due to data collection issues. From my experience, I don’t trust any trend in the data that I can’t see over a full week.

A few reasons that might explain the lack of an effect from the recent COVID state and county ordinances:

Possibly people in these regions (especially the most affected parts of these counties) are not being consistent in their compliance to the new rules. In the areas around my home (a zip code that has been very lightly affected by COVID) I observe very impressive compliance, but I’ve noticed in other regions of the state that there is significantly less diligence around COVID-19 safety measures. How to measure this kind of compliance is interesting and this would make a good social study…
Perhaps it is still too early to observe an effect (and hopefully the three-day trend we can start to see will become more pronounced). My assumption that the lag between infection and symptoms and the lag from positive test to data being published would sum to something less than 2 weeks may have been wrong. If so, then we should be able to measure an impact and evaluate the lag times at some point in the future.
It’s possible that there is a different method of transmission that we’re not considering where people are getting infected in times when their masks are off and/or their social distancing is more lax. I’ve been thinking about heating, air conditioning, and ventilation for quite a while and was studying the aerosolization of the virus very early on. This might be one example of an unexpected transmission route. If this is true, then it might mean that the CDC and the State/County/City governments may need to re-evaluate their recommendations.

One Arizona county where there may be some measurable effect due to the Government measures is Pima County. It’s not dramatic, but it is a visible change in the slope of the 20-44 age grouping approximately two weeks after the county mask ordinance and new state measures were ordered. See chart below where I show raw case counts (not normalized… this is why it looks different). Unfortunately this effect is not seen very clearly yet in Maricopa County or Pinal County data.

Pima County Cases by Age Demographic – raw case counts – 7/13/20

Hospitalization Curves

It is interesting how the different age groups are seeing clear separation in their age-normalized hospital rates (whereas 3 of the groups had pretty identical case rates). Each age group (except the under 20 group, which is seeing little growth at all) is seeing mostly-linear growth in hospitalizations, but based on evidence from people working in the hospitals (the hospitals don’t publish data they’re not required to publish, so we’re left with stories from their workers) the combination of these is enough to overwhelm the limited resource of hospital beds and staff.

Death Curves

Deaths have been increasing over the last week or so and you can see that one of the reasons is that there is a slight acceleration of deaths in the over 65 group. To date, this group has accounted for over 85% of the deaths in the state. The fact that their death rate has accelerated over the last 10 or so days is concerning. Not really sure what to attribute this acceleration to (hospitalization was pretty linear for the two-three weeks prior to the acceleration starting).

July 11, 2020

COVID-19 Update: Arizona Summer Outbreak Distribution by Zip Code

It has been interesting to see that the distribution of Confirmed COVID cases in AZ has followed the Pareto Principle. This principle is also sometimes called the 80/20 law and essentially refers to a process where 80% of the effects result from 20% of the causes. There are lots and lots of examples of this in nature and society, such as:

20% of criminals commit 80% of crimes
80% of the world’s wealth is held by 20% of its population
Microsoft found that 20% of Software bugs resulted in 80% of the crashes
20% of drivers cause 80% of all traffic accidents
80% of pollution originates from 20% of all factories
20% of a company’s products represent 80% of sales
20% of employees are responsible for 80% of the results
20% of students have grades 80% or higher

Why is this principle useful? Not all issues follow this principle but a surprisingly large number do. Lots of business gurus have a strong intuition for problems that might be Pareto problems, because that gives them an easy place to attack (the 20%) in order to realize lots of gains (the 80%).

How does this apply to the current summer Arizona COVID-19 outbreak?

The map below shows the 20% of zip codes in Arizona that account for 80% of the cases (by the way, I checked, the top 20% of zip codes for cases per 1000 people also comes out to 80% of the cases). If I were running the state response, these are the zip codes I’d be focusing on. Probably my start would be to flood the areas with low- or zero- cost tests so I’d know exactly where the outbreaks are occurring in those regions and hopefully what the transmission vectors are. Perhaps that’s actually what is happening. If true, of course, it reinforces the perception of the problem because now testing would be non-uniform, focused mostly on the problem areas. But in the world of limited testing resources, I suppose this is the least-bad problem.

The 20% of Arizona Zip Codes that have accounted for 80% of the state’s COVID-19 Cases. Color of bubble represents number of cases (red=Lots, dark blue=Least) and diameter of bubble represents zip code population.
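Checking whether a dataset follows the 80/20 pattern takes only a sort and a sum. The sketch below assumes a simple mapping from zip code to case count; the zip codes and counts are invented for illustration and the function name `top_share` is hypothetical.

```python
def top_share(cases_by_zip, fraction=0.20):
    """Share of all cases that falls in the top `fraction` of zip codes."""
    counts = sorted(cases_by_zip.values(), reverse=True)
    top_n = max(1, round(len(counts) * fraction))
    return sum(counts[:top_n]) / sum(counts)


# Hypothetical zip codes and case counts, invented for illustration.
cases_by_zip = {
    "00001": 900, "00002": 700, "00003": 90, "00004": 80, "00005": 70,
    "00006": 60, "00007": 40, "00008": 30, "00009": 20, "00010": 10,
}

share = top_share(cases_by_zip)
print(f"Top 20% of zip codes hold {share:.0%} of cases")
```

Running the same check on cases per 1000 instead of raw counts only requires swapping the values in the dictionary.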

Analysis

First, you can see that the zip codes do correspond with large population centers. This makes sense since in this chart we’re evaluating raw numbers of cases. The light green and red colors inform us where the larger outbreaks in these population centers are. You can see that SW Phoenix, South Tucson, and the border regions (Yuma, Nogales, Douglas) are the highest affected areas. We also know that most of the cases in Arizona are people between 20 and 44 years old. One can assume that the 20% zip codes with the most cases may have even higher 20-44 representation than the rest of the state. I wonder if we’ll see these regions hit “herd immunity” (if that exists for this virus) earlier than the large numbers of zip codes that have low numbers of cases?

I have read multiple reports recently that the Arizona outbreak is primarily a mutation of the coronavirus that is characterized by high transmission and low mortality. I have no idea if this is really true, but if so, it does explain why the state has had so many cases with a correspondingly low number of hospitalizations and deaths (of course these are still occurring, but they’re only increasing on the order of 20-150 per day whereas cases have been increasing on the order of 2000-4000 a day for a few weeks now).

This makes me curious if this is the same mutation that is rampant in Northern Mexico right now. There is evidence (just see the map above) that areas with lots of essential travel to Mexico have very high numbers of COVID-19 cases. Sonora, Mexico, just over the border from Arizona, has very, very close ties with our state. They are advertising just over 10,000 cases (to Arizona’s 116,000) with 987 deaths (to Arizona’s 2151). Arizona’s population is over 2.5 times larger than Sonora’s, so if you scale these numbers by population you see Sonora reporting 3.5 cases and 0.34 deaths per 1000 people. Arizona is reporting 15.9 cases per 1000 people and 0.29 deaths per 1000. If all things are equal, this indicates that Sonora is likely under-reporting their cases by about a factor of 5 (at least… I hold that Arizona’s case numbers are at least 2x too low due to testing bias). This makes sense, as Arizona is testing like crazy and finding a lot of cases that are less-symptomatic, whereas I imagine Sonora may not be doing this. The spread of the virus between the US and Mexico (probably both directions) is also interesting because it reflects the culture of constant commerce and relationships between Mexico and Arizona and gives insight into how this virus travels a very complex route.
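The per-1000 figures above can be cross-checked against each other using only the numbers quoted in this post. The sketch backs out the populations implied by the quoted case rates (no outside population figures are used) and then recomputes the death rates and the gap in case rates; the variable names are illustrative.

```python
# Figures quoted in the paragraph above (case counts are approximate).
az_cases, az_deaths, az_cases_per_1000 = 116_000, 2_151, 15.9
son_cases, son_deaths, son_cases_per_1000 = 10_000, 987, 3.5

# Back out the populations implied by the quoted per-1000 case rates.
az_pop = az_cases / az_cases_per_1000 * 1000
son_pop = son_cases / son_cases_per_1000 * 1000

pop_ratio = az_pop / son_pop
az_deaths_per_1000 = az_deaths / az_pop * 1000
son_deaths_per_1000 = son_deaths / son_pop * 1000
case_rate_gap = az_cases_per_1000 / son_cases_per_1000

print(f"Implied populations: AZ {az_pop:,.0f}, Sonora {son_pop:,.0f}")
print(f"Population ratio: {pop_ratio:.2f}")
print(f"Deaths per 1000: AZ {az_deaths_per_1000:.2f}, Sonora {son_deaths_per_1000:.2f}")
print(f"AZ case rate is {case_rate_gap:.1f}x Sonora's")
```

The death rates come out close to each other while the case rates differ by a factor of four to five, which is the mismatch the under-reporting argument rests on.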

July 2, 2020

COVID-19 Update: US Cumulative Numbers and Current Rates by Latitude

Haven’t posted on this for a while specifically for US states… Reminder that the blue/salmon chart is cumulative numbers (i.e., since day one), so it reflects counts from the whole epidemic. The second chart (green/orange) represents current case and death growth rates.

Therefore, the first chart shows the total damage by latitude but the second chart shows the current hot-spots.

US Cases per 1K and Deaths per 1K by Latitude band – 6/30/20

The first chart doesn’t tell us what we don’t already know. Since about February the latitudes from 40-45 have taken the brunt of the COVID-19 outbreak and experienced the highest number of cases and deaths. These numbers include the recent outbreaks in the southern parts of the US, so despite that attention, they’re still lagging way behind the Northeast in total case and death count.

The second chart tells us part of what the newspapers have been saying. Latitudes 30-35 (Arizona, S. California, Dallas) have seen very high case growth but very low growth in deaths. Same for 25 to 30 (South Texas, Florida). The hardest hit regions in the Northeast, however, are seeing very little case growth or death growth. Lots of thoughts as to what this represents, but very interesting to see that the latitude effect is still in place, the latitude bands are just starting to shift. One thing that I observe, however, is that during the months the virus raged in the northern latitudes, the temperature was cool, heaters were on, and people were indoors (more or less). Now in the southern latitudes we see the reverse. Temperatures are over 100, air conditioners are on, and people are indoors.

July 1, 2020

COVID-19 Arizona Outbreaks by Zip Code – Correlation Study with Recommendations

During the early phases of COVID-19 I did some studies to find if there were significant measurable factors (smoking rate, diabetes rate, population density) that had high correlation with COVID case counts or death counts. These studies were revealing and sometimes identified interesting factors that have subsequently emerged as topics of interest regarding their relation to COVID-19.

The current accelerated wave of COVID-19 cases in Arizona has been very interesting from a data science perspective due to the total lack of uniformity of the spread. Hotspots for the virus have included border communities, native communities, inner cities, dense suburbs, and occasionally retirement communities. The inconsistency of spread in these types of regions has been surprising. Some border communities have been overwhelmed while others (Cochise County) have had few cases reported.

AZ Factor Correlation by Zip Codes

This study is to identify if there are factors that we have measured across AZ zip codes that might be correlated with Cases and Normalized Cases (cases per 1000 persons).

Correlation Matrix for Interesting factors across Zip Codes and COVID-19 Cases and Cases per 1000 persons

The above is the overall correlation matrix. Looking across the “Cases” and the “Cases_1000” rows will give the visual impact of correlation of other factors with these target features. See below for a numerical view of the correlations with these two targets.

Correlations of Factors with 1) Cases per 1000 people and 2) total number of Cases
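For readers who want to reproduce this kind of table, here is a minimal sketch of correlating a few factors with both targets. The post doesn’t say which tool produced the matrix, so this uses a hand-written Pearson correlation; the rows, column names, and values are all invented and are not the AZ DHS or usa.com data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical zip-code rows; the column names and values are invented.
zips = [
    {"population": 50_000, "cases": 900, "pct_renters": 55, "median_income": 38_000},
    {"population": 20_000, "cases": 150, "pct_renters": 30, "median_income": 61_000},
    {"population": 35_000, "cases": 500, "pct_renters": 48, "median_income": 42_000},
    {"population": 8_000, "cases": 160, "pct_renters": 25, "median_income": 35_000},
    {"population": 60_000, "cases": 400, "pct_renters": 40, "median_income": 75_000},
]

# Add the normalized target alongside the raw count.
for row in zips:
    row["cases_1000"] = row["cases"] / row["population"] * 1000

factors = ["population", "pct_renters", "median_income"]
targets = ["cases", "cases_1000"]

correlations = {
    (factor, target): pearson([r[factor] for r in zips], [r[target] for r in zips])
    for factor in factors
    for target in targets
}

for (factor, target), r in sorted(correlations.items()):
    print(f"{factor:>14} vs {target:<10} {r:+.2f}")
```

Keeping raw cases and cases per 1000 as separate targets is what lets the same factor show up with very different correlations in the two columns.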

Does this Tell Us Anything?

Question: where did the data come from? I pulled the cases by zip code from the AZ DHS COVID dashboard. The correlating factors are mostly pulled from usa.com, which is a real treasure trove of data from Census data or the American Community Survey data. I built up a dataset by merging these data on AZ Zip Codes. This is useful because it’s a view into a smaller regional area. Maricopa County has hundreds of zip codes and only a dozen or so could be said to be hard-hit with COVID-19 cases. Analyzing at the Maricopa County level, therefore, might not reveal much insight.
Question: Why are the numbers so different between Cases and Cases per 1000? To think about this, you may want to imagine a region with a very large population that is seeing a large number of cases (we’ll call this a “Type One” community) and compare it with a region with a much smaller population (say, 1/10th as large) that is seeing half the number of cases (we’ll call this a “Type Two” community). There is something noteworthy going on in each. In the first region, you have lots of people’s lives being impacted, but the effect may be uniformly distributed across the broader community. Perhaps the situation isn’t devastating to any particular portions of the community. Some groups in this region may even be completely unaffected despite the large number of affected people. Now compare this to the Type Two region, which has fewer cases but a higher percentage of infection. In this region, the situation may be related to a large factor unique to that community. Good examples of this are some of the smaller border regions in AZ, where there is a big, notable causal element (cases in Mexico?) driving the case growth across the whole region. This might help explain the differences between the correlations for Cases, which may be a more interesting measure for the Type One community, and Cases per 1000, which reflects the Type Two regions that have been more broadly penetrated by the virus.
Analysis of Cases per 1000 correlation: POSITIVE CORRELATIONS: I notice two factors that have greater than .10 correlation with cases per 1000. These are “Use of Public Transportation” and “Percent of Zip Code that works in the Transportation Industry”. Both of these are interesting because we all recognize that transportation is a centralizing function that may well be also transporting the virus between people. In NYC and NJ, there has been speculation since March that mass transportation was allowing the virus to spread faster. Similarly, people working in the transportation industry have been hit with Coronavirus outbreaks. In Tucson, the biggest outbreak outside care homes and prisons has been at the central UPS warehouse. It’s possible to imagine that Zip Codes that have a higher percentage of people relying on public transportation and have more people who work in the transportation industry (truck drivers, UPS/Amazon delivery drivers, Airport workers, etc.) may be more heavily affected by COVID-19 as a percentage of their overall population. One other positive correlation is the percentage of renters in the zip code. This seems like it might be a way to measure the connectedness of living arrangements. In particular, it does seem like zip codes with large numbers of apartment housing have higher COVID-19 case counts. This might indicate that this is a real relationship. NEGATIVE CORRELATIONS: Education, Median Age, Median Income. I had noticed early on a relationship where the zip codes with lower median income had much higher COVID-19 case counts. Indeed, these areas also seemed to be continuing to grow at faster rates. All three of these factors could be interconnected and may represent some causal element behind COVID-19 cases that is related to poverty. Median Age is generally lower in areas with low median income (and is likely one of many causes of low median income). Same applies to Education. In this study, however, lower levels of education seem to be more strongly related to higher COVID-19 cases per 1000 persons than any other factor.
Analysis of Correlations with Raw Case Count: Population and Density are at the top of this list, which makes sense (and doesn’t mean much). Regions with larger populations are typically more dense and therefore have more people get COVID-19. This describes the Type One community situation. Not much analysis is needed here. However, the percentage of renters pops up very high when correlated with raw cases. This makes the case that a zip code with a high number of renters (and probably large numbers of apartments) is going to be more likely to have a high number of COVID-19 cases. The renters measure seems far more related to overall high counts than high percentages of the population getting infected. It is also interesting to note that the correlation with percentage of public transportation users in a zip code is much lower for raw case counts than it was for cases per 1000. This tells me that a high percentage of renters is driving raw counts of COVID-19 but something else is driving Cases per 1000 people. So maybe we have two different kinds of COVID-19 situations. “Type One” is the large city with large numbers of apartment dwellers… this may be a place with lots of interactions and difficulty in distancing. The second type of community (“Type Two”) is one with a large causal driver (the border, a non-distancing culture, a meat packing plant). It would seem like a single solution may not impact both types of community equally. In the Type One community (high raw counts) Education and Median Age have flipped… it would seem like the median age of this kind of region has grown in importance for COVID-19 whereas education’s correlation is about the same. So my cursory analysis would be that renting and low median age are related (makes sense) and are pretty closely tied to reasons behind COVID-19 in the “Type One” region whereas Education levels and Public Transportation may be causal in both types of community. Having lots of workers in the transportation industry is less correlated with COVID-19 for Type One communities than it is for Type Two communities.
Boring things to note: Having lots of people in Service Occupations seems to have low to no correlation with COVID outbreaks in either kind of community. Therefore, I’d surmise that attempting to manage service industry companies and workers with a goal of preventing spread will have very low impact. Communities with high numbers of manufacturing workers seem to have higher numbers of both raw COVID-19 case counts as well as Cases per 1000 numbers. But the correlation is pretty low. This may make the case for governments to maintain some sort of oversight of the healthiness of manufacturing operations, but this data would indicate that closing manufacturing jobs won’t significantly prevent COVID-19 cases. And obviously, an increase in the number of “information workers” results in lower numbers of COVID-19 cases in both Type One and Type Two communities.

Conclusion and Recommendations

First off, my data is getting more granular and informational as I go down into zip codes, but there’s still lots to learn (and more factors to collect). I show manufacturing jobs aren’t driving COVID cases, but I don’t know of the existence of meat-packing plants and if that even qualifies as manufacturing, for instance. However, these results are interesting. It might begin to indicate what knobs to turn to “dial back” outbreaks. Recommendations:

By my estimation, it doesn’t seem likely that gyms and restaurants are appropriate knobs for slowing COVID growth. There’s little effect with high percentages of service employees in a region and wealthier communities (who may be more likely to be going to gyms and restaurants) are seeing much lower case levels.
To control raw numbers of cases, maybe a good knob to turn would be to investigate the role of high-density housing and public transportation and attack root causes that emerge.
For Type Two communities that the data reveals must have a larger causal problem, investigation into the unique qualities of that community might be a more effective intervention than broad shutdowns of their economy.
And finally, the data does show that low education is related to high cases, so doing a better job at educating the communities in relevant ways would be a strong play.

June 27, 2020

Arizona COVID-19 Update – 6/27/20

Chart comparing Pima and Maricopa County Confirmed Cases over Time – 6/27/20

The chart above is the one I’ve been thinking about putting together for quite a while now. It’s really busy, but it has a ton of information in it. Here’s how to read it.

Normalizing Case Counts by Population: I’m comparing both Pima and Maricopa counties (the two largest in the state by far) on a cases per 1000 basis. Why do I do this? If I compare them on raw numbers of cases, it looks like Pima County is doing SO much better than Maricopa because they have 1/4 the cases. However, Pima also has 1/4 the population. This is one way the news media exaggerates stories, probably because it looks stark and dramatic when you don’t compare appropriately. When I do this the right way, you can see that Pima and Maricopa had the same exact slope (more or less) up until Memorial Day. This is the purple arrow. After Memorial Day, we see case growth accelerate in both counties but a good deal faster in Maricopa County. This is the source of much of the overall case growth in Arizona.
Polynomial Trend Lines: The fat, light blue and pink lines are the trend lines for Maricopa and Pima respectively. These are both modeled with 3rd Order Polynomials, which essentially means that the formula to create the trend line is something like Ax^3 + Bx^2 + Cx + D. This essentially shows that the case growth is accelerating (curving upward). Almost every state’s case growth right now can be modeled with this same kind of function. The trend line allows us to do simple predictions for the next few days on what the case growth might be. It is not a good predictor for much more than a few days out because the situation is too complex for that.
Testing Numbers and Results: The yellow dashed line represents the numbers of tests on each day. I had to pull these numbers from the state’s online Dashboard manually because they won’t let us download it. So the data may be off by 5-10 tests per day. Note that the Test Numbers are represented on the secondary Y-axis (the one on the right). This can be confusing, but it allows me to provide more valuable visualizations. I also tried to capture the weekly percent positive for the tests. As you can see the percent of tests that are positive is growing. I’ll try to offer some possible explanations for that in my conclusion.
Data Lags: Note that I extend the blue “Stay at Home” rectangle about 10 days past the 5/17 expiration date. I do this to demonstrate that most of the data we see every day has the potential for being as much as 10 days old. Data collection isn’t very clean and efficient when dealing with health-related issues. Any time you look at COVID-19 data, whether it is from the CDC or the WHO or the AZ DHS, you need to remember that it’s likely reporting the state from a week or so beforehand. I’ve seen some embarrassing data analyses during this outbreak by professional media that did not account for the fact that recent data is likely to be underreported due to this lag. The testing numbers above are a good example of this. I have no reason to believe that AZ has slowed testing in the last week. We just don’t have the accurate numbers in yet.
Events/Triggers: I’ve labeled various events and triggers on the chart. The stay at home order and its expiration are interesting, as is Memorial Day. Face Masks became mandatory in AZ on about 6/20. You can be sure that will be added as an important event as the time goes on and more data comes in. My expectation is that we’ll see some sort of change in the trends in late June or early July (to account for the data lag but also the 14 day hospitalization cycle time).
Hospitalization: I’m not including hospitalization stats in this chart, first because the chart is already too busy, and second because I have a hard time trusting the states’ and counties’ hospitalization data, which all seem to contradict one another. Suffice it to say that right now hospitals in the state are jam packed with COVID cases and there’s not much margin (at least in the traditional sense).
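
For readers who want to try the first two steps themselves, here is a minimal sketch of the per-1000 normalization and a 3rd order polynomial trend line. It assumes NumPy is available, and the case series and populations are made up for illustration (they are not the Pima or Maricopa numbers). The function names are hypothetical, and this is not necessarily how the chart above was built.

```python
import numpy as np

def cases_per_1000(cumulative_cases, population):
    """Convert raw cumulative case counts to cases per 1000 residents."""
    return np.asarray(cumulative_cases, dtype=float) / population * 1000.0

def fit_cubic_trend(days, values):
    """Least-squares fit of A*x^3 + B*x^2 + C*x + D, returned as a np.poly1d."""
    return np.poly1d(np.polyfit(days, values, deg=3))

# Made-up illustrative data: county B has 1/4 the cases AND 1/4 the population.
days = np.arange(10)
county_a_cases = 2 * days**3 + 5 * days**2 + 30 * days + 100
county_b_cases = county_a_cases / 4
pop_a, pop_b = 4_000_000, 1_000_000  # hypothetical populations

a_rate = cases_per_1000(county_a_cases, pop_a)
b_rate = cases_per_1000(county_b_cases, pop_b)

trend = fit_cubic_trend(days, a_rate)
slope = trend.deriv()            # derivative of the trend = rate of change per day
forecast = trend(days[-1] + 3)   # extrapolate only a few days past the data

print("same per-1000 curve:", np.allclose(a_rate, b_rate))
print("slope on the last day:", slope(days[-1]))
print("3-day-ahead estimate:", forecast)
```

On raw counts county B looks four times better, but the per-1000 curves are identical. The derivative of the fitted polynomial gives the slope on any given day, and the extrapolation is only meant for a few days out.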

Analysis of the Chart

Comparison of Pima and Maricopa Cases per 1000: As mentioned above, it’s very interesting to me that the case growth up until Memorial Day in both counties is essentially linear and basically the same slope. Ending the stay at home order doesn’t seem to have dramatically impacted this case growth rate (even considering the data lag). Two events seem to have occurred simultaneously that may be causal for the dramatic case growth lately. First is Memorial Day. We see the exponential case growth start a few days after Memorial Day. It may well be that a number of people contracted the virus during Memorial Day activities (we’ll probably never know if the protests/riots contributed due to bad data on those events). Maybe there were super-spreader events during the holiday too that we haven’t identified. The second major event that certainly contributes is the doubling of testing that also started about this same time. The state was conducting an average of about 8K tests per day up until about June 4th when it doubled this to an average of about 16K tests per day. During the stay at home period there was an extreme bias in the tests toward sick patients because people could barely get tested unless they exhibited symptoms. Even then, only about 5% of tests were positive. The spreading that may have occurred around Memorial Day combined with the doubling of testing have resulted in not just doubling the number of cases, but exponential growth, because now the percent positive rates are approaching 20%.
Why are the Tests’ positivity rates so high? This is interesting to think about but here are a few possible reasons. First: There is a lot more virus out there now since Memorial Day and people are catching it. One telling stat, which I have shown a few times and which still holds, is that the growth rate of infections in the 65+ community is still the same now as it was during the stay at home order. In short, this demographic is still travelling down the same purple arrow! All other groups are reflecting the exponential growth trend. It is likely that the 65+ community is being just as careful now as it was during stay at home orders (and maybe group homes have also become more careful) and they’re avoiding the bloom in the virus. Everyone else is exposed to a larger population of the virus. This is speculation, but it makes sense. Second: It may well be that another kind of testing bias is emerging and now people who are more likely to be infected are more likely to get tested. For instance, since I can’t see WHERE the tests are being conducted, there’s a chance that a higher percentage of tests are coming from regions that are already having major outbreaks (border counties, native communities). This is possible, especially given that there appear to be clear indications that the virus is more prevalent in some areas than others. The only way to really prevent this bias is to do what some European countries did and randomize testing. Otherwise we have no real idea of what’s happening. Third point: I’m convinced that we’re not seeing issues with false positives on the PCR tests (but I still believe there are high false positives on the antibody tests that make them somewhat less informational right now).
Why are the Rates different for Pima and Maricopa County? First, one thing we’re seeing is that the rates can be very different in different regions. Not just across the world or across US states, but even by AZ Zip codes. After about 3 weeks of tracking this I’m still seeing the less wealthy zip codes have the highest overall numbers of cases per 1000 people AND the highest growth rates over time. This is interesting to analyze because it makes one curious about why this is happening. There are a number of hypotheses for this. It’s possible that people who are overall less healthy (maybe they don’t have good health care) may be more likely to get infected and then need to seek medical care. However, it does appear that this isn’t a very solid hypothesis when one looks at the demographics where the largest number of cases by far is in the younger, healthier age groups. Culture is one hypothesis I hear for this, where the cultures in less wealthy regions have evolved to rely on others much more than the cultures in wealthy regions require. There are also ethnic cultures and traditions that may have some causality. Also, based on this evidence, some of the activities that are more commonly engaged in the wealthier zip codes (dining out, going to the gym, etc.) may actually be less causal of infections than we thought. From my observation, also, the culture of mask wearing in Arizona is stronger in the wealthier zip codes than in less wealthy or rural zip codes. It’s possible that this has an impact, but time will tell how significant of one. Regardless, there’s still much to learn about this.
Case Severity: With this virus, just like with the flu, there is a very wide range of severity. Measuring cases is interesting from a numbers standpoint, but it is not a good representation for the severity of an outbreak. A very large majority of the new cases we’ve been measuring (and in some cases, stressing about) are asymptomatic (or low-symptomatic) cases that aren’t requiring medical attention. The better measure of severity is deaths, of course, but also hospital cycle times and capacity measures (because they’re leading indicators for deaths). The hospital measures are extremely hard for a number of reasons mentioned in an earlier article on this site… Hospitals and their staffs are clearly being stressed with the growth in severe cases (even though this growth is very small compared to the asymptomatic cases). Some of this is because this disease forces a 2-plus week cycle time on cases, something that appears to be extremely unusual for viral infections.
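
To make the testing arithmetic above concrete, here is a small sketch using the rounded figures from the text (about 8K tests per day at roughly 5% positive, then about 16K per day approaching 20% positive). The function name is hypothetical, and the inputs are averages, so the outputs are ballpark figures rather than reported case counts.

```python
def expected_daily_cases(tests_per_day, percent_positive):
    """Positive results per day implied by test volume and percent positive."""
    return tests_per_day * percent_positive / 100.0

before = expected_daily_cases(8_000, 5)    # stay-at-home period, rounded figures
after = expected_daily_cases(16_000, 20)   # after testing doubled, rounded figures
ratio = after / before

print(before, after, ratio)
```

Doubling the tests alone would double the daily positives. Doubling the tests while the positivity rate roughly quadruples multiplies them by about eight, which is why the curve bends upward so sharply.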

The Effect of Wealth on Cases in a Region

Cases by AZ Zip Codes. Sorted from Lowest Median Income (left) to highest – 6/27/20

Above is the latest comparison of COVID-19 cases per 1000 population compared to median income. Note that the lowest median income zip codes are on the left and the highest are on the right. The average number of cases per 1000 for the poorest 25 zip codes is 9. The average for the wealthiest 25 zip codes is 2.6. You can see the yellow trend line shows a decrease in cases from the left to the right (case counts are on a logarithmic scale on the right y-axis). The red line shows the actual cases per 1K for the zip codes. Note that you may not see your zip code labeled on the chart (only about every 10th zip code is labeled because otherwise the chart would extend around the room!).

Case growth follows this exact same trend. This means that the regions with the highest rate of change in their case counts (hot spots) tend to be on the left of this chart. This indicates that the overall trend of more cases in less wealthy areas is not changing.
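
The poorest-versus-wealthiest comparison is easy to reproduce once the zip codes are sorted by median income. Below is a minimal sketch. The rows are placeholder values (not the Arizona data), and the function name and tuple layout are assumptions made for illustration.

```python
from statistics import mean

def income_extremes(zip_rows, n):
    """Average cases per 1000 for the n poorest and n wealthiest zip codes.

    zip_rows: iterable of (zip_code, median_income, cases_per_1000).
    """
    ordered = sorted(zip_rows, key=lambda row: row[1])  # lowest income first
    poorest = mean(row[2] for row in ordered[:n])
    wealthiest = mean(row[2] for row in ordered[-n:])
    return poorest, wealthiest

# Placeholder rows for illustration only.
rows = [
    ("00001", 25_000, 9.0),
    ("00002", 30_000, 7.0),
    ("00003", 60_000, 4.0),
    ("00004", 90_000, 3.0),
    ("00005", 120_000, 2.0),
    ("00006", 150_000, 1.0),
]

poorest, wealthiest = income_extremes(rows, n=2)
print(poorest, wealthiest)
```

With the real data, n would be 25 to match the averages quoted above. Note that this averages the per-zip rates, so a tiny zip code counts as much as a large one.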

The Effect of Population Density

Cases by Population Density (by Zip Code) – 6/27/20

One explanation for the “wealth effect” is population density. This makes sense in light of the now-ubiquitous 6 Feet of Separation. Many of the lower income areas with high outbreaks are in zip codes that are known to have large numbers of people living in dense environments (apartment complexes, for instance). However, some of the regions with the highest outbreaks are rural and agricultural regions that have very low population density.

Overall, however, the chart above does show that the cases per 1K tends to go up as population density increases. The trend line is fit with an exponential function that has a decent (but not ideal) fit. Most likely, density is one component of the problem, but is likely not one of the larger components.
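
One common way to fit an exponential trend line of the form y = a·e^(bx) is ordinary least squares on the logarithm of y. The sketch below shows that approach with synthetic data. It is an assumption that the chart’s trend line was fit this way, and the density and case values are invented so that the fit can be checked.

```python
import math

def fit_exponential(xs, ys):
    """Fit y = a * exp(b * x) by least squares on ln(y). All ys must be > 0."""
    logs = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_log = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (l - mean_log) for x, l in zip(xs, logs))
    b = sxy / sxx                          # slope of ln(y) against x
    a = math.exp(mean_log - b * mean_x)    # intercept, back on the original scale
    return a, b

# Synthetic data generated from a known curve (not real zip code data).
density = [100, 1000, 2000, 4000, 8000]                  # people per square mile
cases = [1.5 * math.exp(0.0002 * d) for d in density]    # cases per 1000

a, b = fit_exponential(density, cases)
print(a, b)
```

Real zip code data will scatter widely around any such curve, which is consistent with the fit being decent but not ideal.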

The Effect of Median Age

COVID-19 cases by Zip Code Median ages – 6/27/20

Another interesting characteristic of some zip codes that may be driving higher case counts is median age. This makes sense, especially since we already know that most of the cases in the current outbreak are asymptomatic cases among younger people. Therefore, this chart tells a very clear story. Outbreaks are much higher in regions with lower median ages.

June 24, 2020

COVID-19 Update: The latest on Cases by Latitude

I’ve been fascinated throughout this outbreak at how it breaks across latitude ranges. Here’s the latest info on Latitude.

COVID-19 Cases by Latitude Range – Cumulative values of Cases and Deaths per 1000 – 6/23/20

The above chart shows cumulative numbers normalized by the population for each region. This shows the upper North latitudes continue to lead in total cumulative cases and deaths per 1000 persons but a couple of the South latitude ranges are starting to catch up (Brazil, Ecuador, Chile, Argentina, South Africa).

Notably, the middle latitude ranges are still far less affected cumulatively.

COVID-19 Instantaneous Rates of Change for Cases and Deaths by Latitude Range – 6/23/20

Now above we see the instantaneous rates of change across the latitude bands. This chart shows us where today’s hot spots are. Note that the US latitudes are very low because overall, US cases are low compared to population. Most of the US cases right now are happening in the 30-40 N latitude range (Arizona, Texas, California), but the rates in these locations are not large enough to show a significant spike in instantaneous rates (which are spiking for Africa and South America across the board right now). Note that the 20-30 N latitudes are starting to show increased slope in cases and deaths. This is largely due to a growing number of cases in India and Bangladesh.
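
For anyone curious how regions can be rolled up into latitude ranges, here is a minimal sketch. It assumes 10-degree bands and pools cases and population within each band before dividing, which may differ from how the charts above were built. The sample rows are invented.

```python
import math
from collections import defaultdict

def latitude_band(lat, width=10):
    """Return the (lower, upper) band containing lat, e.g. 33.4 -> (30, 40)."""
    lower = math.floor(lat / width) * width
    return (lower, lower + width)

def per_1000_by_band(regions, width=10):
    """regions: iterable of (latitude_degrees, cases, population)."""
    totals = defaultdict(lambda: [0, 0])  # band -> [cases, population]
    for lat, cases, population in regions:
        band = latitude_band(lat, width)
        totals[band][0] += cases
        totals[band][1] += population
    # Pool first, then divide, so large regions are weighted by population.
    return {band: 1000.0 * c / p for band, (c, p) in totals.items()}

# Invented rows: two northern regions and one southern region.
regions = [
    (33.4, 500, 100_000),
    (36.1, 300, 300_000),
    (-23.5, 900, 150_000),
]

result = per_1000_by_band(regions)
print(result)
```

Southern latitudes come out as negative bands, such as (-30, -20). The same grouping works for deaths or for daily rates of change in place of cumulative cases.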

June 20, 2020 (updated June 21, 2020)

COVID-19 Update: The Troubles with Hospitalization Data

Hospitalization data right now seems to be one of the most critical signals that a COVID-19 outbreak in a region is getting serious. However, hospitalization data is really hard to analyze for a number of reasons:

Hospitals don’t like to share data. In many cases in the United States (including Arizona) there was no hospitalization data during the first part of COVID-19. This was not the case with European countries. I can make a number of guesses about this including Health Information Privacy (HIPAA), inconsistent data collection, and even a sense of unwillingness by private and public hospitals alike to reveal too much about their business. However, with COVID-19 there seems to be a renewed sense that hospitals are a public good and need to be more transparent. Arizona has a new executive order (23-2020) governing the reporting of COVID-19 data.
There doesn’t seem to be a strong central governance around hospitalization data. Before COVID-19 this was always what I assumed that CDC did, but now I think it’s just a function of the state’s Health Department. I think that if CDC created guidelines and reporting rules, we would have much richer and much more predictive data sets around the health of people in the US. Until then, however, it requires someone to clean data, hand-build datasets, etc., to extract useful information.
Hospital data tends towards the anecdotal side. I have had many forwards from people on LinkedIn or Facebook that came from the cousin of their sister-in-law, who is a surgeon in New York explaining how overwhelmed the ICU is there in whatever town they serve. Then an hour later I get another forward from some connection in the very next county in New York explaining why their hospital has no COVID cases. This is very, very common. I think some of this is due to the above lack of transparency in hospitals, where data is even hidden from employees. I’ve had more than one person who works in a hospital in Arizona tell me at some point during the COVID-19 outbreak that there are only around XX people in the COVID ward right now, “But don’t tell anyone”. I don’t understand the perceived secrecy of this data, but due to the secrecy and poor data reporting, the anecdote tends to carry the day. Until the next day when the opposite story comes out.
Hospitalizations classified as COVID-19 may not have initially sought treatment for COVID. Florida is starting to run into a new kind of COVID-asymptomatic hospital patient who seeks care for an unrelated issue (broken leg, etc.) and then is tested and found to have COVID-19. This is challenging. Does the patient need to go to the COVID ward? Initially, it seems that yes, they did, but now the state is starting to handle these patients differently (and save the COVID ward for those with COVID symptoms). This is unlikely to be affecting ICU bed numbers, of course, but is possibly affecting inpatient bed counts (which are already reaching maximums as well).
The hospital business and processes are not well understood by the layman, even by the hospital employee at times. This results in lack of understanding of the real meaning behind a data visualization.

Overview of Arizona ICU Bed Management during COVID-19

One thing that is very interesting to me is ICU bed management. Obviously hospitals want to leverage their invested ICU bed capital to make money. This would seem to require limited excess capacity in the ICU (i.e., most beds full most of the time). During COVID-19 one of the earlier stories was how COVID would overrun the ICUs at most hospitals. I believe this did happen to an extent in New York, but it hasn’t happened yet in Arizona. COVID-19 patients are still less than 40% of all the ICU beds occupied in Arizona, but the number has been growing. See the chart below, which shows the percentage of all occupied ICU beds that have a COVID-19 patient in them.

Thoughts:

It seems like the COVID patients peaked as a percentage of the total ICU bed population in mid-April and then gradually tapered off until the lockdown easing was fairly much complete. We then see acceleration in cases drive up the percentage of COVID-19 patients to near 40%. Note that 15+% of ICU beds are still unoccupied (though I’m not sure if they’re in the right places). But clearly, whoever the non-COVID patients in the ICU are, they’re decreasing. There may be an element of elective surgeries in the non-COVID ICU population, but I don’t think they’re as many as usual.
I can’t fully get my arms around what this chart is telling me, other than perhaps it shows that hospitals know how to manage their ICU bed resources. The total percentage of ICU beds filled in the state has gone from 74% around 5/21 to about 85% today (total about 130 beds). They have done this while COVID-19 cases in the ICU have increased by about 230 people (hence, the now-higher percentage of COVID patients in ICU beds). I don’t know how they made up those extra 100 people, but they did it somehow. They have some margin to work with, I suppose, because even today, 60+% of all inhabitants of the ICU are non-COVID.
This management is why the increase in hospital bed numbers has been linear while the COVID-19 case growth has been exponential. Here’s a view of hospitalization just compared with the numbers of 65+ COVID-19 cases. Note that the 65+ group which looks pretty linear when compared to the 20-44 age group cases still looks exponential when compared to the hospitalization (especially the ICU) rates.

Comparison of AZ Confirmed Cases over 65 with hospitalization Rates (Maricopa County Data)
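
The ICU bed arithmetic in the thoughts above can be written out as a back-of-the-envelope check. The inputs are the rounded figures quoted there. The implied total bed count is only an estimate that assumes the number of ICU beds in the state stayed fixed over the period, and the function name is hypothetical.

```python
def icu_bed_check(share_before, share_after, added_occupied, added_covid):
    """Return (change in non-COVID ICU patients, implied total ICU beds)."""
    non_covid_change = added_occupied - added_covid
    # Assumes the total number of ICU beds did not change over the period.
    implied_total = added_occupied / (share_after - share_before)
    return non_covid_change, implied_total

non_covid_change, implied_total = icu_bed_check(
    share_before=0.74,   # share of ICU beds filled around 5/21
    share_after=0.85,    # share of ICU beds filled now
    added_occupied=130,  # net additional occupied beds
    added_covid=230,     # additional COVID-19 patients in the ICU
)

print(non_covid_change, round(implied_total))
```

Occupied beds rose by about 130 while COVID patients rose by about 230, so roughly 100 non-COVID patients must have left the ICU over the same period. That is the “extra 100 people” mentioned above.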

June 20, 2020 (updated June 21, 2020)

COVID-19 Arizona Update – Where are the Cases Happening?

The state AZHS Dashboard provides a download and a map of COVID-19 Cases by Zip code. I was playing with the data and noticed that most of the regions with higher cases per 1000 were areas that were known to have lower median incomes. This intrigued me, because we really don’t know much about who is involved in this current wave of COVID-19 infections (other than the age demographics that I presented in previous articles).

COVID-19 Outbreak Info by the richest and poorest zip codes in Arizona

First caveat… due to tribal regulations, I have zero data for tribal regions, many of which would qualify as areas with very low median incomes. This is too bad and if anyone from these regions is interested in having their data analyzed securely, please contact me.

COVID-19 Cases per 1K by AZ Zip Code compared to Median Income – 6/20/20

My thoughts on this chart:

It is pretty clear that COVID-19 outbreaks are much higher in Zip Codes with lower median incomes. The yellow trend line on the chart shows an R² score of just under 0.5, which indicates that the trend is a pretty solid fit considering this is real-world data. The average number of cases per 1000 in the poorest 20 zip codes is over 9 and the average of cases per 1000 for the richest 20 zip codes is 1.9. Even if you subtract the Yuma and Nogales zip codes that have some of the highest case rates in the country, the average for the 20 poorest zip codes is 6.5. I posted this chart on Facebook looking for theories on why the situation is this pronounced. There was a lot of good discussion about this, and it is clear that whatever is causal for this disparity, it is made up of multiple cultural and economic variables.
Lots of businesses that were shut down during the state’s COVID-19 lockdown (gyms, restaurants, and even churches) probably have a much higher representation in the wealthier zip codes, where very large gyms, restaurants, and churches thrive. This would be an interesting study. It does seem like the current outbreak probably has much less to do with these kinds of businesses than we would have guessed.
Some of these regions that have much higher cases per 1000 people are agricultural areas. Most of these particular regions also have a very low death rate. Perhaps there’s something they’re doing that makes them more likely to get infected but less likely to be badly affected.
I also suspect that one influential variable is mask usage. My observation in Tucson is that the Foothills region has been much more diligent around face coverings than other parts of Tucson (and certainly rural Arizona). This may be one reason the Tucson Foothills zip code COVID-19 cases per 1000 are extremely low. This may also apply to regions in Phoenix that are similar to the Tucson Foothills.
Now that Pima and Maricopa counties are mandating face coverings in public, we have a great opportunity for a natural experiment on the value of Face Coverings. My guess would be that we’ll see the case count flatten out in about two weeks. The question is whether this would have happened anyway. Hopefully we can compare mask vs. no mask regions afterwards.
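
Since the R² score comes up in the thoughts above, here is a minimal sketch of how that score is usually defined: one minus the ratio of the squared error around the trend line to the squared error around the plain average. The numbers are toy values, not the zip code data.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

score = r_squared(observed, predicted)
print(round(score, 4))
```

A score of 1 means the trend line passes through every point, and a score of 0 means it does no better than the average. Spreadsheet trend lines for non-linear fits may compute this on transformed values, so scores from different tools are not always directly comparable.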

Zip Codes with the largest Percent Increases in COVID-19 Cases

Arizona had a few really big case numbers in the days since I posted my first chart comparing cases with median income. Below are the zip codes that had the highest percent increase in the last 2 days. As you can see, these areas of fastest increase are generally in lower-income areas.
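
The ranking by percent increase can be sketched as follows. The zip code labels and counts are placeholders, and the function names are hypothetical. Zip codes that start from zero cases are skipped because a percent increase is undefined for them.

```python
def percent_increase(previous, current):
    """Percent change from previous to current, or None if previous is 0."""
    if previous <= 0:
        return None
    return 100.0 * (current - previous) / previous

def fastest_growing(counts_by_zip, top_n=3):
    """counts_by_zip: {zip_code: (cases_two_days_ago, cases_today)}."""
    changes = []
    for zip_code, (previous, current) in counts_by_zip.items():
        change = percent_increase(previous, current)
        if change is not None:
            changes.append((zip_code, change))
    changes.sort(key=lambda item: item[1], reverse=True)  # largest increase first
    return changes[:top_n]

# Placeholder counts for illustration only.
counts = {"A": (100, 120), "B": (50, 75), "C": (200, 210), "D": (0, 5)}

ranking = fastest_growing(counts)
print(ranking)
```

One caution with this measure is that zip codes with very few cases can show huge percent increases from just a handful of new cases, so it helps to read the ranking next to the raw counts.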

COVID-19 Cases by Zip Codes Sorted by Median Age

Unsurprisingly, the zip codes that trend younger are also showing a higher case count in general. This aligns with data where we see the 20-44 age group far outpacing the others in new cases. Interesting trivia: Based on the data from usa.com (which came from the American Community Survey of 2010-2014), Colorado City’s median age is 15.

COVID-19 Cases Sorted by Median Age for AZ Zip Codes – 6/20/20

COVID-19 Cases Plotted Against Population Density

Following a similar approach, I also put together a scatter plot showing COVID-19 Cases per 1K people plotted against the Population Density of a Zip Code. The trend is one of the strongest yet (the R2 Coefficient is .29 which is usually pretty decent with non-laboratory data). Not a real surprise, but I imagine that density might be a good proxy for large apartment complexes. I’m thinking about ventilation, etc., when I wonder if people in apartment complexes (perhaps less expensive ones have poorer filtration?) have a higher risk of becoming infected with COVID-19.