COVID-19 Arizona Outbreaks by Zip Code – Correlation Study with Recommendations

During the early phases of COVID-19 I did some studies to find if there were significant measurable factors (smoking rate, diabetes rate, population density) that had high correlation with COVID case counts or death counts. These studies were revealing and sometimes identified interesting factors that have subsequently emerged as topics of interest regarding their relation to COVID-19.

The current accelerated wave of COVID-19 cases in Arizona has been very interesting from a data science perspective due to the total lack of uniformity of the spread. Hotspots for the virus have included border communities, native communities, inner cities, dense suburbs, and occasionally retirement communities. The inconsistency of spread in these types of regions has been surprising. Some border communities have been overwhelmed while others (Cochise County) have had few cases reported.

AZ Factor Correlation by Zip Codes

This study is to identify if there are factors that we have measured across AZ zip codes that might be correlated with Cases and Normalized Cases (cases per 1000 persons).

Correlation Matrix for Interesting factors across Zip Codes and COVID-19 Cases and Cases per 1000 persons

The above is the overall correlation matrix. Looking across the “Cases” and the “Cases_1000” rows will give the visual impact of correlation of other factors with these target features. See below for a numerical view of the correlations with these two targets.

Correlations of Factors with 1) Cases per 1000 people and 2) total number of Cases

Does this Tell Us Anything?

  1. Question: where did the data come from? I pulled the cases by zip code from the AZ DHS COVID dashboard. The correlating factors are mostly pulled from, which is a real treasure trove of data from Census data or the American Community Survey data. I built up a dataset by merging these data on AZ Zip Codes. This is useful because it’s a view into a smaller regional area. Maricopa County has hundreds of zip codes and only a dozen or so could be said to be hard-hit with COVID-19 cases. Analyzing at the Maricopa County level, therefore, might not reveal much insight.
  2. Question: Why are the numbers so different between Cases and Cases per 1000? To think about this, you may want to imagine a region with a very large population who is seeing a large number of cases (we’ll call this a “Type One” community) and comparing it with a region with a much smaller population (say, 1/10th as large) that is seeing half the number of cases (we’ll call this a “Type Two” community). There is something noteworthy going on in each. In the first region, you have lots of people’s lives being impacted, but the effect may be uniformly distributed across the broader community. Perhaps the situation isn’t devastating to any particular portions of the community. Some groups in this region may even be completely unaffected despite the large number of affected people. Now compare this to the Type Two region who has less cases but a higher percentage of infection. In this region, the situation may be related to a large factor unique to that community. A good example of this are some of the smaller border regions in AZ, where there is a big, notable causal element (cases in Mexico? ) driving the case growth across the whole region. This might help understand the differences between the correlations for Cases — which may be a more interesting measure for the Type One community and Cases per 1000 — which reflects on the Type Two regions that have been more broadly penetrated by the virus.
  3. Analysis of Cases per 1000 correlation: POSITIVE CORRELATIONS: I notice two factors that have greater than .10 correlation with cases per 1000. These are “Use of Public Transportation”, and “Percent of Zip Code that works in the Transportation Industry”. Both of these are interesting because we all recognize that transportation is a centralizing function that may well be also transporting the virus between people. In NYC and NJ, there has been speculation since March that mass transportation was allowing the virus to spread faster. Similarly, people working in the transportation industry have been hit with Coronavirus outbreaks. In Tucson, the biggest outbreak outside care homes and prisons has been at the central UPS warehouse. Its possible to imagine that Zip Codes that have a higher percentage of people relying on public transportation and have more people who work in the transportation industry (truck drivers, UPS/Amazon delivery drivers, Airport workers, etc.) may be more heavily affected by COVID-19 as a percentage of their overall population. One other positive correlation is the percentage of renters in the zip code. This seems like it might be a way to measure the connectedness of living arrangements. In particular, it does seem like zip codes with large numbers of apartment housing have higher COVID-19 case counts. This might indicate that is a real relationship. NEGATIVE CORRELATIONS: Education, Median Age, Median Income. I had noticed early on a relationship where the zip codes with lower median income had much higher COVID-19 case counts. Indeed, these areas also seemed to be continuing to grow at faster rates. All three of these factors could be interconnected and may represent some causal element behind COVID-19 cases that is related to poverty. Median Age is generally lower in areas with low median income (and is likely one of many causalities for low median income). Same applies to Education. In this study, however, lower levels of education seem to be more strongly related to higher cases of COVID-19 cases per 1000 persons than any other factor.
  4. Analysis of Correlations with Raw Case Count: Population and Density are at the top of this list, which makes sense (and doesn’t mean much). Regions with larger populations are typically more dense and therefore have more people get COVID-19. This describes the Type One community situation. Not much analysis is needed here. However, the percentage of renters pops up very high when correlated with raw cases. This makes the case that a zip code with a high number of renters (and probably large numbers of apartments) is going to be more likely to have a high number of COVID-19 cases. The renters measure seems far more related to overall high counts than high percentages of the population getting infected. It is also interesting to note that the correlation of case count with percentage of public transportation users in a zip code is much lower for raw case counts than it was for cases per 1000. This tells me that a high percentage of renters is driving raw counts of COVID-19 but something else is driving Cases per 1000 people. So maybe we have two different kinds of COVID-19 situations. “Type One” is the large city with large numbers of apartment dwellers… this may be a place with lots of interactions and difficulty in distancing. The second type of community (“Type Two”) is one with a large causal driver (the border, a non-distancing culture, a meat packing plant). It would seem like a single solution may not impact both types of community equally. In the Type One community (high raw counts) Education and Median Age have flipped… it would seem like the median age of this kind of region has grown in importance about COVID-19 whereas education’s correlation is about the same. So my cursory analysis would be that renting and low median age are related (makes sense) and are pretty closely tied to reasons behind COVID-19 in the “Type One” region whereas Education levels and Public Transportation are causal in both types of community. Having lots of workers in the transportation industry is less correlated with COVID-19 for Type One communities than it is for Type Two communities.
  5. Boring things to note: Having lots of people in Service Occupations seems to have low to no correlation with COVID outbreaks in either kind of community. Therefore, I’d surmise that attempting to manage service industry companies and workers with a goal of preventing spread will have very low impact. Communities with high numbers of manufacturing workers seem to have higher numbers of both raw COVID-19 case counts as well as Cases per 1000 numbers. But the correlation is pretty low. This may make the case for governments to maintain some sort of oversight of the healthiness of manufacturing operations, but this data would indicate that closing manufacturing jobs won’t significantly prevent COVID-19 cases. And obviously, an increase in the number of “information workers” results in lower numbers of COVID-19 cases in both Type One and Type Two communities.

Conclusion and Recommendations

First off, my data is getting more granular and informational as I go down into zip codes, but there’s still lots to learn (and more factors to collect). I show manufacturing jobs aren’t driving COVID cases, but I don’t know of the existence of meat-packing plants and if that even qualifies as manufacturing, for instance. However, these results are interesting. It might begin to indicate what knobs to turn to “dial back” outbreaks. Recommendations:

  1. By my estimation, it doesn’t seem likely that gyms and restaurants are appropriate knobs for slowing COVID growth. There’s little effect with high percentages of service employees in a region and wealthier communities (who may be more likely to be going to gyms and restaurants) are seeing much lower case levels.
  2. To control raw numbers of cases, maybe a good knob to turn would be to investigate the role of high-density housing and public transportation and attack root causes that emerge.
  3. For Type Two communities that the data reveals must have a larger causal problem, investigation into the unique qualities of that community might be a more effective intervention than broad shutdowns of their economy.
  4. And finally, the data does show that low education is related to high cases, so doing a better job at educating the communities in relevant ways would be a strong play.

Leave a Reply

Your email address will not be published. Required fields are marked *