COVID-19 Special Update – Can Unsupervised Machine Learning Predict Outbreaks?

Maybe that’s a provocative title, but one of the questions I’m most curious about is whether measurable factors about a locality can be used to predict how that locality fares in a COVID-19 outbreak. I’ve previously attacked this through a correlation study using features measured by the WHO and the World Bank (see LINK here). This project is another attempt to address the question.


The U.S. Census Bureau has an online tool called QuickFacts. It’s a really nice tool that lets you pull a lot of information about localities in the US (cities, states, counties, etc.). The information covers broad areas of each locality: population, age/race demographics, housing, family/living arrangements, computer/internet access, education, health, economy, transportation, income, business info, and geography/density. As you can see, this amounts to a whole lot of data about specific localities. See image below. The downside of this tool is that I haven’t yet found a way to automate pulling the data, so I had to collect it by hand for a number of carefully selected counties. My data collection strategy was to make sure I captured counties with a wide range of COVID-19 impact as well as counties of different sizes and types. Once I had captured a number of counties in the QuickFacts tool, I blended in my own Deaths per 1000 population statistic for each county.


Unsupervised learning is a form of machine learning that finds hidden structure in data when no natural label is present. I chose this approach to evaluate whether the Census QuickFacts data could be used to build a predictive model for COVID-19 impact because it provides a more visual and explainable way of evaluating the model. It also lets me demonstrate results reasonably well despite a small dataset. Both of these reasons should become more evident a few frames down.

QuickFacts gives me 65 data features for each locality, which is far too many to evaluate with ordinary visualization-based analytics. In general, the human brain is wired for three dimensions of data (x, y, and z, or length, width, and height), which is why 3D visualizations are easy for us to consume. Add a few more dimensions, however, and it becomes very hard for our brains to see the patterns.

To get around this problem and create a model that lends itself to human visualization, the first step in my approach is to run an algorithm called Principal Component Analysis (PCA). In a nutshell, PCA takes X features of data and returns n uncorrelated features, ordered so that the first few capture as much of the variation in the original data as possible. In my case X is 65 and I choose n to be 2, which lets me put the data into a 2D plot. This very clever trick was invented by the great statistician Karl Pearson over 100 years ago.

The downside is that when I plot Principal Component 1 on the X axis and Principal Component 2 on the Y axis, the X-Y relationship has no obvious meaning, because all I know about PC1 and PC2 is that they are orthogonal views of my 65 data features. What you have to keep in mind, though, is that even though we can’t explain to our boss what this relationship really means, we DO know that the Principal Component space captures real information and variation drawn from all 65 features.
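For readers who want to see the mechanics, here is a minimal sketch of the PCA step using NumPy. It is not the code behind my charts: the data is random stand-in data with the same shape as a small QuickFacts table, the function name `pca_2d` is made up for illustration, and standardizing each feature before the decomposition is an assumption on my part (it keeps big-number features like population from swamping percentages).

```python
import numpy as np

def pca_2d(features):
    """Project rows (counties) of `features` onto the first 2 principal components."""
    X = np.asarray(features, dtype=float)
    # Assumption: standardize each column to zero mean and unit variance.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - X.mean(axis=0)) / std
    # SVD of the standardized data; the rows of Vt are the principal directions.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    components = Vt[:2]               # shape (2, n_features)
    scores = Z @ components.T         # shape (n_counties, 2): PC1 and PC2 per county
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained[:2]

# Synthetic stand-in: 20 "counties" with 65 features each (NOT real Census data).
rng = np.random.default_rng(0)
fake_counties = rng.normal(size=(20, 65))
scores, components, explained = pca_2d(fake_counties)
print(scores.shape)   # one (PC1, PC2) pair per county
print(explained)      # share of total variance captured by PC1 and PC2
```

The `explained` values are worth a look on real data: if PC1 and PC2 together capture only a small share of the variance, the 2D plot is hiding a lot.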
If you believe me that the location of a datapoint (a county in our case) in PC-space is important, then you can start to understand why this approach is useful. The diagram below shows what these 65 features look like once they are crunched into 2 Principal Components. To make it clearer which datapoints are most similar, I also run an algorithm called K-Means, a simple unsupervised clustering algorithm: I tell it how many clusters I believe there are (I chose 6 for this example) and it fits the data to that number of clusters. The clusters are identified on the chart below by the large blue numbers. Note that the crude red and green enclosures and the “Heavily Affected” and “Lightly Affected” labels were added by hand after the plot was generated.
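As a rough illustration of what K-Means does, here is a bare-bones version of the standard algorithm (Lloyd’s iterations) in NumPy. Treat it as a sketch: the points are random stand-ins for (PC1, PC2) pairs, the `kmeans` function and its parameters are hypothetical names, and a real library implementation adds smarter initialization and multiple restarts.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm: returns (labels, centroids)."""
    pts = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centroids = pts[rng.choice(len(pts), size=k, replace=False)]

    def assign(cents):
        # Distance from every point to every centroid, shape (n_points, k).
        dists = np.linalg.norm(pts[:, None, :] - cents[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        # Move each centroid to the mean of its points (keep it if it has none).
        new_centroids = np.array([
            pts[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign(centroids), centroids

# Synthetic stand-in for 30 counties already projected into PC-space.
demo_points = np.random.default_rng(1).normal(size=(30, 2))
labels, centroids = kmeans(demo_points, k=6)
print(labels)
```

K-Means can settle on different clusterings depending on the starting points, so with a dataset this small it is worth re-running with a few different seeds to see how stable the clusters are.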

What the Unsupervised Learning Tells us

When I run this algorithm and build this plot, I can see a clear boundary between the counties on the left of the diagram and the counties on the right. At this point I don’t know what that boundary means; that takes a further evaluation, which I show below. I dump all my data, including cluster IDs, into a table and then blend in the Deaths per 1000 population numbers for these counties.

Once I sort the data by cluster and apply conditional formatting to the Deaths per 1000 column, I can see a crude trend emerge. In clusters 0, 1, and 4 I see more COVID-19 impact than in clusters 2, 3, and 5. Noting this and returning to the PCA chart, you can see that the more heavily affected clusters are on the left side of the chart and the more lightly affected clusters are on the right.
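The sort-and-eyeball step can also be done in a few lines of code. This is only a sketch: the county names and numbers below are placeholders, not my actual data, and `deaths_by_cluster` is a made-up helper name.

```python
from collections import defaultdict
from statistics import mean

def deaths_by_cluster(rows):
    """rows: iterable of (county_name, cluster_id, deaths_per_1000).
    Returns {cluster_id: average deaths per 1000}, ordered by cluster ID."""
    groups = defaultdict(list)
    for _county, cluster_id, deaths_per_1000 in rows:
        groups[cluster_id].append(deaths_per_1000)
    return {cid: mean(vals) for cid, vals in sorted(groups.items())}

# Placeholder rows for illustration only (NOT real county figures).
rows = [
    ("County A", 0, 1.0),
    ("County B", 0, 0.5),
    ("County C", 2, 0.25),
    ("County D", 2, 0.125),
]
summary = deaths_by_cluster(rows)
for cluster_id, avg in summary.items():
    print(f"cluster {cluster_id}: {avg:.3f} deaths per 1000 (average)")
```

With only a handful of counties per cluster, an average can be dominated by one extreme county, so it is worth looking at the individual values too.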

Of course there are exceptions and strangeness that I can’t readily explain here. Maricopa County is clustered with two other large urban areas (Chicago and Seattle), both of which were hard hit. But when I look at that cluster, it’s not exceptionally tight: there is some Principal Component “distance” between all three, and I believe this distance is meaningful. Another strange cluster is number 4, which includes a number of lightly hit suburbs outside the Northeast alongside the worst-hit county in America, New York. New York’s presence perhaps explains why cluster 4 sits on the same side of the chart as the more heavily hit clusters, but I have no idea why these counties are grouped together. There’s a reason, but I can’t decipher it without a lot of digging (which I just don’t have time to indulge in). Overall, though, this is an interesting trend.

How this could be used

If I were able to collect significantly more data and continued to see this trend, where location on the PC graph correlates strongly with deaths, then I could run PCA on a number of counties that had very few COVID-19 cases and evaluate where they landed on the PC graph. If a county landed in the area occupied by a hard-hit cluster of counties, that would be an indicator that the county shares characteristics with those counties and might be at greater risk from COVID-19. Not a certainty, but even an indicator of risk might trigger extra precautions (and even save lives).
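One way this could work in practice is to fit the PCA once on the counties already studied, then push each new county through the same transformation and see which cluster center it lands closest to. The sketch below assumes that workflow; the data is synthetic, the cluster centers are invented, and the helper names (`fit_projection`, `project`, `nearest_cluster`) are hypothetical.

```python
import numpy as np

def fit_projection(train_features, n_components=2):
    """Learn the standardization and principal directions from known counties."""
    X = np.asarray(train_features, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    _, _, Vt = np.linalg.svd((X - mean) / std, full_matrices=False)
    return mean, std, Vt[:n_components]

def project(features, mean, std, components):
    """Map one or more counties into the already-fitted PC-space."""
    X = np.atleast_2d(np.asarray(features, dtype=float))
    return ((X - mean) / std) @ components.T

def nearest_cluster(point, centroids):
    """Index of the cluster center closest to `point`."""
    dists = np.linalg.norm(np.asarray(centroids, dtype=float) - np.asarray(point, dtype=float), axis=1)
    return int(dists.argmin())

# Synthetic stand-ins (NOT real data): 20 known counties, 1 new county, 65 features.
rng = np.random.default_rng(2)
train = rng.normal(size=(20, 65))
new_county = rng.normal(size=65)

mean, std, comps = fit_projection(train)
pcs = project(new_county, mean, std, comps)      # shape (1, 2)
fake_centroids = [[-3.0, 0.0], [3.0, 0.0]]       # invented cluster centers
print(nearest_cluster(pcs[0], fake_centroids))
```

The important design point is that the new county is projected with the mean, scale, and components learned from the original counties. Re-fitting PCA on the new data would produce different axes, and positions on the two plots would no longer be comparable.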

Other Work I’ve done on This Idea

I mentioned my notion that the PC distance between counties might also represent something real and correlate separately with death rates. I did a quick experiment where I calculated the PC distance between each pair of counties using the Pythagorean theorem (that is, the straight-line distance on the 2D PC plot) and then graphed the difference in Deaths per 1000 for the two counties against the PC distance between them. The results are a bit noisy, but I’ll paste the overall results below for you to review. As you can see, there are three major outliers: NYC, which has been crazily hard-hit, and Arizona/LA, both of which have been lightly hit. The coefficient of determination (R²) of 0.12 tells me that the trend line in the lower portion of the chart is not a good fit. My eyes tell me the same thing. Therefore, I can’t create a good model that relates the Death Rate to the Distance using all the data. I tried different things, like removing the outliers, and the trend line on the data in the lower left of this chart gets about as high as an R² of 0.45, which is interesting but certainly not compelling.
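For anyone who wants to reproduce this kind of experiment, here is a minimal sketch of the two calculations: the pairwise PC distance and the R² of a straight-line fit. It assumes the “difference” in Deaths per 1000 is taken as an absolute value, which may not match exactly what I graphed; the county tuples are placeholders and the function names are made up.

```python
from itertools import combinations
from math import hypot

def pairwise_gaps(counties):
    """counties: list of (name, pc1, pc2, deaths_per_1000).
    Returns one (pc_distance, abs_death_rate_difference) tuple per county pair."""
    pairs = []
    for (_, x1, y1, d1), (_, x2, y2, d2) in combinations(counties, 2):
        # Pythagorean theorem on the 2D PC plot.
        pairs.append((hypot(x1 - x2, y1 - y2), abs(d1 - d2)))
    return pairs

def r_squared(xs, ys):
    """R² of the ordinary least-squares line y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Placeholder values for illustration only (NOT real county figures).
demo = [("A", 0.0, 0.0, 0.0), ("B", 3.0, 4.0, 1.0), ("C", 6.0, 8.0, 3.0)]
gaps = pairwise_gaps(demo)
distances = [g[0] for g in gaps]
differences = [g[1] for g in gaps]
print(r_squared(distances, differences))
```

Note that n counties produce n(n-1)/2 pairs, and those pairs are not independent of one another (every county appears in many pairs), so an R² from this kind of plot should be read as a rough indicator rather than a formal test.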

Stuff that Remains

I’d like to collect more data, and to keep collecting it as the COVID-19 outbreak progresses. There MAY be a better relationship between the deaths and the PC distance, but we may not be able to see it until the disease progresses further. I might also spend some calories looking into automating the pull of the Census QuickFacts data. It’s too time-consuming to gather manually the amount of data I think we need.

Supervised learning. There are additional approaches using supervised learning that we could try in order to map the QuickFacts features to the Deaths per 1000 label. These could also be used to build a predictive model. I chose the unsupervised approach first so I could demonstrate it with better visualizations, but I have much better algorithms at my disposal using supervised learning. This needs to wait for more data, unfortunately, so stay tuned.
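To give a flavor of what the supervised version might look like, here is a sketch using ridge regression (ordinary linear regression with a penalty on large coefficients). Ridge is just one possible choice that I am plugging in for illustration, not a decision about which algorithm I will use. The data is synthetic, and `fit_ridge` and `predict` are hypothetical helper names.

```python
import numpy as np

def fit_ridge(features, target, alpha=1.0):
    """Closed-form ridge regression on standardized features."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(target, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant columns
    Z = (X - mean) / std
    y_mean = y.mean()
    # Solve (Z^T Z + alpha * I) w = Z^T (y - y_mean) for the weights w.
    w = np.linalg.solve(Z.T @ Z + alpha * np.eye(Z.shape[1]), Z.T @ (y - y_mean))
    return mean, std, w, y_mean

def predict(features, model):
    """Predict the label (e.g. deaths per 1000) for one or more rows."""
    mean, std, w, y_mean = model
    X = np.atleast_2d(np.asarray(features, dtype=float))
    return ((X - mean) / std) @ w + y_mean

# Synthetic stand-in: 50 "counties", 3 features, label is an exact linear function.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5
model = fit_ridge(X_demo, y_demo, alpha=1e-8)
print(predict(X_demo[:2], model), y_demo[:2])
```

With 65 features and only a few dozen counties, a model like this can fit the training data almost perfectly and still predict poorly on new counties, which is exactly why this step has to wait for more data and should be checked on counties held out from the fit.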

2 Replies to “COVID-19 Special Update – Can Unsupervised Machine Learning Predict Outbreaks?”

  1. very cool Tod! I had thought about the data this way and it is great you have used census data for a cluster and factor analysis.
    You can easily pull larger data sets from the Census through the American Community Survey, which would obviously affect your factor analysis. My neighbor and I could then bring your cluster-factors into a GIS to look at other possible patterns, like regional climates. I would try using the Census MSA data pull function (areas of 50k or more) because it might be more interesting than counties, showing more of the density effect?

    1. Nice, thanks Angie… I’ll put some effort into pulling the census data through a MSA (?) pull… I chose counties largely because that’s the most granular data I have COVID data for, but I believe I can get it by zip code somewhere now. I can show you how to do this (it’s easy) if you have a use for the clustering data…
