Mechanics of Building a Correlation Matrix: In case this explanation is useful to anyone puzzling over these results, here is how I got the above correlations between various features and the rate of growth of COVID-19 cases. I built a large dataset using data from Johns Hopkins (COVID-19 data), the WHO, the World Bank, and a handful of other sources. In this dataset, each country in the world is a row, and each of the Features above (plus many more) is a column that spans all of the countries. Correlating every column with every other column then gives the correlation matrix. Those are the basic mechanics of putting together a large correlation matrix.
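For anyone who wants to see the mechanics in code, here is a minimal sketch of the rows-are-countries, columns-are-features layout. It is not my actual pipeline: the country labels, the numbers, and the column names `female_smoking_rate` and `tb_incidence` are made up for illustration, and the two small tables simply stand in for separate sources that get joined on country.

```python
import pandas as pd

# Hypothetical stand-ins for two separate sources, each keyed by country.
# All values are invented for illustration only.
cases = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "inst_rate_of_change": [0.30, 0.10, 0.20, 0.05],
})
health = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "female_smoking_rate": [25.0, 5.0, 15.0, 2.0],
    "tb_incidence": [10.0, 200.0, 40.0, 300.0],
})

# Join the sources so each country is one row and each feature is one column.
dataset = cases.merge(health, on="country", how="inner").set_index("country")

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = dataset.corr()
```

The result is a square table with one row and one column per feature, so reading down the `inst_rate_of_change` column gives each feature's correlation with the case growth rate.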
What does this tell us?: First off, the above table simply lists selected features (‘Female Smoking Rate’, etc.) and their correlations, computed using the Python Pandas correlation function. A value of 1.0 is perfect correlation. Because these are the correlations with the feature ‘Instantaneous Rate of Change’, you can see that the correlation of ‘inst_rate_of_change’ is 1.0: it is perfectly correlated with itself. I have eliminated many features with low correlation (meaning close to 0, not -1) just to make this more readable. I did this because a correlation close to zero means the feature likely carries little information about the target (Instantaneous Rate of Change of Confirmed Cases, i.e., today’s Case Growth Rate). However, if the number is between 0.2 and 0.8, I find from years of doing this that there’s enough dependence between the target and the feature to make the case that they are related in an interesting way. Statisticians like to say (probably too often), “Correlation does not Imply Causality”, which is true, but this does not mean that correlation is not valuable as the basis for hypothesis tests for causality. That’s what we’re trying to do here: find environmental factors that might be influencing the different Case Growth Rates across the world.
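The filtering step described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact code behind the table: the helper name `correlations_with_target`, the feature names other than `inst_rate_of_change`, and all of the numbers are hypothetical, and I am assuming the cutoff is applied to the absolute value of the correlation so that strongly negative features are kept.

```python
import pandas as pd

def correlations_with_target(df, target, low=0.2):
    """Return each column's Pearson correlation with `target`,
    dropping columns whose correlation is closer to zero than `low`."""
    corr_with_target = df.corr()[target]
    # Assumption: filter on magnitude, so values near -1 are kept.
    keep = corr_with_target.abs() >= low
    return corr_with_target[keep].sort_values(ascending=False)

# Invented data: one row per country, one column per feature.
features = pd.DataFrame({
    "inst_rate_of_change": [0.1, 0.2, 0.3, 0.4, 0.5],
    "female_smoking_rate": [2.0, 5.0, 15.0, 20.0, 25.0],
    "tb_incidence": [300.0, 200.0, 40.0, 30.0, 10.0],
    "random_noise": [3.0, 1.0, 2.0, 1.0, 3.0],
})

result = correlations_with_target(features, "inst_rate_of_change")
print(result)
```

In this toy data the target shows up with a correlation of 1.0 with itself, the smoking column comes out positive, the TB column comes out negative, and the noise column is dropped because its correlation with the target is essentially zero.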
Is there Anything New Here? Yes, the correlations continue to change as the Case Growth Rates change across the world. By definition, I’m correlating these factors with the current day’s instantaneous slope, so the correlations should continue to change. What we’ve been seeing lately is that as the slopes continue to increase across the world, the correlation of the Female Smoking Rate with the target continues to increase. I think this indicates that in the places with the steepest slopes (Italy, New York, Spain), women who smoke probably have a higher likelihood of contracting a measurable COVID-19 case. I use the word measurable intentionally here, because these rates are probably driven by countries that only measure cases where people have symptoms and require some sort of care. This probably makes it more like a correlation with symptomatic case rates. A subtle point, maybe. One other factor that continues to grow is the negative correlation between case growth and rates of Tuberculosis in a country. This tells us that countries with lots of TB cases have slower COVID-19 case growth rates. This was mildly puzzling to me until two days ago, when I learned of a study showing that a TB vaccine called BCG may have anti-COVID properties (I’m summarizing broadly. Here’s the link). So that’s pretty exciting to see: even this simplistic approach may have revealed something using Data Science that was not widely known.
Above is the correlation of the same factors with the Rate of Deaths from COVID-19. Note that some of the features that are highly correlated with the Rate of Contracting the Disease are less correlated with the Rate of Deaths from the Disease. This is probably not counter-intuitive. What might be counter-intuitive is that comorbidities like Diabetes rates in a country are negatively correlated with the COVID Death Rates. The best explanation I can offer is that it might take reframing the reference point. We’re aware that diabetes, high blood pressure, etc., are contributing strongly to the deaths of individuals who are infected with COVID-19. However, this study is about countries that have high rates of Diabetes, High Blood Pressure, or Air Pollution and the correlation of those factors with the Death Rate. Therefore, it is possible that a country with high rates of Diabetes, for instance, has fewer people who survive that disease long enough to be affected by COVID-19. Perhaps this is a sign that the advanced health care in some countries might be contributing to the numbers of deaths, largely because susceptible people are living longer in those countries? Or perhaps this is just measuring the fact that countries with high rates of diabetes or pollution have yet to be hit by COVID-19? Time will tell.