Correlation Heatmaps with Significance in Python
Pandas and Seaborn offer very quick ways to calculate correlations and show them in a heatmap, but those plots omit whether the correlations are statistically significant. Over the years I have collected bits and pieces of code like this that turn out to be quite useful, though having them scattered across a few dozen projects isn't very convenient when I actually need them. So I'll start adding some documentation and putting them here with the tag Code Nugget, so they can easily be found by myself and others.

Normally you can use corr_df = df.corr() to get a correlation matrix for the numerical columns in a Pandas data frame. That matrix can in turn be shown in a heatmap using sns.clustermap(corr_df, cmap="vlag", vmin=-1, vmax=1), leveraging Seaborn's clustermap. Easy, though the significance of those correlations isn't reported. To get those you can still rely on existing library functions, but a bit more effort is required.

    from sklearn.datasets import load_iris
    from scipy.stats import spearmanr
    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    from statsmodels.stats.multitest import multipletests

    iris_obj = load_iris()
    iris_df = pd.DataFrame(iris_obj.data, columns=iris_obj.feature_names)

    def get_correlations(df):
        df = df.dropna()._get_numeric_data()
        dfcols = pd.DataFrame(columns=df.columns)
        pvalues = dfcols.transpose().join(dfcols, …
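The excerpt stops partway through get_correlations, so what follows is not the original function. It is a minimal sketch of one way to get Spearman correlations together with multiple-testing-corrected p-values, using the same spearmanr and multipletests functions imported above. The helper name spearman_with_pvalues, the synthetic demo data, the Benjamini-Hochberg method ("fdr_bh") and the 0.05 threshold are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests


def spearman_with_pvalues(df, alpha=0.05, method="fdr_bh"):
    """Hypothetical helper: pairwise Spearman rho plus corrected p-values."""
    # Keep numeric columns only and drop rows with missing values.
    num = df.select_dtypes(include="number").dropna()
    cols = num.columns
    n = len(cols)

    # rho starts as the identity (each column correlates perfectly with itself);
    # p-values start as NaN so the diagonal is never flagged as significant.
    rho = pd.DataFrame(np.eye(n), index=cols, columns=cols)
    pval = pd.DataFrame(np.full((n, n), np.nan), index=cols, columns=cols)

    # Test each unordered pair of columns exactly once.
    pairs = list(combinations(cols, 2))
    raw_pvalues = []
    for a, b in pairs:
        r, p = spearmanr(num[a], num[b])
        rho.loc[a, b] = rho.loc[b, a] = r
        raw_pvalues.append(p)

    # Correct for multiple testing across the unique pairs (assumed: fdr_bh).
    _, adjusted, _, _ = multipletests(raw_pvalues, alpha=alpha, method=method)
    for (a, b), p in zip(pairs, adjusted):
        pval.loc[a, b] = pval.loc[b, a] = p

    return rho, pval


# Synthetic demo data (an assumption, not the iris data from the post).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
demo = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(scale=0.5, size=200),  # strongly related to x
    "z": rng.normal(size=200),                 # independent noise
})

rho, pval = spearman_with_pvalues(demo)

# Array of "*" markers for pairs with a corrected p-value below 0.05.
stars = np.where(pval < 0.05, "*", "")
```

The stars array can then be passed to Seaborn as annotations, for example sns.heatmap(rho, annot=stars, fmt="", cmap="vlag", vmin=-1, vmax=1), so that significant cells are marked on top of the colour scale. Correcting once over the unique pairs, rather than over the full symmetric matrix, avoids counting every test twice.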