Partial Correlation in Python
I ran a correlation matrix:
sns.pairplot(data.dropna())
corr = data.dropna().corr()
corr.style.background_gradient(cmap='coolwarm').set_precision(2)
It looks like advisory_pct is fairly (0.57) negatively correlated with all_brokerage_pct. To my understanding, I can say that we are fairly sure that "when an advisor has a low % of advisory in his portfolio, he has a high % of all brokerage in his portfolio".
However, this is a pairwise correlation, and we are not controlling for the effect of the rest of the possible variables.
I searched SO but could not find how to run a "partial correlation", where the correlation matrix can provide the correlation between every two variables while controlling for the rest of the variables. For that purpose, let's assume brokerage % + etf brokerage % + advisory % + all brokerage % = ~100% of the portfolio.
Is there such a function?
-- EDIT --
Running the data according to https://stats.stackexchange.com/questions/288273/partial-correlation-in-panda-dataframe-python:
import numpy as np
import pandas as pd

data_dict = {'x1': [1, 2, 3, 4, 5], 'x2': [2, 2, 3, 4, 2], 'x3': [10, 9, 5, 4, 9], 'y': [5.077, 32.330, 65.140, 47.270, 80.570]}
data = pd.DataFrame(data_dict, columns=['x1', 'x2', 'x3', 'y'])
partial_corr_array = data.values  # .as_matrix() is deprecated, and the frame is named data, not df
data_int = np.hstack((np.ones((partial_corr_array.shape[0], 1)), partial_corr_array))
print(data_int)
[[ 1. 1. 2. 10. 5.077]
[ 1. 2. 2. 9. 32.33 ]
[ 1. 3. 3. 5. 65.14 ]
[ 1. 4. 4. 4. 47.27 ]
[ 1. 5. 2. 9. 80.57 ]]
arr = np.round(partial_corr(partial_corr_array)[1:, 1:], decimals=2)
print(arr)
[[ 1. 0.99 0.99 1. ]
[ 0.99 1. -1. -0.99]
[ 0.99 -1. 1. -0.99]
[ 1. -0.99 -0.99 1. ]]
corr_df = pd.DataFrame(arr, columns = data.columns, index = data.columns)
print(corr_df)
x1 x2 x3 y
x1 1.00 0.99 0.99 1.00
x2 0.99 1.00 -1.00 -0.99
x3 0.99 -1.00 1.00 -0.99
y 1.00 -0.99 -0.99 1.00
These correlations don't make much sense. Using my real data, I get a very similar result where all the correlations round to -1.
AFAIK, there is no official implementation of partial correlation in scipy/numpy. As pointed out by @J. C. Rocamonde, the function from that stats site can be used to calculate the partial correlation.
I believe this is the original source:
https://gist.github.com/fabianp/9396204419c7b638d38f
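For reference, here is a minimal sketch (my paraphrase, not the gist verbatim) of what that partial_corr function does: for each pair of columns, regress both on all the remaining columns and take the Pearson correlation of the residuals:

import numpy as np
from scipy import stats

def partial_corr(C):
    """Residual-based partial correlation of each pair of columns in C,
    controlling for all remaining columns (sketch of the gist's approach)."""
    C = np.asarray(C)
    p = C.shape[1]
    P = np.zeros((p, p))
    for i in range(p):
        P[i, i] = 1.0
        for j in range(i + 1, p):
            idx = np.ones(p, dtype=bool)
            idx[i] = idx[j] = False            # control for every other column
            beta_i = np.linalg.lstsq(C[:, idx], C[:, i], rcond=None)[0]
            beta_j = np.linalg.lstsq(C[:, idx], C[:, j], rcond=None)[0]
            res_i = C[:, i] - C[:, idx] @ beta_i   # residuals of column i
            res_j = C[:, j] - C[:, idx] @ beta_j   # residuals of column j
            P[i, j] = P[j, i] = stats.pearsonr(res_i, res_j)[0]
    return P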
Note:
As discussed on the github page, you need to add a column of ones to include an intercept term in the fits if your data is not standardized (judging from your data, it is not).
If I remember correctly, it calculates the partial correlation by controlling for all other remaining variables in the matrix. If you only want to control for one variable, you can change idx to the index of that particular variable.
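For instance, here is a minimal sketch (my own, not from the gist) of the partial correlation between two variables controlling for a single covariate, using the same residual idea with an intercept included:

import numpy as np
from scipy import stats

def partial_corr_single(x, y, z):
    """Partial correlation of x and y controlling only for z,
    via residuals of regressions with an intercept."""
    Z = np.column_stack((np.ones_like(z, dtype=float), z))  # intercept + covariate
    res_x = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    res_y = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return stats.pearsonr(res_x, res_y)[0]

# e.g. partial_corr_single(data['x1'].values, data['y'].values, data['x3'].values)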
Edit 1 (How to add the column of ones + what to do with the df):
If you look at the link, they have already discussed how to add the ones. To illustrate how it works, I added another way with hstack, using the data given in the link:
# Add a column of ones on the left for the intercept, then drop its row/column from the result
data_int = np.hstack((np.ones((data.shape[0], 1)), data))
test1 = partial_corr(data_int)[1:, 1:]
print(test1)

# You can also add it on the right, as long as you select the correct coefficients
data_int_2 = np.hstack((data, np.ones((data.shape[0], 1))))
test2 = partial_corr(data_int_2)[:-1, :-1]
print(test2)

# Or standardize the data first, in which case no intercept column is needed
data_std = (data - data.mean(axis=0)) / data.std(axis=0)
test3 = partial_corr(data_std)
print(test3)
Output:
[[ 1. -0.54341003 -0.14076948]
[-0.54341003 1. -0.76207595]
[-0.14076948 -0.76207595 1. ]]
[[ 1. -0.54341003 -0.14076948]
[-0.54341003 1. -0.76207595]
[-0.14076948 -0.76207595 1. ]]
[[ 1. -0.54341003 -0.14076948]
[-0.54341003 1. -0.76207595]
[-0.14076948 -0.76207595 1. ]]
If you want to keep the columns, the easiest way is to extract the columns and put them back after the calculation:
# Assume that we have a DataFrame with columns x, y, z
data_as_df = pd.DataFrame(data, columns=['x', 'y', 'z'])
data_as_array = data_as_df.values
partial_corr_array = partial_corr(np.hstack((np.ones((data_as_array.shape[0], 1)), data_as_array)))[1:, 1:]
corr_df = pd.DataFrame(partial_corr_array, columns=data_as_df.columns)
print(corr_df)
Output:
x y z
0 1.000 -0.543 -0.141
1 -0.543 1.000 -0.762
2 -0.141 -0.762 1.000
Hope it helps! Let me know if anything is unclear!
Edit 2:
I think the problem is that there isn't a constant term in each of the fits... I rewrote the code with sklearn so that it's easier to add the intercept:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import linear_model

def calculate_partial_correlation(input_df):
    """
    Returns the sample linear partial correlation coefficients between pairs of variables,
    controlling for all other remaining variables.

    Parameters
    ----------
    input_df : array-like, shape (n, p)
        Array with the different variables. Each column is taken as a variable.

    Returns
    -------
    P : array-like, shape (p, p)
        P[i, j] contains the partial correlation of input_df[:, i] and input_df[:, j],
        controlling for all other remaining variables.
    """
    partial_corr_matrix = np.zeros((input_df.shape[1], input_df.shape[1]))
    for i, column1 in enumerate(input_df):
        for j, column2 in enumerate(input_df):
            control_variables = np.delete(np.arange(input_df.shape[1]), [i, j])
            if i == j:
                partial_corr_matrix[i, j] = 1
                continue
            data_control_variable = input_df.iloc[:, control_variables]
            data_column1 = input_df[column1].values
            data_column2 = input_df[column2].values
            fit1 = linear_model.LinearRegression(fit_intercept=True)
            fit2 = linear_model.LinearRegression(fit_intercept=True)
            fit1.fit(data_control_variable, data_column1)
            fit2.fit(data_control_variable, data_column2)
            residual1 = data_column1 - (np.dot(data_control_variable, fit1.coef_) + fit1.intercept_)
            residual2 = data_column2 - (np.dot(data_control_variable, fit2.coef_) + fit2.intercept_)
            partial_corr_matrix[i, j] = stats.pearsonr(residual1, residual2)[0]
    return pd.DataFrame(partial_corr_matrix, columns=input_df.columns, index=input_df.columns)
# Generating data in our minion world
test_sample = 10000
Math_score = np.random.randint(100, 600, size=test_sample) + 20 * np.random.random(size=test_sample)
Eng_score = np.random.randint(100, 600, size=test_sample) - 10 * Math_score + 20 * np.random.random(size=test_sample)
Phys_score = Math_score * 5 - Eng_score + np.random.randint(100, 600, size=test_sample) + 20 * np.random.random(size=test_sample)
Econ_score = np.random.randint(100, 200, size=test_sample) + 20 * np.random.random(size=test_sample)
Hist_score = Econ_score + 100 * np.random.random(size=test_sample)

minions_df = pd.DataFrame(np.vstack((Math_score, Eng_score, Phys_score, Econ_score, Hist_score)).T,
                          columns=['Math', 'Eng', 'Phys', 'Econ', 'Hist'])

calculate_partial_correlation(minions_df)
Output:
----  ----------  ----------  ------------  -----------  ------------
      Math        Eng         Phys          Econ         Hist
Math  1           -0.322462   0.436887      0.0104036    -0.0140536
Eng   -0.322462   1           -0.708277     0.00802087   -0.010939
Phys  0.436887    -0.708277   1             0.000340397  -0.000250916
Econ  0.0104036   0.00802087  0.000340397   1            0.721472
Hist  -0.0140536  -0.010939   -0.000250916  0.721472     1
----  ----------  ----------  ------------  -----------  ------------
Let me know if this doesn't work!
To compute the correlation between two columns of a pandas DataFrame whilst controlling for one or more covariates (i.e. other columns in the DataFrame), you can use the partial_corr function of the Pingouin package (disclaimer: I am the creator):
from pingouin import partial_corr
partial_corr(data=df, x='X', y='Y', covar=['covar1', 'covar2'], method='pearson')
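If I recall the Pingouin API correctly, importing the package also registers a pcorr() method on DataFrames that returns the full pairwise partial correlation matrix, each pair controlling for all the other columns, which is exactly what the question asks for:

import pandas as pd
import pingouin  # importing pingouin registers the .pcorr() DataFrame accessor

df = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                   'x2': [2, 2, 3, 4, 2],
                   'x3': [10, 9, 5, 4, 9],
                   'y':  [5.077, 32.330, 65.140, 47.270, 80.570]})
print(df.pcorr())  # p x p matrix of pairwise partial correlations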
Half a line of code (note: the inverse correlation matrix has to be normalized by its diagonal to give true partial correlations):

import numpy as np

X = np.random.normal(0, 1, (5, 5000))   # 5 variables stored as rows
prec = np.linalg.inv(np.corrcoef(X))    # precision (inverse correlation) matrix
Par_corr = -prec / np.sqrt(np.outer(np.diag(prec), np.diag(prec)))  # 5x5 partial correlation matrix
np.fill_diagonal(Par_corr, 1.0)
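As a sanity check (my own addition, not part of the original answer), the normalized inverse correlation matrix should match the regression-residual definition of partial correlation:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 1000))  # 4 variables stored as rows, same convention as above

prec = np.linalg.inv(np.corrcoef(X))
pc = -prec / np.sqrt(np.outer(np.diag(prec), np.diag(prec)))
np.fill_diagonal(pc, 1.0)

# Residual-based check for the pair (0, 1), controlling for variables 2 and 3
Z = np.column_stack((np.ones(X.shape[1]), X[2], X[3]))
r0 = X[0] - Z @ np.linalg.lstsq(Z, X[0], rcond=None)[0]
r1 = X[1] - Z @ np.linalg.lstsq(Z, X[1], rcond=None)[0]
print(pc[0, 1], np.corrcoef(r0, r1)[0, 1])  # the two numbers should agree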
You can try this:
import numpy as np
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr

feature_num = df.shape[1]
feature_name = df.columns
partial_corr_matrix = np.zeros((feature_num, feature_num))
for i in range(feature_num):
    x1 = df.iloc[:, i]
    for j in range(feature_num):
        if i == j:
            partial_corr_matrix[i, j] = 1
        elif j < i:
            # The matrix is symmetric, so reuse the value already computed
            partial_corr_matrix[i, j] = partial_corr_matrix[j, i]
        else:
            # Regress both columns on all the remaining columns, then correlate the residuals
            x2 = df.iloc[:, j]
            df_control = df.drop(columns=[feature_name[i], feature_name[j]])
            L = LinearRegression().fit(df_control, x1)
            x1_prime = x1 - L.predict(df_control)
            L = LinearRegression().fit(df_control, x2)
            x2_prime = x2 - L.predict(df_control)
            partial_corr_matrix[i, j] = pearsonr(x1_prime, x2_prime)[0]
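As a small follow-up (my own addition, continuing the snippet above), you can put the matrix back into a labeled DataFrame so rows and columns keep the original feature names:

import pandas as pd

partial_corr_df = pd.DataFrame(partial_corr_matrix, index=feature_name, columns=feature_name)
print(partial_corr_df)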