Estimate correlation in Python

Question

I have a dataset with labels and usernames:

Labels   Usernames
1         Londonderry
1         Londoncalling
1         Steveonder43
0         Maryclare_re
1         Patent107391
0         Anonymous
1         _24londonqr
...

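I need to show that usernames containing the word London are correlated with label 1. To do this, I created a second label that marks where the word London appears: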

for idx, username in df['Usernames']:
    if 'London' in username:
        df['London'].iloc[idx] = 1
    else:
        df['London'].iloc[idx] = 0

I then compared the two binary variables using the Pearson correlation coefficient:

import scipy.stats.pearsonr as rho
corr = rho(df['labels'], df['London'])

However, it doesn't work. Am I missing something in the steps above?

Answer 1

Your DataFrame has a column named Labels, but you passed labels. I also simplified the code using str.contains:

df['London'] = df['Usernames'].str.contains('London').astype(int)
from scipy import stats
stats.pearsonr(df['Labels'], df['London'])
Out[12]: (0.4, 0.37393392381774704)
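
For reference, here is a minimal self-contained sketch of the same approach. It assumes the seven rows shown in the question are the whole data set (the real one is longer, so its numbers will differ). Two other things in the question's code would also fail: iterating over df['Usernames'] yields only the values, not (index, value) pairs, and pearsonr is a function rather than a module, so it has to be imported with from scipy.stats import pearsonr. The case-insensitive variant at the end is an assumption about intent, not part of the original question; the column name London_ci is made up for illustration.

```python
import pandas as pd
from scipy import stats

# Sample rows copied from the question; the real data set is longer.
df = pd.DataFrame({
    "Labels": [1, 1, 1, 0, 1, 0, 1],
    "Usernames": ["Londonderry", "Londoncalling", "Steveonder43",
                  "Maryclare_re", "Patent107391", "Anonymous", "_24londonqr"],
})

# Case-sensitive flag: 1 if the username contains "London", else 0.
df["London"] = df["Usernames"].str.contains("London").astype(int)
r, p = stats.pearsonr(df["Labels"], df["London"])

# Case-insensitive variant (assumption): also flags "_24londonqr".
df["London_ci"] = df["Usernames"].str.contains("london", case=False).astype(int)
r_ci, p_ci = stats.pearsonr(df["Labels"], df["London_ci"])

print(r, p)
print(r_ci, p_ci)
```

On these seven rows the case-sensitive version reproduces the coefficient and p-value shown above. With so few rows the p-value is far from any usual significance threshold, so the full data set is needed before drawing a conclusion.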

python

scipy

pandas

pearson-correlation