Unable to find correlation between two columns of different pandas datasets
I have a dataset that is basically a list of lists:
data = [[(datetime.datetime(2018, 12, 6, 10, 0), Decimal('7.0000000000000000')), (datetime.datetime(2018, 12, 6, 11, 0), Decimal('2.0000000000000000')), (datetime.datetime(2018, 12, 6, 12, 0), Decimal('43.6666666666666667')), (datetime.datetime(2018, 12, 6, 14, 0), Decimal('8.0000000000000000')), (datetime.datetime(2018, 12, 7, 9, 0), Decimal('12.0000000000000000')), (datetime.datetime(2018, 12, 7, 10, 0), Decimal('2.0000000000000000')), (datetime.datetime(2018, 12, 7, 11, 0), Decimal('2.0000000000000000')), (datetime.datetime(2018, 12, 7, 17, 0), Decimal('2.0000000000000000'))], [(datetime.datetime(2018, 12, 6, 10, 0), 28.5), (datetime.datetime(2018, 12, 6, 11, 0), 12.75), (datetime.datetime(2018, 12, 6, 12, 0), 12.15), (datetime.datetime(2018, 12, 6, 14, 0), 12.75), (datetime.datetime(2018, 12, 7, 9, 0), 12.75), (datetime.datetime(2018, 12, 7, 10, 0), 12.75), (datetime.datetime(2018, 12, 7, 11, 0), 12.75), (datetime.datetime(2018, 12, 7, 17, 0), 12.75)]]
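It contains two lists, each with a date column and a metric column. I need to extract the metric values from each list and find the correlation between them.
Note: both lists cover the same dates.
So first I load each list into pandas and set the date as the index.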
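import pandas as pd

# Split the nested list into its two series
data1 = data[0]
data2 = data[1]
# First series: parse the dates and use them as the index
df1 = pd.DataFrame(data1)
df1[0] = pd.to_datetime(df1[0], errors='coerce')
df1.set_index(0, inplace=True)
# Second series: same treatment
df2 = pd.DataFrame(data2)
df2[0] = pd.to_datetime(df2[0], errors='coerce')
df2.set_index(0, inplace=True)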
Now I merge the two dataframes (they share the same dates).
df = pd.merge(df1,df2, how='inner', left_index=True, right_index=True)
Now my dataframe looks like this:
                                     1_x    1_y
0
2018-12-06 10:00:00   7.0000000000000000  28.50
2018-12-06 11:00:00   2.0000000000000000  12.75
2018-12-06 12:00:00  43.6666666666666667  12.15
2018-12-06 14:00:00   8.0000000000000000  12.75
2018-12-07 09:00:00  12.0000000000000000  12.75
2018-12-07 10:00:00   2.0000000000000000  12.75
2018-12-07 11:00:00   2.0000000000000000  12.75
2018-12-07 17:00:00   2.0000000000000000  12.75
Now I need to find the correlation between the two columns 1_x and 1_y, so I did this:
df.iloc[:,0].corr(df.iloc[:,1])
But I get the following error:
Traceback (most recent call last):
  File "/home/souvik/Music/UI_Server2/test61.py", line 71, in <module>
    print(df.iloc[:,0].corr(df.iloc[:,1]))
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/pandas/core/series.py", line 1911, in corr
    min_periods=min_periods)
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/pandas/core/nanops.py", line 77, in _f
    return f(*args, **kwargs)
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/pandas/core/nanops.py", line 762, in nancorr
    return f(a, b)
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/pandas/core/nanops.py", line 770, in _pearson
    return np.corrcoef(a, b)[0, 1]
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/numpy/lib/function_base.py", line 2392, in corrcoef
    c = cov(x, y, rowvar)
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/numpy/lib/function_base.py", line 2302, in cov
    avg, w_sum = average(X, axis=1, weights=w, returned=True)
  File "/home/souvik/django_test/webdev/lib/python3.5/site-packages/numpy/lib/function_base.py", line 391, in average
    if scl.shape != avg.shape:
AttributeError: 'float' object has no attribute 'shape'
I am not sure what is going on. The examples I have seen online use df['A'].corr(df['B']) to get the correlation between A and B. So what am I doing wrong?
Your column 1_x has dtype=object, as you can see here:
df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8 entries, 2018-12-06 10:00:00 to 2018-12-07 17:00:00
Data columns (total 2 columns):
1_x 8 non-null object
1_y 8 non-null float64
dtypes: float64(1), object(1)
memory usage: 512.0+ bytes
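The object dtype most likely comes from the Decimal values in the first list, since pandas stores them as plain Python objects rather than as numbers. A minimal sketch of that behaviour follows; the sample values and variable names are made up for illustration and are not from the question.

```python
from decimal import Decimal

import pandas as pd

# Hypothetical sample values, loosely mirroring the two metric columns.
decimals = pd.Series([Decimal("7.0"), Decimal("2.0"), Decimal("43.5")])
floats = pd.Series([28.5, 12.75, 12.15])

# Decimal instances are kept as generic Python objects, not a numeric dtype.
decimal_dtype = decimals.dtype
float_dtype = floats.dtype

# Casting converts each Decimal to a float and yields a numeric column.
converted = decimals.astype(float)

print(decimal_dtype, float_dtype, converted.dtype)
```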
So convert your column 1_x to float.
Use:
df['1_x'] = df['1_x'].astype(float)
df.iloc[:,0].corr(df.iloc[:,1])
# -0.11679873531647807
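For reference, here is a self-contained sketch of the whole flow with the cast applied before the correlation is computed. It uses the data from the question, but the helper to_frame and the column names date, metric_x and metric_y are my own choices for illustration, and it joins on the index instead of calling pd.merge.

```python
import datetime
from decimal import Decimal

import pandas as pd

# Timestamps and metric values copied from the question.
times = [
    datetime.datetime(2018, 12, 6, 10), datetime.datetime(2018, 12, 6, 11),
    datetime.datetime(2018, 12, 6, 12), datetime.datetime(2018, 12, 6, 14),
    datetime.datetime(2018, 12, 7, 9), datetime.datetime(2018, 12, 7, 10),
    datetime.datetime(2018, 12, 7, 11), datetime.datetime(2018, 12, 7, 17),
]
first = [Decimal(v) for v in ("7.0000000000000000", "2.0000000000000000",
                              "43.6666666666666667", "8.0000000000000000",
                              "12.0000000000000000", "2.0000000000000000",
                              "2.0000000000000000", "2.0000000000000000")]
second = [28.5, 12.75, 12.15, 12.75, 12.75, 12.75, 12.75, 12.75]
data = [list(zip(times, first)), list(zip(times, second))]


def to_frame(rows, name):
    """Hypothetical helper: build a date-indexed frame with a float metric."""
    frame = pd.DataFrame(rows, columns=["date", name])
    # Cast up front so Decimal values never reach corr() as dtype=object.
    frame[name] = frame[name].astype(float)
    return frame.set_index("date")


# Inner join on the shared date index, then compute the Pearson correlation.
df = to_frame(data[0], "metric_x").join(to_frame(data[1], "metric_y"), how="inner")
r = df["metric_x"].corr(df["metric_y"])
print(round(r, 4))  # -0.1168
```

Casting each metric when the frame is built keeps the later steps free of dtype surprises, whichever list happens to carry Decimal values.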