Construct data frame using groupby

Question

My data frame looks like this:

                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729

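I want to reshape it into 'target', 'source', and 'weights' columns, where 'target' and 'source' are both 'id' values and 'weights' counts the number of days on which 'target' and 'source' both had a price change. So it would look like: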

target  source  weights
60059   60060   2
60059   60076   1   
60060   60076   1

My goal is to build a networkx graph from this data frame.

I tried using groupby

df.groupby(['date','id'])['id'].unique().value_counts()
df.groupby(['date','id'])['id'].count()

and for loops (which were terrible).

I feel like I'm missing one small step with groupby, but I can't tell what it is.

Thanks for your help.

Answer 1

First use pivot_table to get True wherever an id has a pct_change on a given date

#first pivot to get True if any value of id for a date
df_ = df.pivot_table(index='id', columns='date', values='pct_change', 
                     aggfunc=any, fill_value=False)
print(df_)
date  2010-07-28 2010-07-29
id                         
60059       True       True
60060       True       True
60076       True      False

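Then you can use combinations from itertools to create all possible pairs of ids, use them to select rows, and apply the & operator to see where both are True on the same date. Summing along the columns gives the weights column. Assign this column to a data frame created from the two lists of the combinations.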

# get all combinations of ids
from itertools import combinations
a, b = map(list, zip(*combinations(df_.index, 2)))

res = (pd.DataFrame({'target':a, 'source':b})
         .assign(weights=(df_.loc[a].to_numpy()
                          &df_.loc[b].to_numpy()
                         ).sum(axis=1))
      )
print(res)
   target  source  weights
0   60059   60060        2
1   60059   60076        1
2   60060   60076        1

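Note: don't forget to change index='id' in pivot_table to the name of your categorical column, otherwise your computer will most likely be unable to handle the following operations and will crash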
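
The same counting can also be expressed as a matrix product instead of an explicit & over selected rows. This is only a sketch of an alternative, not the method used above: it assumes the sample data from the question, builds a 0/1 presence matrix with pd.crosstab, and reads the pair counts from the upper triangle of the product. The names presence and co are made up for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["2010-07-28", "2010-07-28", "2010-07-29", "2010-07-29", "2010-07-28"],
    "id": [60059, 60060, 60059, 60060, 60076],
    "pct_change": [0.210210, 0.592000, 0.300273, 0.481982, 0.400729],
})

# Presence matrix: rows are ids, columns are dates, 1 where the id has a row on that date
presence = pd.crosstab(df["id"], df["date"]).clip(upper=1)

# Entry (i, j) of the product is the number of dates shared by id i and id j
m = presence.to_numpy()
co = m @ m.T

# Keep only the upper triangle (i < j) so each unordered pair appears once
i, j = np.triu_indices(len(presence), k=1)
res = pd.DataFrame({
    "target": presence.index[i],
    "source": presence.index[j],
    "weights": co[i, j],
})
print(res)
```

The full ids x ids matrix is still materialized here, so with a very large number of ids memory remains a concern.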

Answer 2

Try this

import pandas as pd, numpy as np

ids = df.id.unique()
WeightDf = pd.DataFrame(index=ids, columns=ids)
WeightDf.loc[:, :] = 0

def weigh(ID):
    IdDates =  set(df.loc[df.id==ID].date.to_list())
    for i in ids:
        WeightDf.at[ID, i] = len(set.intersection(set(df.loc[df.id==i].date.to_list()), IdDates))
        
pd.Series(ids).apply(weigh)
print(WeightDf)

import itertools as itt
# Collect one row per unordered pair, then build the data frame once
# (concatenating onto an empty frame inside a loop is slow)
rows = [{'Id1': i1, 'Id2': i2, 'Weight': WeightDf.loc[i1, i2]}
        for i1, i2 in itt.combinations(ids, 2)]
result = pd.DataFrame(rows, columns=['Id1', 'Id2', 'Weight'])

print(result)
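
The idea in the code above is that the weight of a pair is the size of the intersection of the two ids' date sets. A more compact sketch of that same idea, assuming the sample data from the question, skips the intermediate square matrix. The name dates_by_id is made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["2010-07-28", "2010-07-28", "2010-07-29", "2010-07-29", "2010-07-28"],
    "id": [60059, 60060, 60059, 60060, 60076],
    "pct_change": [0.210210, 0.592000, 0.300273, 0.481982, 0.400729],
})

# Map each id to the set of dates on which it appears
dates_by_id = df.groupby("id")["date"].apply(set)

# For every unordered pair of ids, count the dates they have in common
rows = [
    {"Id1": a, "Id2": b, "Weight": len(dates_by_id.loc[a] & dates_by_id.loc[b])}
    for a, b in combinations(dates_by_id.index, 2)
]
result = pd.DataFrame(rows, columns=["Id1", "Id2", "Weight"])
print(result)
```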

Answer 3

I have seen many variations of this use case: generating combinations

import io
import itertools

import pandas as pd

df = pd.read_csv(io.StringIO("""                date    id     pct_change
12355258    2010-07-28  60059   0.210210
12355265    2010-07-28  60060   0.592000
12355282    2010-07-29  60059   0.300273
12355307    2010-07-29  60060   0.481982
12355330    2010-07-28  60076   0.400729"""), sep=r"\s+")

# generate combinations of two... edge case when a group has only one member
# tuple of itself to itself
dfx = (df.groupby('date').agg({"id": lambda s: list(itertools.combinations(list(s), 2))
                               if len(list(s))>1 else [tuple(list(s)*2)]})
    .explode("id")
     .groupby("id").agg({"id":"count"})
     .rename(columns={"id":"weights"})
     .reset_index()
     .assign(target=lambda dfa: dfa["id"].apply(lambda s: s[0]),
           source=lambda dfa: dfa["id"].apply(lambda s: s[1]))
     .drop(columns="id")
)

print(dfx.to_string(index=False))

Output

 weights  target  source
       2   60059   60060
       1   60059   60076
       1   60060   60076
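
Another variation on generating the pairs is a self-join on date. This is a sketch of a different technique from the one above, assuming the sample data from the question. Keeping only rows where the first id is smaller than the second removes self-pairs and mirrored duplicates, so no special case is needed for dates with a single id. The suffix names are chosen for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "date": ["2010-07-28", "2010-07-28", "2010-07-29", "2010-07-29", "2010-07-28"],
    "id": [60059, 60060, 60059, 60060, 60076],
    "pct_change": [0.210210, 0.592000, 0.300273, 0.481982, 0.400729],
})

# One row per (date, id) so repeated rows cannot inflate the counts
daily = df[["date", "id"]].drop_duplicates()

# Pair every id with every other id seen on the same date
pairs = daily.merge(daily, on="date", suffixes=("_target", "_source"))

# Keep each unordered pair once and drop self-pairs
pairs = pairs[pairs["id_target"] < pairs["id_source"]]

# Count the dates per pair
dfx = (pairs.groupby(["id_target", "id_source"]).size()
       .reset_index(name="weights")
       .rename(columns={"id_target": "target", "id_source": "source"}))
print(dfx.to_string(index=False))
```

The merge produces one row per ordered pair per date before filtering, so it can grow large when many ids share a date.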

Answer 4

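I ended up finding a faster answer to my own question, one that works for a large number of IDs. It is closer to the groupby + value_counts approach I had tried before.

Here is the code, for the convenience of future readers: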

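from itertools import combinations

import pandas as pd

def combine(batch):
    """Combine all products within one batch into pairs"""
    # Sort the unique ids so a given pair always comes out in the same order;
    # plain set iteration order is not guaranteed to be consistent across dates
    return pd.Series(list(combinations(sorted(set(batch)), 2)))

# Count on how many dates each pair occurs
edges = df.groupby('date')['id'].apply(combine).value_counts()

c = ['source', 'target']
L = edges.index.values.tolist()
# Name the count column explicitly; its default name differs between pandas versions
edges = pd.DataFrame(L, columns=c).join(edges.reset_index(drop=True).rename('weights'))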
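
For the networkx graph mentioned in the question, an edge list of this shape can be passed to nx.from_pandas_edgelist. The sketch below assumes a data frame with 'source', 'target', and 'weights' columns like the one the question asks for. The values are typed in by hand here to keep the example self-contained.

```python
import networkx as nx
import pandas as pd

# Hand-written edge list matching the expected result in the question
edges = pd.DataFrame({
    "source": [60059, 60059, 60060],
    "target": [60060, 60076, 60076],
    "weights": [2, 1, 1],
})

# Build an undirected graph; each edge keeps its 'weights' value as an attribute
G = nx.from_pandas_edgelist(edges, source="source", target="target",
                            edge_attr="weights")

print(G.number_of_nodes(), G.number_of_edges())
print(G[60059][60060]["weights"])
```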

Tags: python, graph, networkx, dataframe, pandas