How to concatenate two dataframes with the same structure in pandas, removing rows repeated by id?
In Python 3 with pandas, I have two dataframes with the same structure:
df_posts_final_1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32669 entries, 0 to 32668
Data columns (total 12 columns):
post_id 32479 non-null object
text 31632 non-null object
post_text 30826 non-null object
shared_text 3894 non-null object
time 32616 non-null object
image 24585 non-null object
likes 32669 non-null object
comments 32669 non-null object
shares 32669 non-null object
post_url 26157 non-null object
link 4343 non-null object
cpf 32669 non-null object
dtypes: object(12)
memory usage: 3.0+ MB
df_posts_final_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 33883 entries, 0 to 33882
Data columns (total 12 columns):
post_id 33698 non-null object
text 32755 non-null object
post_text 31901 non-null object
shared_text 3986 non-null object
time 33829 non-null object
image 25570 non-null object
likes 33883 non-null object
comments 33883 non-null object
shares 33883 non-null object
post_url 27286 non-null object
link 4446 non-null object
cpf 33883 non-null object
dtypes: object(12)
memory usage: 3.1+ MB
I want to combine them, which I can do like this:
import pandas as pd
frames = [df_posts_final_1, df_posts_final_2]
result = pd.concat(frames)
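However, the "post_id" column holds unique identification codes, so an id "X" that already appears in df_posts_final_1 should not appear twice in the final result dataframe.
For example, if the code "FLK1989" appears in both df_posts_final_1 and df_posts_final_2, I want to keep only the last record, the one from df_posts_final_2.
Does anyone know the right strategy for doing this?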
Amend your code by adding groupby + tail:
frames = [df_posts_final_1, df_posts_final_2]
result = pd.concat(frames).groupby('post_id').tail(1)
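To see what groupby + tail does here, a minimal sketch on toy data may help. The two small frames and their values are made up for illustration; only the "post_id" and "likes" column names and the "FLK1989" code come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for df_posts_final_1 and df_posts_final_2.
df_a = pd.DataFrame({"post_id": ["FLK1989", "AAA0001"], "likes": ["10", "3"]})
df_b = pd.DataFrame({"post_id": ["FLK1989", "BBB0002"], "likes": ["12", "7"]})

# Stack the frames; rows from df_b come after rows from df_a.
combined = pd.concat([df_a, df_b])

# tail(1) keeps the last row of each post_id group, so for a repeated
# id the row from the second frame is the one that survives.
latest = combined.groupby("post_id").tail(1)
print(latest)
```

Because the frame that should win is listed last in the concat, its row is the one tail(1) picks for any id present in both.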
Or we can use drop_duplicates:
# Order matters: drop_duplicates keeps the first occurrence by default,
# so df_posts_final_2 must come first.
frames = [df_posts_final_2, df_posts_final_1]
result = pd.concat(frames).drop_duplicates('post_id')
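Try using:
result = pd.concat(frames).drop_duplicates(subset='post_id', keep='last')
With frames = [df_posts_final_1, df_posts_final_2], the keep='last' argument keeps only the last occurrence of each post_id, that is, the row from df_posts_final_2, as you want.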
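One thing to watch for: the info() output in the question shows that some post_id values are null, and drop_duplicates treats null keys as equal to each other, so all rows without an id would collapse into one. The sketch below assumes that rows with a missing post_id should all be kept, which the question does not state. The helper name concat_keep_last and the toy data are hypothetical.

```python
import pandas as pd

def concat_keep_last(first, second, key="post_id"):
    """Concatenate two frames, keeping the last row per non-null key.

    Assumption: rows whose key is null are kept untouched.
    """
    combined = pd.concat([first, second], ignore_index=True)
    has_key = combined[key].notna()
    # De-duplicate only the rows that actually carry an id.
    deduped = combined[has_key].drop_duplicates(subset=key, keep="last")
    # Put the id-less rows back and restore the concatenation order.
    return pd.concat([deduped, combined[~has_key]]).sort_index()

# Hypothetical stand-ins for df_posts_final_1 and df_posts_final_2.
df_a = pd.DataFrame({"post_id": ["FLK1989", None, "AAA0001"],
                     "likes": ["10", "1", "3"]})
df_b = pd.DataFrame({"post_id": ["FLK1989", None],
                     "likes": ["12", "2"]})

result = concat_keep_last(df_a, df_b)
print(result)
```

If rows without a post_id are not wanted at all, dropping them first with dropna(subset=['post_id']) and then applying the plain drop_duplicates call would be the simpler route.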