How to collapse/compress/reduce string columns in pandas

Question

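Essentially, what I am trying to do is join Table_A to Table_B on a key, so that I can do a lookup in Table_B and pull the column records for the names present in Table_A.

Table_B can be thought of as the master name table, which stores various attributes about each name. Table_A represents incoming data that carries name information.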

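There are two columns that represent a name: a column named 'raw_name' and a column named 'real_name'. The 'raw_name' is the real_name with a string of the form "code_" in front of it, i.e.

raw_name = CE993_VincentHanna
real_name = VincentHanna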
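The naming convention above can be sketched in pandas roughly like this. It assumes the code prefix never contains an underscore itself, so splitting once on the first "_" is safe; the variable names are only illustrative.

```python
import pandas as pd

# Illustrative sample; assumes the code prefix itself never contains "_".
raw = pd.Series(["CE993_VincentHanna", "AW103_Waingro"], name="raw_name")

# Split once on the first underscore and keep the part after it.
real = raw.str.split("_", n=1).str[1].rename("real_name")

names = pd.concat([raw, real], axis=1)
print(names)
```

If a prefix could contain an underscore, this rule would need to change (for example, splitting on the last underscore instead).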

Key = real_name, which exists in both Table_A and Table_B.

Please see the mySQL tables and query here: http://sqlfiddle.com/#!9/65e13/1

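For all real_names in Table_A that do not exist in Table_B, I want to store the raw_name/real_name pairs in an object so that I can send an alert to the data-entry staff for manual insertion.

For all real_names in Table_A that do exist in Table_B, we already know the name, so the new raw_name associated with that real_name can be added to our master Table_B.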

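In mySQL this is easy to do, as you can see in my sqlfiddle example. I join on real_name and compress/collapse the result with GROUP BY a.real_name, since I don't care whether Table_B holds multiple records for the same real_name.

I just want to pull the attributes (stats1, stats2, stats3) so that I can assign them to the newly discovered raw_name.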
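One way to mirror the SQL join plus GROUP BY in pandas is to collapse Table_B to one row per real_name before merging, so the left join cannot multiply rows. Below is a minimal sketch on a small sample in the shape of the tables described above. It assumes the stats values are identical across rows that share a real_name; table_a, table_b and lookup are illustrative names.

```python
import pandas as pd

table_a = pd.DataFrame({
    "raw_name": ["AW103_Waingro", "CE993_VincentHanna"],
    "real_name": ["Waingro", "VincentHanna"],
})
table_b = pd.DataFrame({
    "raw_name": ["SME893_VincentHanna", "TVA405_VincentHanna"],
    "real_name": ["VincentHanna", "VincentHanna"],
    "stats1": ["meh1", "meh1"],
    "stats2": ["meh2", "meh2"],
    "stats3": ["meh3", "meh3"],
})

# Keep one row of attributes per real_name; the master raw_name is not needed.
lookup = table_b.drop(columns="raw_name").drop_duplicates("real_name")

# Left join: every incoming row survives, unknown names get NaN stats.
collapsed = table_a.merge(lookup, how="left", on="real_name")
print(collapsed)
```

If the stats could differ between rows with the same real_name, drop_duplicates would silently keep the first one, which may or may not be what is wanted.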

From the mySQL query result I can separate out the NULL records to send for manual data entry, and automatically insert the remaining records into Table_B.

Now I am trying to do the same thing in Pandas, but I am stuck at the point of grouping by real_name.

    import pandas as pd

    # Table_A: incoming names
    e = {
        'raw_name': pd.Series(['AW103_Waingro', 'CE993_VincentHanna', 'EES43_NeilMcCauley',
                               'SME16_ChrisShiherlis', 'MEC14_MichaelCheritto',
                               'OTP23_RogerVanZant', 'MDU232_AlanMarciano']),
        'real_name': pd.Series(['Waingro', 'VincentHanna', 'NeilMcCauley', 'ChrisShiherlis',
                                'MichaelCheritto', 'RogerVanZant', 'AlanMarciano']),
    }

    # Table_B: master name table with attributes
    f = {
        'raw_name': pd.Series(['SME893_VincentHanna', 'TVA405_VincentHanna', 'MET783_NeilMcCauley',
                               'CE321_NeilMcCauley', 'CIN453_NeilMcCauley', 'NIPS16_ChrisShiherlis',
                               'ALTW12_MichaelCheritto', 'NSP42_MichaelCheritto',
                               'CONS23_RogerVanZant', 'WAUE34_RogerVanZant']),
        'real_name': pd.Series(['VincentHanna', 'VincentHanna', 'NeilMcCauley', 'NeilMcCauley',
                                'NeilMcCauley', 'ChrisShiherlis', 'MichaelCheritto',
                                'MichaelCheritto', 'RogerVanZant', 'RogerVanZant']),
        'stats1': pd.Series(['meh1', 'meh1', 'yo1', 'yo1', 'yo1', 'hello1', 'bye1', 'bye1',
                             'namaste1', 'namaste1']),
        'stats2': pd.Series(['meh2', 'meh2', 'yo2', 'yo2', 'yo2', 'hello2', 'bye2', 'bye2',
                             'namaste2', 'namaste2']),
        'stats3': pd.Series(['meh3', 'meh3', 'yo3', 'yo3', 'yo3', 'hello3', 'bye3', 'bye3',
                             'namaste3', 'namaste3']),
    }

    df_e = pd.DataFrame(e)
    df_f = pd.DataFrame(f)

    # Left join on real_name; names missing from Table_B get NaN stats
    df_new = pd.merge(df_e, df_f, how='left', on='real_name', suffixes=('_left', '_right'))
    df_new_grouped = df_new.groupby('raw_name_left')

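Now, how do I compress/collapse the groups in df_new_grouped on real_name, the way I did in mySQL?

Once I have an object with the collapsed results, I can slice the dataframe to report the real_names we have no record of (NULL values) and the ones we already know, for which we can store the newly discovered raw_name.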

Answer 1

You can drop duplicates based on the column raw_name_left, and also remove the raw_name_right column using drop:

In [99]: df_new.drop_duplicates('raw_name_left').drop(columns='raw_name_right')
Out[99]:
            raw_name_left        real_name    stats1    stats2    stats3
0           AW103_Waingro          Waingro       NaN       NaN       NaN
1      CE993_VincentHanna     VincentHanna      meh1      meh2      meh3
3      EES43_NeilMcCauley     NeilMcCauley       yo1       yo2       yo3
6    SME16_ChrisShiherlis   ChrisShiherlis    hello1    hello2    hello3
7   MEC14_MichaelCheritto  MichaelCheritto      bye1      bye2      bye3
9      OTP23_RogerVanZant     RogerVanZant  namaste1  namaste2  namaste3
11    MDU232_AlanMarciano     AlanMarciano       NaN       NaN       NaN
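
From a collapsed frame like the one above, the rows can then be separated by whether the stats columns are missing. This is only a sketch: it assumes a missing stats1 means the real_name has no record in Table_B, the frame is a small hand-built stand-in for the output, and collapsed, unknown and known are illustrative names.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the collapsed result.
collapsed = pd.DataFrame({
    "raw_name_left": ["AW103_Waingro", "CE993_VincentHanna", "MDU232_AlanMarciano"],
    "real_name": ["Waingro", "VincentHanna", "AlanMarciano"],
    "stats1": [None, "meh1", None],
    "stats2": [None, "meh2", None],
    "stats3": [None, "meh3", None],
})

# Assumption: a missing stats1 means the name was not found in Table_B.
missing = collapsed["stats1"].isna()

# Names to send for manual data entry.
unknown = collapsed.loc[missing, ["raw_name_left", "real_name"]]

# Names already known, ready to be added to the master table.
known = collapsed.loc[~missing]

print(unknown)
print(known)
```

Passing indicator=True to the merge is another option, since it adds a _merge column saying whether each row matched, which avoids relying on a stats column being non-null.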

Answer 2

For the sake of thoroughness, this can also be done with groupby, which I found on Wes McKinney's blog, although drop_duplicates is cleaner and more efficient.

http://wesmckinney.com/blog/filtering-out-duplicate-dataframe-rows/

>index = [gp_keys[0] for gp_keys in df_new_grouped.groups.values()]
>unique_df = df_new.reindex(index)
>unique_df
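
A self-contained variant of the snippet above, run on a small stand-in for df_new. It sorts the first labels so the rows keep their original order and selects with .loc rather than reindex; both are choices made for this sketch, not part of the blog post's version.

```python
import pandas as pd

# Small stand-in for the merged frame, with one duplicated raw_name_left.
df_new = pd.DataFrame({
    "raw_name_left": ["CE993_VincentHanna", "CE993_VincentHanna", "AW103_Waingro"],
    "real_name": ["VincentHanna", "VincentHanna", "Waingro"],
    "stats1": ["meh1", "meh1", None],
})

grouped = df_new.groupby("raw_name_left")

# .groups maps each key to the index labels of its rows; take the first label.
first_labels = sorted(labels[0] for labels in grouped.groups.values())

# Select one row per group, in original row order.
unique_df = df_new.loc[first_labels]
print(unique_df)
```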

python

group-by

pandas