Remove duplicates within a string, but for the entire dataframe

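I want to achieve something similar to what is described in another post, but efficiently and for the entire dataframe.

My data looks like this: it is a pandas dataframe with many columns. The cells hold comma-separated strings containing many duplicates, and I want to remove all duplicates within each individual string.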

+--------------------+---------+---------------------+
|        Col1        |  Col2   |        Col3         |
+--------------------+---------+---------------------+
| Dog, Dog, Dog      | India   | Facebook, Instagram |
| Dog, Squirrel, Cat | Norway  | Facebook, Facebook  |
| Cat, Cat, Cat      | Germany | Twitter             |
+--------------------+---------+---------------------+

Reproducible example:

import pandas as pd

df = pd.DataFrame({"col1": ["Dog, Dog, Dog", "Dog, Squirrel, Cat", "Cat, Cat, Cat"],
                     "col2": ["India", "Norway", "Germany"],
                     "col3": ["Facebook, Instagram", "Facebook, Facebook", "Twitter"]})

I want to turn it into this:

+--------------------+---------+---------------------+
|        Col1        |  Col2   |        Col3         |
+--------------------+---------+---------------------+
| Dog                | India   | Facebook, Instagram |
| Dog, Squirrel, Cat | Norway  | Facebook            |
| Cat                | Germany | Twitter             |
+--------------------+---------+---------------------+

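Let's use get_dummies and then dot. Note that the result lists the unique items in alphabetical order, not in their original order: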

s=df.col1.str.get_dummies(', ')
df['Col1']=s.dot(s.columns+',').str[:-1]
df
Out[460]: 
                 col1     col2                 col3              Col1
0       Dog, Dog, Dog    India  Facebook, Instagram               Dog
1  Dog, Squirrel, Cat   Norway   Facebook, Facebook  Cat,Dog,Squirrel
2       Cat, Cat, Cat  Germany              Twitter               Cat

You can do it like this:

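for col in df.columns.tolist():
    # Collapse a run of the same word into one; \1 refers back to the captured word.
    # Only consecutive repeats are merged ("Dog, Cat, Dog" is left unchanged).
    df[col] = df[col].str.replace(r'\b(\w+)(?:,\s+\1\b)+', r'\1', regex=True)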

Try the following (a set does not keep the original order of the items):

for col in ["col1", "col2", "col3"]:
    df[col]=df[col].str.split(", ").map(set).str.join(", ")

Output:

>>> df

                 col1     col2                 col3
0                 Dog    India  Facebook, Instagram
1  Dog, Cat, Squirrel   Norway             Facebook
2                 Cat  Germany              Twitter
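
Since a set does not guarantee order, here is a minimal sketch of an order-preserving variant that keeps the first occurrence of each item. It is not taken from any of the answers above: the helper name dedupe_cell and its sep parameter are made up for illustration, and it assumes every item is separated by exactly ", ".

```python
import pandas as pd


def dedupe_cell(value, sep=", "):
    """Hypothetical helper: drop repeated items in one delimited string."""
    # Leave non-string cells (for example NaN) untouched.
    if not isinstance(value, str):
        return value
    # dict.fromkeys keeps only the first occurrence of each key, in insertion order.
    return sep.join(dict.fromkeys(value.split(sep)))


df = pd.DataFrame({"col1": ["Dog, Dog, Dog", "Dog, Squirrel, Cat", "Cat, Cat, Cat"],
                   "col2": ["India", "Norway", "Germany"],
                   "col3": ["Facebook, Instagram", "Facebook, Facebook", "Twitter"]})

# Apply the helper to every cell of every column.
for col in df.columns:
    df[col] = df[col].map(dedupe_cell)

print(df)
```

Unlike the regex approach, this also removes repeats that are not adjacent, and unlike the set approach, "Dog, Squirrel, Cat" keeps its original order.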