Remove duplicates within a string, but for the entire dataframe
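I want to achieve something like what is described in another post, but efficiently and for the entire dataframe.
My data looks like this: it is a pandas dataframe with many columns. The cells hold comma-separated strings that contain many duplicates, and I want to remove all duplicates within each individual string.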
+--------------------+---------+---------------------+
| Col1 | Col2 | Col3 |
+--------------------+---------+---------------------+
| Dog, Dog, Dog | India | Facebook, Instagram |
| Dog, Squirrel, Cat | Norway | Facebook, Facebook |
| Cat, Cat, Cat | Germany | Twitter |
+--------------------+---------+---------------------+
Reproducible example:
import pandas as pd

df = pd.DataFrame({"col1": ["Dog, Dog, Dog", "Dog, Squirrel, Cat", "Cat, Cat, Cat"],
                   "col2": ["India", "Norway", "Germany"],
                   "col3": ["Facebook, Instagram", "Facebook, Facebook", "Twitter"]})
I would like to turn it into this:
+--------------------+---------+---------------------+
| Col1 | Col2 | Col3 |
+--------------------+---------+---------------------+
| Dog | India | Facebook, Instagram |
| Dog, Squirrel, Cat | Norway | Facebook |
| Cat | Germany | Twitter |
+--------------------+---------+---------------------+
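Let's use get_dummies, then dot. Note that the items come back sorted alphabetically and joined with a bare comma, so the original item order is not preserved: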
s=df.col1.str.get_dummies(', ')
df['Col1']=s.dot(s.columns+',').str[:-1]
df
Out[460]:
col1 col2 col3 Col1
0 Dog, Dog, Dog India Facebook, Instagram Dog
1 Dog, Squirrel, Cat Norway Facebook, Facebook Cat,Dog,Squirrel
2 Cat, Cat, Cat Germany Twitter Cat
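You can do this (it collapses only consecutive repeats such as "Dog, Dog, Dog"):
for col in df.columns.tolist():
    # \1 refers back to the captured word, so only immediate repeats of that word are dropped
    df[col] = df[col].str.replace(r'\b(\w+)(?:,\s+\1\b)+', r'\1', regex=True)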
Try:
for col in ["col1", "col2", "col3"]:
    # split into items, deduplicate with set (item order is not guaranteed), then rejoin
    df[col] = df[col].str.split(", ").map(set).str.join(", ")
Output:
>>> df
col1 col2 col3
0 Dog India Facebook, Instagram
1 Dog, Cat, Squirrel Norway Facebook
2 Cat Germany Twitter
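The set-based loop above removes the duplicates but can reorder the items, as row 1 of the output shows. If the original order matters, one possible variation is to deduplicate with `dict.fromkeys`, which keeps the first occurrence of each item in place. This is only a sketch: the helper name `dedupe_cell` is made up for illustration, and it assumes every cell uses exactly ", " as the separator.

```python
import pandas as pd


def dedupe_cell(value, sep=", "):
    """Drop repeated items from one delimited string, keeping first-seen order."""
    if not isinstance(value, str):
        return value  # leave NaN and other non-string cells untouched
    # dict.fromkeys keeps one key per distinct item, in insertion order
    return sep.join(dict.fromkeys(value.split(sep)))


df = pd.DataFrame({"col1": ["Dog, Dog, Dog", "Dog, Squirrel, Cat", "Cat, Cat, Cat"],
                   "col2": ["India", "Norway", "Germany"],
                   "col3": ["Facebook, Instagram", "Facebook, Facebook", "Twitter"]})

# Apply the helper to every cell, one column at a time
result = df.apply(lambda column: column.map(dedupe_cell))
print(result)
```

Because the helper works on plain Python strings, it handles non-consecutive repeats as well, and columns without separators (such as col2) pass through unchanged.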