Keep unique words in a pandas DataFrame row
DataFrame:
> df
>type(df)
pandas.core.frame.DataFrame
ID       Property Type                               Amenities
1952043  Apartment, Villa, Apartment                 Park, Jogging Track, Park
1918916  Bungalow, Cottage House, Cottage, Bungalow  Garden, Play Ground
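How can I keep only the unique comma-separated words in each row of the DataFrame? In this case, "Cottage House" and "Cottage" must not be treated as the same item. The check has to cover all columns of the DataFrame. The output I want should look like this:
Expected output: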
ID       Property Type                      Amenities
1952043  Apartment, Villa                   Park, Jogging Track
1918916  Bungalow, Cottage House, Cottage   Garden, Play Ground
First, I wrote a function that does what you want for a single string. Then I applied that function to every string in the column.
import numpy as np
import pandas as pd

df = pd.DataFrame([['Apartment, Villa, Apartment',
                    'Park, Jogging Track, Park'],
                   ['Bungalow, Cottage House, Cottage, Bungalow',
                    'Garden, Play Ground']],
                  columns=['Property Type', 'Amenities'])

def drop_duplicates(row):
    # Split the string on ', ', drop duplicates and join the items back.
    # Note: np.unique returns the unique items in sorted order.
    words = row.split(', ')
    return ', '.join(np.unique(words).tolist())

# drop_duplicates is applied to every cell of each column.
df['Property Type'] = df['Property Type'].apply(drop_duplicates)
df['Amenities'] = df['Amenities'].apply(drop_duplicates)
print(df)
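Because np.unique sorts its result, the second row comes back as "Bungalow, Cottage, Cottage House" rather than in the order shown in the expected output. If the original order matters, a minimal sketch along these lines may help. The helper name drop_duplicates_keep_order and the text_columns list are made up for illustration, and the sketch assumes every listed column contains only comma-separated strings (no missing values).

```python
import pandas as pd

df = pd.DataFrame(
    [['Apartment, Villa, Apartment', 'Park, Jogging Track, Park'],
     ['Bungalow, Cottage House, Cottage, Bungalow', 'Garden, Play Ground']],
    columns=['Property Type', 'Amenities'])


def drop_duplicates_keep_order(cell):
    # Hypothetical helper: split on commas, trim whitespace, and keep only
    # the first occurrence of each item. dict.fromkeys preserves insertion
    # order, so the items stay in their original order.
    items = [item.strip() for item in cell.split(',')]
    return ', '.join(dict.fromkeys(items))


# Assumption: every column listed here holds comma-separated strings.
text_columns = ['Property Type', 'Amenities']
for column in text_columns:
    df[column] = df[column].apply(drop_duplicates_keep_order)

print(df)
```

Looping over a list of column names is one way to cover "all columns" from the question without touching a numeric column such as ID.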
Reading the file into a pandas DataFrame:
>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
0 {Apartment, Villa}
1 {Cottage, Bungalow, Cottage House}
Name: Property Type, dtype: object
The main idea is to:
- iterate over each row,
- split the string in the target column on ',',
- return the unique set() of the list from step 2.
Code:
>>> proptype_column = df['Property Type']
>>> for row in proptype_column:                # Step 1.
...     items_in_row = row.split(', ')         # Step 2.
...     uniq_items_in_row = set(items_in_row)  # Step 3.
...     print(uniq_items_in_row)               # Set order may vary.
...
{'Apartment', 'Villa'}
{'Cottage', 'Bungalow', 'Cottage House'}
Now you can achieve the same with the apply() function (called here on the single column df['Property Type']):
>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
0 {Apartment, Villa}
1 {Cottage, Bungalow, Cottage House}
Name: Property Type, dtype: object
>>> proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))
>>> df['Property Type (Unique)'] = proptype_uniq
>>> df
ID Property Type \
0 12345 Apartment, Villa, Apartment
1 67890 Bungalow, Cottage House, Cottage, Bungalow
Amenities Property Type (Unique)
0 Park, Jogging Track, Park {Apartment, Villa}
1 Garden, Play Ground {Cottage, Bungalow, Cottage House}
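The set column above loses the original item order and is no longer a plain string. As an alternative that is not taken from the answers here, the same split / deduplicate idea can be sketched with Series.str.split, Series.explode and a groupby on the row label. The helper name unique_items is hypothetical, and the sketch assumes the text columns contain no missing values.

```python
import pandas as pd

df = pd.DataFrame({
    'Property Type': ['Apartment, Villa, Apartment',
                      'Bungalow, Cottage House, Cottage, Bungalow'],
    'Amenities': ['Park, Jogging Track, Park', 'Garden, Play Ground'],
})


def unique_items(column):
    # Hypothetical helper. Split each cell on commas and explode the lists,
    # giving one item per row with the original row label repeated.
    exploded = column.str.split(',').explode().str.strip()
    # Regroup by the original row label, drop repeats within each group
    # (first occurrence wins), and join the items back into one string.
    return exploded.groupby(level=0).agg(
        lambda items: ', '.join(items.drop_duplicates()))


# DataFrame.apply passes each column to unique_items as a Series.
deduplicated = df.apply(unique_items)
print(deduplicated)
```

Here DataFrame.apply really is applied column by column, so every text column is handled in one call. A numeric column such as ID would need to be excluded first.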