Keep unique words in a pandas DataFrame row

DataFrame:

> df
> type(df)
pandas.core.frame.DataFrame

ID      Property Type                                Amenities
1952043 Apartment, Villa, Apartment                  Park, Jogging Track, Park
1918916 Bungalow, Cottage House, Cottage, Bungalow   Garden, Play Ground

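How can I keep only the unique comma-separated items in each row of the DataFrame? Here, "Cottage House" and "Cottage" must not be treated as the same item. The check has to cover all columns of the DataFrame, so the output I want should look like this:

Desired output: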

    ID      Property Type                      Amenities
    1952043 Apartment, Villa                   Park, Jogging Track
    1918916 Bungalow, Cottage House, Cottage   Garden, Play Ground

First, I wrote a function that does what you want for a single string. Second, I applied that function to every string in a column.

import pandas as pd

df = pd.DataFrame([['Apartment, Villa, Apartment',
                    'Park, Jogging Track, Park'],
                   ['Bungalow, Cottage House, Cottage, Bungalow',
                    'Garden, Play Ground']],
                  columns=['Property Type', 'Amenities'])

def drop_duplicates(cell):
    # Split the string on ', ', drop repeated items, and join back.
    # dict.fromkeys keeps the first occurrence of each item in its original
    # order; np.unique would also deduplicate but sorts the result.
    words = cell.split(', ')
    return ', '.join(dict.fromkeys(words))

# drop_duplicates is applied to every cell of each column.
df['Property Type'] = df['Property Type'].apply(drop_duplicates)
df['Amenities'] = df['Amenities'].apply(drop_duplicates)
print(df)
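
The question also asks that every column be checked, not just two named ones. A minimal sketch of one way to do that, assuming ID is the only column that should be skipped and that every other column holds comma-separated text. The helper name unique_items and the text_columns list are hypothetical, introduced only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1952043, 1918916],
    'Property Type': ['Apartment, Villa, Apartment',
                      'Bungalow, Cottage House, Cottage, Bungalow'],
    'Amenities': ['Park, Jogging Track, Park', 'Garden, Play Ground'],
})

def unique_items(cell):
    # Hypothetical helper: split on commas, strip surrounding whitespace,
    # and keep the first occurrence of each item (dicts preserve
    # insertion order).
    items = [item.strip() for item in cell.split(',')]
    return ', '.join(dict.fromkeys(items))

# Assumption: every column except 'ID' holds comma-separated text.
text_columns = [col for col in df.columns if col != 'ID']
for col in text_columns:
    df[col] = df[col].map(unique_items)

print(df.to_string(index=False))
```

Because whole items are compared after splitting, "Cottage House" and "Cottage" stay distinct.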

Read the file into a pandas DataFrame:

>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
0                    {Apartment, Villa}
1    {Cottage, Bungalow, Cottage House}
Name: Property Type, dtype: object

The main idea is to

  1. iterate through each row,
  2. split the string in the target column on ', ',
  3. return the unique set() of the list from step 2.

Code:

>>> proptype_column = df['Property Type']
>>> for row in proptype_column: # Step 1.
...     items_in_row = row.split(', ') # Step 2.
...     uniq_items_in_row = set(items_in_row) # Step 3.
...     print(uniq_items_in_row)
... 
set(['Apartment', 'Villa'])
set(['Cottage', 'Bungalow', 'Cottage House'])

Now you can achieve the same thing with apply() on the column (Series.apply()):

>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))
>>> df['Property Type (Unique)'] = proptype_uniq
>>> df
      ID                               Property Type  \
0  12345                 Apartment, Villa, Apartment   
1  67890  Bungalow, Cottage House, Cottage, Bungalow   

                   Amenities              Property Type (Unique)  
0  Park, Jogging Track, Park                  {Apartment, Villa}  
1        Garden, Play Ground  {Cottage, Bungalow, Cottage House}
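
Note that a set does not keep the original order and is not a string, whereas the desired output in the question is a comma-separated string in first-seen order. A possible sketch of getting that form, assuming items are always separated by ', '. The names proptype and proptype_uniq_str are hypothetical, and dict.fromkeys is swapped in for set() because it preserves insertion order:

```python
import pandas as pd

proptype = pd.Series(['Apartment, Villa, Apartment',
                      'Bungalow, Cottage House, Cottage, Bungalow'],
                     name='Property Type')

# Split each cell into a list of items.
split_items = proptype.str.split(', ')

# Drop repeats while keeping first-seen order, then join back to a string.
proptype_uniq_str = split_items.apply(
    lambda items: ', '.join(dict.fromkeys(items)))

print(proptype_uniq_str.tolist())
# ['Apartment, Villa', 'Bungalow, Cottage House, Cottage']
```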