Removing single word strings from a Dataframe and moving them to a csv

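I am trying to remove single-word strings from a dataframe (ou) and move them to another dataframe (removedSet, allowedSet), and then to a csv (names.csv, removed.csv). I am able to filter out specific strings, but I cannot remove the single words from the dataframe I just made, allowedSet.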

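So I need to take the two dataframes I just made and check whether they contain single-word strings. I want to append the single words to the dataframe of removed strings, removedSet, and remove them from the other dataframe, allowedSet, which should contain only full names.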

This is my desired output for COMPnames.csv:

COMP:Jenny Pepper:
COMP:Harry Perlinkle

COMPremoved.csv

COMP:LINK:
COMP:Printer90
COMP:faxeast test

But my COMPnames.csv file has:

COMP:Jenny Pepper:

COMPremoved.csv

COMP:Harry Perlinkle
COMP:LINK:
COMP:Printer90
COMP:faxeast test

Then the same for the NEWS department, and so on.

The code below shows what I have tried so far.

Here is a smaller-scale example of my code that you can run:

import pandas as pd

ou = {'OU':  ['COMP:Jenny Pepper:', 'COMP:Harry Perlinkle', 'COMP:LINK:', 'NEWS:Peter Parker:', 'NEWS:PARK:', 'NEWS:Clark Kent:', 'NEWS:Store Merch', 'COMP:Printer90', 'NEWS:store123', 'COMP:faxeast test']}

df = pd.DataFrame(ou)

#my ou list
oulist = pd.DataFrame({'OU':['COMP', 'NEWS']})

#the strings I want to remove
removeStr = ['printer', 'store', 'fax', 'link', 'park']

for dept in oulist['OU']:
        #I want only the rows that contains strings in the OU column
        df_dept = df[df['OU'].str.startswith(f'{dept}:')]
        
        #put all of them in one csv file
        df_dept['OU'].to_csv(f'{dept}all.csv', index=False, header=False)
        
        #these two lines look in the removeStr list and make sure to check between the ':' so it doesn't grab the department name.         
        removedStr = [any(x in row[row.find(':')+1:].lower() for x in removeStr ) for row in df_dept['OU']]
        allowedStr = [all(not x in row[row.find(':')+1:].lower() for x in removeStr ) for row in df_dept['OU']]
        
        #remake the dataframe now
        removedSet = df_dept[removedStr]
        csvRemove = f'{dept}removed.csv'
        
        allowedSet = df_dept[allowedStr ]
        csvAllowed = f'{dept}names.csv'

        #move both of them to csv
        removedSet.to_csv(csvRemove, sep=',', encoding='utf-8', index=False,  mode='a', header=False)
        allowedSet.to_csv(csvAllowed , sep=',', encoding='utf-8', index=False, mode='a', header=False)

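Below is one of my attempts. I wanted to remove the single-word strings manually with a simple str.split() and then move them into new dataframes, df1 and df2. But that did not work either.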

m=removedSet['OU'].str.split('^[\w]+\:|\s').str.len()==2
m=list(m) #as this is how it's outputted for removedStr and allowedStr
df1=removedSet[m]
df2=allowedSet[~m]

#move both of them to csv
df1.to_csv(csvRemove, sep=',', encoding='utf-8', index=False,  mode='a', header=False)
df2.to_csv(csvAllowed , sep=',', encoding='utf-8', index=False, mode='a', header=False)        

I get this error:

df2=allowedSet[~m]
TypeError: bad operand type for unary ~: 'list'
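
For context on this error: unary `~` is not defined for a plain Python list, whereas a pandas boolean Series supports it element-wise, so converting the mask with `list(m)` is what triggers the TypeError. The following is a minimal sketch, using a hypothetical three-row frame rather than the full data above, of keeping the mask as a Series. The names `names`, `single_word`, `one_word_rows` and `multi_word_rows` are illustrative assumptions, not part of the original code.

```python
import pandas as pd

# Hypothetical sample, not the full data from the question
df = pd.DataFrame({'OU': ['COMP:Jenny Pepper:', 'COMP:LINK:', 'COMP:Printer90']})

# Text after the first ':' with any trailing ':' stripped
names = df['OU'].str.split(':', n=1).str[1].str.rstrip(':')

# Boolean Series: True where the remaining name is a single word
single_word = names.str.split().str.len() == 1

try:
    ~list(single_word)  # a plain list has no unary ~
except TypeError as exc:
    print(exc)

one_word_rows = df[single_word]      # rows whose name is one word
multi_word_rows = df[~single_word]   # ~ works on a boolean Series
```

A mask built this way is aligned to the index of the frame it was computed from, so it is safest to apply it only to that same frame rather than to a different one such as allowedSet.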

IIUC, use a regex with word boundaries and groupby to save your files:

import re
regex = '|'.join(re.escape(w) for w in removeStr)
# 'printer|store|fax|link|park'

group1 = df['OU'].str.extract('([^:]+)', expand=False)
group2 = (df['OU'].str.contains(fr'\b({regex})\d*\b', case=False)
                  .map({True: 'removed',
                        False: 'names'}))
for (g1, g2), g in df.groupby([group1, group2]):
    filename = f'{g1}_{g2}.csv'
    print(f'saving "{filename}"')
    print(g)
    #g.to_csv(filename) # uncomment to save

Output:

saving "COMP_names.csv"
                     OU
0    COMP:Jenny Pepper:
1  COMP:Harry Perlinkle
9     COMP:faxeast test
saving "COMP_removed.csv"
               OU
2      COMP:LINK:
7  COMP:Printer90
saving "NEWS_names.csv"
                   OU
3  NEWS:Peter Parker:
5    NEWS:Clark Kent:
saving "NEWS_removed.csv"
                 OU
4        NEWS:PARK:
6  NEWS:Store Merch
8     NEWS:store123

NB. The generated regex '\b(printer|store|fax|link|park)\d*\b' matches the blacklisted words as whole words, optionally allowing digits at the end.
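
To see what the word boundaries do in practice, here is a small sketch that applies the same pattern to a few of the sample strings. It swaps in a non-capturing group `(?:...)`, which is my substitution and not part of the answer above; the intent is only to avoid the match-group warning that `str.contains` can emit for capturing groups. The names `alternatives`, `pattern`, `samples` and `matched` are illustrative.

```python
import re
import pandas as pd

removeStr = ['printer', 'store', 'fax', 'link', 'park']

# Same alternation as above, wrapped in a non-capturing group (assumption)
alternatives = '|'.join(re.escape(w) for w in removeStr)
pattern = fr'\b(?:{alternatives})\d*\b'

samples = pd.Series(['COMP:Harry Perlinkle', 'COMP:LINK:', 'COMP:Printer90',
                     'NEWS:Peter Parker:', 'COMP:faxeast test'])

# True where a blacklisted word appears as a whole word (digits may follow)
matched = samples.str.contains(pattern, case=False)
print(pd.DataFrame({'OU': samples, 'removed': matched}))
```

Only `COMP:LINK:` and `COMP:Printer90` match here. `Perlinkle` and `Parker` are left alone because the blacklisted word sits inside a longer word, and for the same reason `faxeast` is not matched, so `COMP:faxeast test` stays with the names unless the pattern is loosened.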