DataFrame join and split
I have a dataframe dF with 3 columns: A, B, and C.
dF =
A                                         B                 C
navigate to "www.xyz.com"                 to "www.xyz.com"  NA
enters valid username "JOHN"              enters            "JOHN"
enters password "1234567"                 enters            "1234567"
enters RIGHT destination"YUL"             enters            "YUL"
clicks Customer Service                   clicks            NA
clicks Booking Information from Booking   clicks            NA
I want to find the difference between A and B+C, and put the remaining words in a new column D. I would like my dataframe to look like this:
dF =
A                                         B                 C          D
navigate to "www.xyz.com"                 to "www.xyz.com"  NA         navigate
enters valid username "JOHN"              enters            "JOHN"     valid username
enters valid password "1234567"           enters            "1234567"  valid password
enters RIGHT destination"YUL"             enters            "YUL"      RIGHT destination
clicks Customer Service                   clicks            NA         Customer Service
clicks Booking Information from Booking   clicks            NA         Booking Information from Booking
I am using:
df['D'] = Final_df[['B', 'C']].agg(' '.join, axis=1).str.split(' ')
df['D'] = df.apply(lambda x: ''.join(set(x['A'].split(' ')) - set(x['D'])), axis=1)
But the words in column D do not come out in their original order.
import pandas as pd

# Sample data from the question, wrapped in a DataFrame so the .str accessor works
df = pd.DataFrame({'A': ['navigate to "www.xyz.com"',
                         'enters valid username "JOHN"',
                         'enters password "1234567"',
                         'enters RIGHT destination"YUL"',
                         'clicks Customer Service',
                         'clicks Booking Information from Booking'],
                   'B': ['to "www.xyz.com"', 'enters', 'enters', 'enters', 'clicks', 'clicks'],
                   'C': ['NA', '"JOHN"', '"1234567"', '"YUL"', 'NA', 'NA']})
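If you are sure that all words are space-separated (which is not true of row 4, where destination"YUL" has no space), you can use split, but do not convert 'A' to a set, so that the word order is preserved.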
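# Keep A as an ordered list of tokens; only B and C are turned into sets
a = df['A'].str.split()
b = df['B'].str.split().apply(set)
c = df['C'].str.split().apply(set)
# For each row, keep the tokens of A (in order) that appear in neither B nor C
df['D'] = [' '.join([a2 for a2 in a1 if a2 not in (b1 | c1)]) for a1, b1, c1 in zip(a,b,c)]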
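To see how the split-based approach behaves, including on the row that is not space-separated, here is a minimal self-contained sketch on three of the sample rows. The helper name ordered_difference is only illustrative and not part of pandas.

```python
import pandas as pd

# Three rows from the question; the last one lacks a space before "YUL"
df = pd.DataFrame({
    'A': ['navigate to "www.xyz.com"',
          'enters valid username "JOHN"',
          'enters RIGHT destination"YUL"'],
    'B': ['to "www.xyz.com"', 'enters', 'enters'],
    'C': ['NA', '"JOHN"', '"YUL"'],
})

def ordered_difference(a_text, b_text, c_text):
    """Return the tokens of a_text found in neither b_text nor c_text, in order."""
    drop = set(b_text.split()) | set(c_text.split())
    # Iterating over the list (not a set) of A's tokens keeps their order
    return ' '.join(tok for tok in a_text.split() if tok not in drop)

df['D'] = [ordered_difference(a, b, c)
           for a, b, c in zip(df['A'], df['B'], df['C'])]
print(df['D'].tolist())
# ['navigate', 'valid username', 'RIGHT destination"YUL"']
```

The last row keeps "YUL" because destination"YUL" is a single token after splitting, so it never matches the token "YUL" from column C.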
Otherwise, you can consider replace:
df['D'] = df.apply(lambda r: r['A'].replace(r['B'], '').replace(r['C'], '').strip(), axis=1)
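A slightly more defensive variant of the replace approach might look like the sketch below. It rests on two assumptions that are not stated in the question: that 'NA' is a placeholder for "no value" rather than text to remove, and that leftover runs of whitespace should be collapsed. The helper name remove_substrings is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['navigate to "www.xyz.com"',
          'enters RIGHT destination"YUL"',
          'clicks Customer Service'],
    'B': ['to "www.xyz.com"', 'enters', 'clicks'],
    'C': ['NA', '"YUL"', 'NA'],
})

def remove_substrings(a_text, *parts, placeholder='NA'):
    """Remove each part from a_text as a substring, skipping the placeholder."""
    result = a_text
    for part in parts:
        # Assumption: 'NA' marks a missing value and should not be removed
        if part and part != placeholder:
            result = result.replace(part, ' ')
    # Collapse the whitespace left behind by the removals
    return ' '.join(result.split())

df['D'] = df.apply(lambda r: remove_substrings(r['A'], r['B'], r['C']), axis=1)
print(df['D'].tolist())
# ['navigate', 'RIGHT destination', 'Customer Service']
```

Because str.replace matches substrings rather than whole words, this handles destination"YUL" correctly, but it would also strip a value of B or C that happens to occur inside a longer word in A.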