DATAFRAME join and divide

I have a DataFrame dF with 3 columns: A, B and C.

dF =

    A                                          B                   C
    navigate to "www.xyz.com"                  to "www.xyz.com"    NA
    enters valid username "JOHN"               enters              "JOHN"
    enters password "1234567"                  enters              "1234567"
    enters  RIGHT destination"YUL"             enters              "YUL"
    clicks Customer Service                    clicks              NA
    clicks Booking Information from Booking    clicks              NA

I want to find the difference between A and B + C, and put whatever remains into a new column D. I want my DataFrame to look like this:

dF =

    A                                          B                   C            D
    navigate to "www.xyz.com"                  to "www.xyz.com"    NA           navigate
    enters valid username "JOHN"               enters              "JOHN"       valid username
    enters password "1234567"                  enters              "1234567"    password
    enters  RIGHT destination"YUL"             enters              "YUL"        RIGHT destination
    clicks Customer Service                    clicks              NA           Customer Service
    clicks Booking Information from Booking    clicks              NA           Booking Information from Booking

I am using:

df['D'] = df[['B', 'C']].agg(' '.join, axis=1).str.split(' ')

df['D'] = df.apply(lambda x: ''.join(set(x['A'].split(' ')) - set(x['D'])), axis=1)

But the words in column D do not come out in their original order.

import pandas as pd

# Sample data; missing values in C are stored as the string 'NA'
df = pd.DataFrame({'A': ['navigate to "www.xyz.com"',
  'enters valid username "JOHN"',
  'enters password "1234567"',
  'enters  RIGHT destination"YUL"',
  'clicks Customer Service',
  'clicks Booking Information from Booking'],
 'B': ['to "www.xyz.com"', 'enters', 'enters', 'enters', 'clicks', 'clicks'],
 'C': ['NA', '"JOHN"', '"1234567"', '"YUL"', 'NA', 'NA']})

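If you are sure that all words are space-separated (which is not the case in row 4), you can use split, but do not convert 'A' to a set: a set has no defined order, whereas the list returned by split keeps the words in their original order.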

a = df['A'].str.split()
b = df['B'].str.split().apply(set)
c = df['C'].str.split().apply(set)

df['D'] = [' '.join([a2 for a2 in a1 if a2 not in (b1 | c1)]) for a1, b1, c1 in zip(a,b,c)]
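To make the ordering argument concrete, here is a minimal sketch of the same idea as a plain Python function, without pandas. The helper name `remaining_words` and the two sample rows are only illustrative. It assumes, as above, that words are separated by whitespace and that missing values are the string 'NA'.

```python
def remaining_words(a, b, c):
    """Return the words of `a` that appear in neither `b` nor `c`, in original order."""
    # Only the words to exclude go into a set; membership tests do not need order.
    exclude = set(b.split()) | set(c.split())
    # Iterate over the list from a.split(), not a set, so the order of `a` is kept.
    return ' '.join(word for word in a.split() if word not in exclude)

# Two rows taken from the sample data (hypothetical standalone usage).
rows = [
    ('enters valid username "JOHN"', 'enters', '"JOHN"'),
    ('clicks Booking Information from Booking', 'clicks', 'NA'),
]
result = [remaining_words(a, b, c) for a, b, c in rows]
print(result)  # ['valid username', 'Booking Information from Booking']
```

Note that repeated words in A (such as "Booking") survive, which a set difference would collapse. One caveat of this sketch: a literal word NA in column A would also be dropped whenever C is 'NA'.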

Otherwise, you can consider replace, which works on substrings and therefore also handles row 4:

df['D'] = df.apply(lambda r: r['A'].replace(r['B'], '').replace(r['C'], '').strip(), axis=1)
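One thing to watch with the replace approach is that it removes every occurrence of the substring, and that the 'NA' placeholder is itself treated as text to remove. The sketch below is one possible variant, not part of the answer above: the helper name `strip_parts` and its `placeholder` parameter are hypothetical. It assumes you want to skip missing values (None, NaN, or the string 'NA') and remove only the first occurrence of each part.

```python
import math

def strip_parts(a, *parts, placeholder='NA'):
    """Remove the first occurrence of each part from `a`, skipping missing values."""
    for part in parts:
        is_nan = isinstance(part, float) and math.isnan(part)
        # Skip None, float NaN and the placeholder string instead of replacing them.
        if part is None or is_nan or part == placeholder:
            continue
        a = a.replace(part, '', 1)  # third argument 1: replace the first match only
    # Collapse leftover runs of whitespace and trim both ends.
    return ' '.join(a.split())

rows = [
    ('navigate to "www.xyz.com"', 'to "www.xyz.com"', 'NA'),
    ('enters  RIGHT destination"YUL"', 'enters', '"YUL"'),
    ('clicks Booking Information from Booking', 'clicks', float('nan')),
]
result = [strip_parts(a, b, c) for a, b, c in rows]
print(result)  # ['navigate', 'RIGHT destination', 'Booking Information from Booking']
```

With a DataFrame, this could be applied row by row in the same way as the one-liner above, for example df.apply(lambda r: strip_parts(r['A'], r['B'], r['C']), axis=1).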