Python: concatenate multiple columns if the next column is not already contained in the previous column

I have sample data like this:

col1                    col2        col3
PYTHON RD              APT 3         NaN
STACK AVE APT 2-3    APT 2-3         NaN
OVER ST 1/2         UNIT 1/2    UNIT 1/2
FLOW RD                  NaN         NaN
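
To make this reproducible, the sample can be rebuilt roughly as follows. This is only a sketch: it assumes the blank cells are real NaN values rather than the string "NaN".

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame shown above; missing cells are assumed to be NaN.
df = pd.DataFrame(
    {
        "col1": ["PYTHON RD", "STACK AVE APT 2-3", "OVER ST 1/2", "FLOW RD"],
        "col2": ["APT 3", "APT 2-3", "UNIT 1/2", np.nan],
        "col3": [np.nan, np.nan, "UNIT 1/2", np.nan],
    }
)
print(df)
```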

I want to create a new field:

col1                    col2        col3               COMBINED
PYTHON RD              APT 3         NaN        PYTHON RD APT 3
STACK AVE APT 2-3    APT 2-3         NaN      STACK AVE APT 2-3
OVER ST 1/2         UNIT 1/2    UNIT 1/2   OVER ST 1/2 UNIT 1/2
FLOW RD                  NaN         NaN                FLOW RD

I tried:

columns = ["col1", "col2", "col3"]
COMBINED = ''
for col in columns:
    df[col] = df[col].fillna("")
    COMBINED = COMBINED + df[col].str.strip() + ' '
    df['COMBINED'] = COMBINED.str.strip()

The above does concatenate the columns, but the second row ends up with a duplicate: STACK AVE APT 2-3 APT 2-3

Any suggestions?

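df["COMBINED"] = (
    df[["col1", "col2"]]
    .fillna("")
    .apply(
        # Keep col1 as-is when col2 is already a substring of it.
        lambda x: x.loc["col1"]
        if x.loc["col2"] in x.loc["col1"]
        else x.loc["col1"] + " " + x.loc["col2"],
        axis=1,
    )
)
print(df[["col1", "col2", "COMBINED"]])

Prints: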

                col1      col2              COMBINED
0          PYTHON RD     APT 3       PYTHON RD APT 3
1  STACK AVE APT 2-3   APT 2-3     STACK AVE APT 2-3
2        OVER ST 1/2  UNIT 1/2  OVER ST 1/2 UNIT 1/2
3            FLOW RD       NaN               FLOW RD

EDIT: For many columns:

def combine(x):
    out = []
    for word in x:
        if word and not any(word in w for w in out):
            out.append(word)
    return " ".join(out)


columns = ["col1", "col2", "col3"]
df["COMBINED"] = df[columns].fillna("").apply(combine, axis=1)
print(df)

Prints:

                col1      col2      col3              COMBINED
0          PYTHON RD     APT 3       NaN       PYTHON RD APT 3
1  STACK AVE APT 2-3   APT 2-3       NaN     STACK AVE APT 2-3
2        OVER ST 1/2  UNIT 1/2  UNIT 1/2  OVER ST 1/2 UNIT 1/2
3            FLOW RD       NaN       NaN               FLOW RD
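
One caveat: `word in w` is a plain substring test, so a value like "APT 2" would count as already present in "STACK AVE APT 23". If that matters for your data, a variant that compares whole whitespace-separated tokens might look like the sketch below. The helper names `contains_tokens` and `combine_tokens` and the `demo` frame are made up for illustration and are not part of the question.

```python
import pandas as pd
import numpy as np


def contains_tokens(haystack, needle):
    """True if the tokens of `needle` appear as a contiguous run in `haystack`."""
    h, n = haystack.split(), needle.split()
    return any(h[i : i + len(n)] == n for i in range(len(h) - len(n) + 1))


def combine_tokens(row):
    out = []
    for value in row:
        # Skip empty cells and values already covered by an earlier column.
        if value and not any(contains_tokens(prev, value) for prev in out):
            out.append(value)
    return " ".join(out)


# Hypothetical data: row 0 would be merged incorrectly by a substring test.
demo = pd.DataFrame(
    {
        "col1": ["STACK AVE APT 23", "STACK AVE APT 2-3"],
        "col2": ["APT 2", "APT 2-3"],
        "col3": [np.nan, "APT 2-3"],
    }
)
demo["COMBINED"] = demo.fillna("").apply(combine_tokens, axis=1)
print(demo["COMBINED"].tolist())
# ['STACK AVE APT 23 APT 2', 'STACK AVE APT 2-3']
```

The trade-off is that token matching is stricter: "APT 2-3" is no longer considered part of "STACK AVE APT 2-3B", so decide which behaviour fits your addresses.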

Not sure whether this covers all of your cases:

def combine(row):
    row = row.fillna("")
    result = row["col1"]
    for col in ["col2", "col3"]:
        if row[col] not in result:
            result += " " + row[col]
    return result
    
df["COMBINED"] = df.apply(combine, axis=1)

Let's try using unique and join:

df['col4'] = df.fillna('').apply(lambda x: ",".join(x.unique()).strip(','), axis=1)

Output:

                col1      col2      col3                       col4
0          PYTHON RD     APT 3       NaN            PYTHON RD,APT 3
1  STACK AVE APT 2-3   APT 2-3       NaN  STACK AVE APT 2-3,APT 2-3
2        OVER ST 1/2  UNIT 1/2  UNIT 1/2       OVER ST 1/2,UNIT 1/2
3            FLOW RD       NaN       NaN                    FLOW RD