How to create a new column based on whether certain strings exist in another column?
I have a table that looks like this:
+--------+-------------+
| Time | Locations |
+--------+-------------+
| 1/1/22 | A300-abc |
+--------+-------------+
| 1/2/22 | A300-FFF |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc |
+--------+-------------+
| 1/5/22 | B750-EEE |
+--------+-------------+
| 1/6/22 | M-200-68 |
+--------+-------------+
| 1/7/22 | ABC-abc |
+--------+-------------+
I want to derive a table like the following:
+--------+-------------+-----------------+
| Time | Locations | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc | A300 |
+--------+-------------+-----------------+
| 1/2/22 | A300-FFF | A300 |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300 |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc | B700 |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE | B750 |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68 | M-200 |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc | "not_listed" |
+--------+-------------+-----------------+
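Basically, I have a list of what the location codes should be, e.g. ["A300", "B700", "B750", "M-200"], but the Locations column is currently cluttered with other random strings. I want to create a new column that shows a "cleaned" version of the location code; anything not in that list should be marked "not_listed".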
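Use a regular expression together with a when condition. Here I check whether the string starts with a digit (^[0-9]) and, if it does, extract the leading digits; otherwise the row is labeled not_listed. The code: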
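from pyspark.sql.functions import col, lit, regexp_extract, when
# show() returns None, so do not assign its result to df
df = df.withColumn('Locations_Clean', when(col("Locations").rlike("^[0-9]"), regexp_extract('Locations', '^[0-9]+', 0)).otherwise(lit('not_listed')))
df.show()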
+--------------------+---------+---------------+
| Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456| 300abc| 300|
|0.022727272727272728| 300FFF| 300|
| 0.01515151515151515| 300ABC| 300|
|0.011363636363636364| 700abc| 700|
|0.009090909090909092| 750EEE| 750|
|0.007575757575757575| ABCabc| not_listed|
+--------------------+---------+---------------+
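To see what these two patterns do without starting Spark, the same check can be mimicked with Python's standard `re` module. This is only a sketch of the regex logic, not PySpark code, and `leading_digits` is a hypothetical helper name introduced for illustration.

```python
import re


def leading_digits(value):
    """Mimic rlike('^[0-9]') followed by regexp_extract(..., '^[0-9]+', 0)."""
    match = re.match(r"^[0-9]+", value)
    # No leading digit means the row falls through to the 'otherwise' branch.
    return match.group(0) if match else "not_listed"


for raw in ["300abc", "750EEE", "ABCabc"]:
    print(raw, "->", leading_digits(raw))
```

Because the pattern is anchored with `^`, digits that appear later in the string are ignored.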
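For your updated question, use regexp_replace to strip the final hyphen-delimited segment from any value that contains a digit: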
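from pyspark.sql.functions import col, lit, regexp_replace, when
# raw strings keep the regex backslashes from being read as Python escapes
df = df.withColumn('Locations_Clean', when(col("Locations").rlike(r"\d"), regexp_replace('Locations', r'\-\w+$', '')).otherwise(lit('not_listed')))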
+------+-----------+---------------+
| Time| Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22| A300-abc| A300|
|1/2/22| A300-FFF| A300|
|1/3/22|A300-ABC123| A300|
|1/4/22| B700-abc| B700|
|1/5/22| B750-EEE| B750|
|1/7/22| M-200-68| M-200|
|1/6/22| ABCabc| not_listed|
+------+-----------+---------------+
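The regexp_replace approach keeps any value that contains a digit, so it never consults the list of valid codes from the question. Below is a minimal sketch of a list-driven variant, written in plain Python with the standard `re` module so the pattern can be checked without a Spark session. `VALID_CODES`, `build_pattern`, and `clean_location` are hypothetical names, and the sketch assumes a valid code always sits at the start of the value.

```python
import re

# Assumed list of valid location codes, copied from the question.
VALID_CODES = ["A300", "B700", "B750", "M-200"]


def build_pattern(codes):
    """Build a regex that matches any listed code at the start of a string."""
    # Longest codes first, so a longer code wins over a shorter prefix of it.
    ordered = sorted(codes, key=len, reverse=True)
    return "^(" + "|".join(re.escape(code) for code in ordered) + ")"


def clean_location(value, pattern):
    """Return the leading listed code, or 'not_listed' when none matches."""
    match = re.match(pattern, value)
    return match.group(1) if match else "not_listed"


pattern = build_pattern(VALID_CODES)
for raw in ["A300-abc", "A300-ABC123", "M-200-68", "ABC-abc"]:
    print(raw, "->", clean_location(raw, pattern))
```

In PySpark, the same pattern string could presumably be passed to `regexp_extract('Locations', pattern, 1)`. That function returns an empty string when nothing matches, so the `when(...).otherwise(lit('not_listed'))` condition would test for a non-empty result. Note that this sketch would also accept a value such as "A3000-x" as A300, so a stricter boundary may be needed for real data.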