How to create a new column based on whether certain strings exist in another column?

Question

I have a table that looks like this:

+--------+-------------+
| Time   | Locations   |
+--------+-------------+
| 1/1/22 | A300-abc    |
+--------+-------------+
| 1/2/22 | A300-FFF    |
+--------+-------------+
| 1/3/22 | A300-ABC123 |
+--------+-------------+
| 1/4/22 | B700-abc    |
+--------+-------------+
| 1/5/22 | B750-EEE    |
+--------+-------------+
| 1/6/22 | M-200-68    |
+--------+-------------+
| 1/7/22 | ABC-abc     |
+--------+-------------+

I want to derive a table that looks like this:

+--------+-------------+-----------------+
| Time   | Locations   | Locations_Clean |
+--------+-------------+-----------------+
| 1/1/22 | A300-abc    | A300            |
+--------+-------------+-----------------+
| 1/2/22 | A300-FFF    | A300            |
+--------+-------------+-----------------+
| 1/3/22 | A300-ABC123 | A300            |
+--------+-------------+-----------------+
| 1/4/22 | B700-abc    | B700            |
+--------+-------------+-----------------+
| 1/5/22 | B750-EEE    | B750            |
+--------+-------------+-----------------+
| 1/6/22 | M-200-68    | M-200           |
+--------+-------------+-----------------+
| 1/7/22 | ABC-abc     | "not_listed"    |
+--------+-------------+-----------------+

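Basically, I have a list of what the location codes should be, e.g. ["A300", "B700", "B750", "M-200"], but the Locations column is currently very messy, with other random strings mixed in. I want to create a new column that shows the "cleaned" version of the location code; anything not in that list should be labeled "not_listed".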

Answer 1

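Use a regular expression together with a when condition. Here I check whether the string starts with a digit (^[0-9]) and, if it does, extract the leading digits from the string. If it does not, the value is classed as not listed. Code below:

from pyspark.sql.functions import col, lit, regexp_extract, when

df = df.withColumn('Locations_Clean', when(col('Locations').rlike('^[0-9]'), regexp_extract('Locations', '^[0-9]+', 0)).otherwise(lit('not_listed')))
df.show()  # show() returns None, so it must not be assigned back to df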

+--------------------+---------+---------------+
|                Time|Locations|Locations_Clean|
+--------------------+---------+---------------+
|0.045454545454545456|   300abc|            300|
|0.022727272727272728|   300FFF|            300|
| 0.01515151515151515|   300ABC|            300|
|0.011363636363636364|   700abc|            700|
|0.009090909090909092|   750EEE|            750|
|0.007575757575757575|   ABCabc|     not_listed|
+--------------------+---------+---------------+
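
To see what that pattern does value by value, here is a small plain-Python sketch using the standard `re` module. It mirrors the rlike / regexp_extract pair above on single strings rather than on a DataFrame column; the function name `leading_digits` is made up for illustration and is not part of PySpark.

```python
import re

def leading_digits(value, default="not_listed"):
    """Return the run of digits at the start of value, or default if there is none."""
    # re.match only matches at the start of the string, like ^ in the Spark pattern.
    match = re.match(r"[0-9]+", value)
    return match.group(0) if match else default

for value in ["300abc", "750EEE", "ABCabc"]:
    print(value, "->", leading_digits(value))
```

Values that start with digits keep only those digits, and anything else falls through to the default, which is the same split the when/otherwise expression makes.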

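For your updated question, use regexp_replace to strip the trailing "-suffix" from any value that contains a digit:

from pyspark.sql.functions import col, lit, regexp_replace, when

df = df.withColumn('Locations_Clean', when(col('Locations').rlike(r'\d'), regexp_replace('Locations', r'-\w+$', '')).otherwise(lit('not_listed')))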

+------+-----------+---------------+
|  Time|  Locations|Locations_Clean|
+------+-----------+---------------+
|1/1/22|   A300-abc|           A300|
|1/2/22|   A300-FFF|           A300|
|1/3/22|A300-ABC123|           A300|
|1/4/22|   B700-abc|           B700|
|1/5/22|   B750-EEE|           B750|
|1/7/22|   M-200-68|          M-200|
|1/6/22|     ABCabc|     not_listed|
+------+-----------+---------------+
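
Note that the regexp_replace version keeps any value that contains a digit, so it never consults the list of valid codes from the question (a value such as "Z999-abc" would come out as "Z999"). If the list should drive the result, one option is to build the pattern from the list itself. The following is a minimal sketch of that rule in plain Python with `re`, so it can be checked without a Spark session; `VALID_CODES`, `build_pattern`, and `clean_location` are hypothetical names chosen for this example, and the rule assumes a listed code always appears at the start of the value.

```python
import re

# Valid codes taken from the question; the helper names below are hypothetical.
VALID_CODES = ["A300", "B700", "B750", "M-200"]

def build_pattern(codes):
    """Build a regex that matches any listed code at the start of a string."""
    # Longest codes first, so a longer code wins over a shorter code that is its prefix.
    ordered = sorted(codes, key=len, reverse=True)
    return "^(" + "|".join(re.escape(code) for code in ordered) + ")"

def clean_location(value, pattern, default="not_listed"):
    """Return the listed code that prefixes value, or default if none does."""
    match = re.match(pattern, value)
    return match.group(1) if match else default

pattern = build_pattern(VALID_CODES)
samples = ["A300-abc", "A300-FFF", "A300-ABC123", "B700-abc",
           "B750-EEE", "M-200-68", "ABC-abc"]
cleaned = [clean_location(s, pattern) for s in samples]
print(cleaned)
```

In PySpark, the same pattern string could presumably be passed to regexp_extract('Locations', pattern, 1), with when(...).otherwise(lit('not_listed')) handling the empty string that regexp_extract returns when nothing matches. Because this is a prefix match, a value like "A3000-x" would still be cleaned to "A300"; append a delimiter check to the pattern if that is not wanted.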

pyspark