Pyspark Split Dataframe string column into multiple columns
I am running a Spark Structured Streaming example on Spark 3.0.0, and for it I am using Twitter data. I pushed the Twitter data into Kafka, and a single record looks like this:
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Here, each field is separated by '|'. The fields are:
Date time
User ID
Tweet
Location
Now, when I read this message in Spark, I get a dataframe like this:
key | value
-----+-------------------------
| 2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
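For reference, the frame comes from a Kafka source along these lines (a minimal sketch; the broker address and the topic name "twitter" are assumptions, adjust them to your setup):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("TwitterStream").getOrCreate()

# Kafka delivers key and value as binary, so cast both to strings.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker address
      .option("subscribe", "twitter")                       # assumed topic name
      .load()
      .select(col("key").cast("string"), col("value").cast("string")))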
Following this answer, I added the following code block to my application:
import pyspark.sql.functions

split_col = pyspark.sql.functions.split(df['value'], '|')
df = df.withColumn("Tweet Time", split_col.getItem(0))
df = df.withColumn("User ID", split_col.getItem(1))
df = df.withColumn("Tweet Text", split_col.getItem(2))
df = df.withColumn("Location", split_col.getItem(3))
df = df.drop("key")
But it gives me output like this:
value                                                                                                                                                                                                | Tweet Time | User ID | Tweet Text | Location |
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+---------+------------+----------+
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India | 2          | 0       | 2          | 0        |
But I want output like this:
Tweet Time | User ID | Tweet text | Location |
-----------------------+-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India |
This happens because split accepts a pattern: a string representing a regular expression, and the regex string should be a Java regular expression. In a regex, an unescaped | is the alternation operator, so the pattern '|' matches the empty string at every position and the split produces one array element per character; that is why your new columns contain 2, 0, 2, 0, the first four characters of the value. Use r'\|' (or the character class '[|]') to split on a literal pipe.
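You can see the difference in isolation with a tiny standalone check (the one-row frame and the column name s are just for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.getOrCreate()
demo = spark.createDataFrame([("a|b|c",)], ["s"])

# Unescaped '|' is regex alternation of two empty branches; it matches between
# every character, so each character becomes its own array element.
demo.select(split("s", "|")).show()    # [a, |, b, |, c]

# Escaped, it matches the literal pipe character.
demo.select(split("s", r"\|")).show()  # [a, b, c]

Applied to your dataframe: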
from pyspark.sql.functions import split

split_col = split(df.value, r'\|')
df = df.withColumn("Tweet Time", split_col.getItem(0))\
       .withColumn("User ID", split_col.getItem(1))\
       .withColumn("Tweet Text", split_col.getItem(2))\
       .withColumn("Location", split_col.getItem(3))\
       .drop("key")
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|value |Tweet Time |User ID |Tweet Text |Location |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
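Note that the value column is still present in the output above. If you want only the four parsed fields, as in your desired layout, you can split once and select in a single step (same logic as above, just a stylistic alternative):

from pyspark.sql.functions import split

# Split once, then keep only the parsed fields; "key" and "value" are simply
# not selected, so no explicit drop is needed.
parts = split(df["value"], r"\|")
df = df.select(
    parts.getItem(0).alias("Tweet Time"),
    parts.getItem(1).alias("User ID"),
    parts.getItem(2).alias("Tweet Text"),
    parts.getItem(3).alias("Location"),
)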