Spark does not recognize newlines, &, etc. in a String
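I am trying to process text data (Twitter tweets) with PySpark. Emojis and special characters are read correctly, but "\n" and "&" appear to be escaped, and Spark does not recognize them. Other sequences are probably affected too. An example tweet in my Spark DataFrame looks like this:
- "Hello everyone\n\nHow is it going? Take care & enjoy"
I would like Spark to read them correctly. The files are stored as Parquet, and I am reading them like this: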
tweets = spark.read.format('parquet')\
    .option('header', 'True')\
    .option('encoding', 'utf-8')\
    .load(path)
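Below is some sample input data that I took from the original JSONL files (I later stored the data as Parquet).
"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED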
Reading directly from the JSONL files leads to the same recognition problem.
tweets = spark.read\
    .option('encoding', 'utf-8')\
    .json(path)
How can Spark recognize them correctly? Thank you in advance.
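As background for any answer: the `\n` and `\uXXXX` sequences in the sample lines are standard JSON string escapes, so a JSON parser turns them into real characters, while HTML entities are not part of JSON and are left alone. The sketch below illustrates this with Python's standard `json` module rather than Spark; the record is a shortened stand-in for the sample data, and the `&amp;` entity in it is an assumption about how the ampersand appears in the raw file.

```python
import json

# Shortened stand-in for one JSONL record. The "&amp;" entity is an
# assumption about how the ampersand is stored in the raw data.
raw_line = r'{"full_text": "Hello everyone\n\nHow is it going? (and\u2026 &amp;"}'

record = json.loads(raw_line)
text = record["full_text"]

# JSON escapes are decoded by the parser: "\n" becomes a real line feed
# and "\u2026" becomes the ellipsis character.
has_real_newline = "\n" in text
has_ellipsis = "\u2026" in text

# HTML entities are not JSON escapes, so they pass through unchanged.
still_has_entity = "&amp;" in text

print(repr(text))
```

If the parsed strings already contain real line feeds, the `\n` seen in printed output may only be how the value is displayed, not how it is stored.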
The following code might help solve your problem.
Input used:
"Hello everyone\n\nHow is it going? Take care & enjoy"
"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &"
"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \n\u2b55\ufe0fThat means they can be PREVENTED @theNCI @NCIprevention @AmericanCancer @cancereu @uicc @IARCWHO @EuropeanCancer @KanserSavascisi @AUTF_DEKANLIK @OncoAlert"
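Code that solves the problem:
from pyspark.sql.functions import regexp_replace

# Read the file; with no commas in the input, each line lands in column _c0.
df = spark.read.csv("file:///home/sathya/Desktop/stackoverflo/raw-data/input.tweet")
# Remove "&" and the literal two-character sequence backslash + n.
df1 = df.withColumn("cleandata", regexp_replace("_c0", r"&|\\n", ""))
df1.select("cleandata").show(truncate=False)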
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cleandata |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|Hello everyoneHow is it going? Take care & enjoy |
|"full_text": "RT @OurWarOnCancer: Where is our FEDERAL vaccination education campaign for HPV?! Where is our FEDERAL #lungcancer screening program?! (and\u2026 &" |
|"full_text": "\u2b55\ufe0f#HPV is the most important cause of #CervicalCancer But it doesn't just cause cervical cancer (see the figure\ud83d\udc47) \u2b55\ufe0fThat means they can be PREVENTED @theNCI @NCIprevention @AmericanCancer @cancereu @uicc @IARCWHO @EuropeanCancer @KanserSavascisi @AUTF_DEKANLIK @OncoAlert"|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
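If the goal is to keep the content rather than delete it, the sequences could be decoded instead of removed. The following is a minimal sketch, assuming the column holds a literal backslash followed by `n` and HTML entities such as `&amp;`; `clean_tweet` is a hypothetical helper name, not part of PySpark. It is written in plain Python so it can be checked without a cluster.

```python
import html


def clean_tweet(text):
    """Turn literal backslash-n into a line feed and decode HTML entities."""
    if text is None:  # Spark passes nulls to Python UDFs as None
        return None
    # Replace the two-character sequence backslash + "n" with a real newline.
    text = text.replace("\\n", "\n")
    # Decode entities such as "&amp;" -> "&" and "&lt;" -> "<".
    return html.unescape(text)


# Sample modeled on the first input line; "&amp;" is an assumption.
sample = "Hello everyone\\n\\nHow is it going? Take care &amp; enjoy"
cleaned = clean_tweet(sample)
print(cleaned)
```

To apply it to the DataFrame above, the function could be wrapped as a UDF, for example `udf(clean_tweet, StringType())` with `udf` from `pyspark.sql.functions` and `StringType` from `pyspark.sql.types`, and passed the `_c0` column in `withColumn`.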