Read csv and join lines on an ASCII character in pyspark
I have a csv file in the following format -
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
I want to read it in pyspark. My code is -
schema = StructType([
StructField("Id", StringType()),
StructField("Sentence", StringType()),
])
df = sqlContext.read.format("com.databricks.spark.csv") \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.load("mycsv.csv")
But the result I get is -
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
...
I want to read it into 2 columns, one containing Id and the other Sentence.
The sentence parts should be joined on the ASCII character á, because I can see that the continuation is read on the next line without the delimiter being recognized.
My output should look like this -
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
I have only considered one id in the example.
What changes do I need to make in my code?
Just update Spark to 2.2 or later, if you haven't already, and use the multiline option:
df = spark.read \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.csv("mycsv.csv", multiLine=True)
If you do that, you can use regexp_replace to remove the á:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
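To see why the multiline option fixes the problem, here is a minimal stdlib-only sketch (no Spark needed) using Python's csv module, which — like Spark's multiLine reader — keeps a quoted field intact across physical lines. The sample string mirrors the question's file; note that after parsing, the field still contains "á" followed by a newline, so the replacement pattern below removes "á\n" together (with plain regexp_replace("Sentence", "á", "") in Spark you would still be left with the embedded newlines):

```python
import csv
import io
import re

# Sample mirroring the question's file: a quoted field spanning
# several physical lines, each wrap marked with a trailing "á".
raw = (
    'id1,"When I think about the short time that we live and relate it to á\n'
    'the periods of my life when I think that I did not use this á\n'
    'short time."\n'
)

# csv.reader honors the quotes, so the multi-line field stays whole.
rows = list(csv.reader(io.StringIO(raw)))

# Strip the "á" wrap marker together with the newline that follows it.
cleaned = [(rid, re.sub("á\n", "", text)) for rid, text in rows]

print(cleaned[0])
```

This is only an illustration of the parsing behavior; in Spark the same cleanup would be the regexp_replace shown above, applied after reading with multiLine=True.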