Read csv and join lines on an ASCII character in pyspark
I have a csv file in the following format -
id1,"When I think about the short time that we live and relate it to á
the periods of my life when I think that I did not use this á
short time."
id2,"[ On days when I feel close to my partner and other friends. á
When I feel at peace with myself and also experience a close á
contact with people whom I regard greatly.]"
I want to read it in pyspark. My code is -
schema = StructType([
StructField("Id", StringType()),
StructField("Sentence", StringType()),
])
df = sqlContext.read.format("com.databricks.spark.csv") \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.load("mycsv.csv")
But the result I get is -
+--------------------------------------------------------------+-------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+-------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to á |
|the periods of my life when I think that I did not use this á |null |
|short time. |" |
...
I want to read it into 2 columns, one containing Id and the other Sentence.
The sentence parts should be joined on the ASCII character á, because I can see that the continuation is read on the next line without the delimiter being recognized.
My output should look like this -
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
| Id | Sentence |
+--------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------+
|id1, |When I think about the short time that we live and relate it to the periods of my life when I think that I did not use this short time. |
I have only considered one id in the example.
What changes do I need to make in my code?
Just update Spark to 2.2 or later, if you haven't already, and use the multiline option:
df = spark.read \
.option("header", "false") \
.option("inferSchema", "false") \
.option("delimiter", "\"") \
.schema(schema) \
.csv("mycsv.csv", multiLine=True)
If you do that, you can use regexp_replace to remove the á:
from pyspark.sql.functions import regexp_replace

df = df.withColumn("Sentence", regexp_replace("Sentence", "á", ""))
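To see why the multiline option fixes the problem, here is a minimal stdlib-only sketch (no Spark needed) using Python's csv module, which — like Spark's multiLine reader — keeps a quoted field intact across physical lines. The sample string mirrors the question's file; note that after parsing, the field still contains "á" followed by a newline, so the replacement pattern below removes "á\n" together (with plain regexp_replace("Sentence", "á", "") in Spark you would still be left with the embedded newlines):

```python
import csv
import io
import re

# Sample mirroring the question's file: a quoted field spanning
# several physical lines, each wrap marked with a trailing "á".
raw = (
    'id1,"When I think about the short time that we live and relate it to á\n'
    'the periods of my life when I think that I did not use this á\n'
    'short time."\n'
)

# csv.reader honors the quotes, so the multi-line field stays whole.
rows = list(csv.reader(io.StringIO(raw)))

# Strip the "á" wrap marker together with the newline that follows it.
cleaned = [(rid, re.sub("á\n", "", text)) for rid, text in rows]

print(cleaned[0])
```

This is only an illustration of the parsing behavior; in Spark the same cleanup would be the regexp_replace shown above, applied after reading with multiLine=True.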