pyspark ignore linefeed character within a column in a csv file

I have a csv file in which one record has a linefeed inside one column (COMMENTS). When I read the file with pyspark, the record is spread across multiple lines (3). Even with the multiLine option it does not work. Below is my code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProviderAnalysis").master("local[*]").getOrCreate()
provider_df = (
    spark.read.csv("./sample_data/pp.csv", header=True, inferSchema=True, multiLine=True)
)
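A quick way to see the symptom (a minimal check, using the provider_df built above) is to count the rows that come back for the single record shown below:

# For the one-record sample file this returns 3 instead of 1,
# because the embedded linefeeds split the record across rows.
print(provider_df.count())
provider_df.show(truncate=False)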

Below is the record from the csv file, containing the linefeeds:

"A","B","C","D","COMMENTS","E","F","G","H","I","J","K","L","M","N","O","Q","R","S","T","U","V","X","Y","Z","AA","AB","AC","AD","AE","AF","AG","AH"
1,"S","S","R","Pxxxx xxx xxxx. xxxxx xxx ""xxxxx xxx xxxx."" xx xxx xxx xx xxx xxxxx xxxx xx 10/27/24.
xxx xxxxx xxxxx xxxx xxxxx xx 6/30/29 -yyy
10/26/2018 fffff ffffff ff: fffffff-ff","fff",,"","fff","ff","","f","","1","1","","",,"1","","","","","","","","","f","",5,"ffff","",""

If I open the file in the LibreOffice Calc application, it shows up as a single record, but pyspark reads it as 3 rows.

Has anyone run into this issue and/or can anyone help me fix it? Thanks.

Try adding the escape option. The quoted column (COMMENTS) contains embedded double quotes, so those quotes need to be escaped. Spark's CSV reader defaults to a backslash escape character, while this file uses the CSV convention of doubling the quote ("" inside a quoted field), so set escape to the double-quote character.

provider_df = spark.read.csv("./sample_data/pp.csv",
                             header=True,
                             inferSchema=True,
                             multiLine=True,
                             escape='"')  # <-- added