ParseError: Error tokenizing data. C error: Expected 50 fields in line 224599, saw 51
I'm trying to pd.concat multiple .xlsx files into a master CSV, and then merge that CSV with past CPU data, which is also in CSV format.
The first operation succeeds (operation 3 of 8), but during the second one (history + current data in CSV format, operation 7 of 8) I get the ParseError shown below.
I've checked both files and there doesn't seem to be any delimiter conflict; the data is in the correct columns, etc.
Error tokenizing data. C error: Expected 50 fields in line 224599, saw 51
My code is as follows:
import pandas as pd
import os
import glob

def sremove(fn):
    # Delete the file only if it exists
    if os.path.exists(fn):
        os.remove(fn)

def mergeit():
    df = pd.concat(pd.read_excel(fl) for fl in path1)
    df.to_csv(path2, index=False)

def mergeit2():
    df = pd.concat(pd.read_csv(fl) for fl in path1)
    df.to_csv(path2, index=False)
print("\n#Operation 3 - Incidents Dataset")
print("Incidents Dataset operation has started")

fn = r"S:\CPU CacheU Data\201920\Incidents_201920.csv"
sremove(fn)
print("Incidents 2019/20 file has been deleted - Operation 1 of 8")

path1 = glob.glob(r'S:\*CPU CacheU Data\*Inc Dataset\Incidents Dataset*.xlsx')
print("Path 1 - Incidents 2019/20 folder has been read successfully - Operation 2 of 8")

path2 = r"S:\CPU CacheU Data\Incidents_201920.csv"
print("Path 2 - Incidents 2019/20 Dataset File has been read successfully - Operation 3 of 8")

mergeit()
print("Action has been completed successfully - Incidents Dataset 2019/20 Updated - Operation 4 of 8")

fn = r"S:\CPU CacheU Data\Incidents_Dataset.csv"
sremove(fn)
print("Incidents Dataset Old file has been deleted - Operation 5 of 8")

path1 = glob.glob(r'S:\*CPU CacheU Data\*Incidents_*.csv')
print("Path 1 - Incidents folder has been read successfully - Operation 6 of 8")

path2 = r"S:\CPU CacheU Data\Incidents_Dataset.csv"
print("Path 2 - Incidents Dataset File has been read successfully - Operation 7 of 8")

mergeit2()
print("Path 2 - Incidents Dataset File has been updated successfully - Operation 8 of 8")
Some notes:
1) Operation 3 of 8 takes a very long time to run. I'm not sure whether that is due to the xlsx-to-csv conversion.
2) I tried adding error_bad_lines = False inside mergeit2(), but generating the master file still seems to take a very long time.
Check the delimiter in your CSV files; there may be extra commas inside cells. read_csv uses sep=',' by default.
Probably you should set a different separator when opening your CSV files:
pd.read_csv(sep=' ')
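If skipping the malformed rows is acceptable, note that `error_bad_lines` was deprecated in pandas 1.3 in favor of the `on_bad_lines` parameter. A minimal sketch, assuming a recent pandas (the inline sample data is illustrative):

```python
import io
import pandas as pd

# A tiny inline CSV: the second data row has an extra, third field.
data = "a,b\n1,2\n3,4,5\n6,7\n"

# on_bad_lines (pandas >= 1.3) replaces error_bad_lines/warn_bad_lines:
# "skip" silently drops malformed rows, "warn" skips and reports them.
df = pd.read_csv(io.StringIO(data), on_bad_lines="skip")
print(len(df))  # the malformed row is gone
```

With `on_bad_lines="warn"` instead, the skipped line numbers are reported, which would help locate line 224599's problem without aborting the whole run.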