Removing a row from CSV with python if data wasn't recorded in a column

Question

I'm trying to import a batch of CSVs into PostgreSQL and keep running into a missing-data error:

psycopg2.DataError: missing data for column "column_name" CONTEXT:
COPY table_name, line where ever in the CSV that data wasn't
recorded, and here are data values up to the missing column.

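Sometimes a complete data set can't be written to a row, and I have to process the files as they are. I'm trying to figure out a way to remove a row if data wasn't recorded in any of its columns. Here is what I have: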

file_list = glob.glob(path)

for f in file_list:
    filename = os.path.basename(f) #get the file name
    arc_csv = arc_path + filename #path for revised copy of CSV

    with open(f, 'r') as inp, open(arc_csv, 'wb') as out:
        writer = csv.writer(out)
        for line in csv.reader(inp):
            if "" not in line: #if the row doesn't have any empty fields
                writer.writerow(line)

    cursor.execute("COPY table_name FROM %s WITH CSV HEADER DELIMITER ','",(arc_csv,))

Answer 1

Unfortunately, you cannot parameterize table or column names. Use string formatting instead, but make sure you validate and escape the value properly:

# The file path must appear in the statement as a quoted SQL string literal.
cursor.execute("COPY table_name FROM '{file_path}' WITH CSV HEADER DELIMITER ','".format(file_path=arc_csv))
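
The validation step this answer asks for could look roughly like the sketch below. It is only an illustration under assumptions: the helper names `build_copy_statement` and `copy_csv_from_client` and the identifier whitelist are hypothetical, and the sketch swaps the server-side file path for `COPY ... FROM STDIN` through psycopg2's `cursor.copy_expert()`, so the file is streamed from the client and no path has to be formatted into the SQL at all.

```python
import re

# Conservative whitelist for unquoted SQL identifiers.
# This pattern is an assumption; adjust it to your own naming rules.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_copy_statement(table_name):
    """Return a COPY ... FROM STDIN statement after validating the table name."""
    if not _IDENTIFIER.fullmatch(table_name):
        raise ValueError("unsafe table name: %r" % (table_name,))
    return "COPY {} FROM STDIN WITH CSV HEADER DELIMITER ','".format(table_name)


def copy_csv_from_client(cursor, table_name, csv_path):
    """Stream a local CSV file to the server with cursor.copy_expert()."""
    statement = build_copy_statement(table_name)
    with open(csv_path, "r", newline="") as handle:
        # copy_expert reads from the open file object and sends it to the server.
        cursor.copy_expert(statement, handle)
```

Inside the question's loop this would be called as `copy_csv_from_client(cursor, "table_name", arc_csv)`. Because the only interpolated value is a table name that passed the whitelist, nothing else needs escaping.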

Answer 2

You can use pandas to remove rows with missing values:

import glob, os, pandas

file_list = glob.glob(path)

for f in file_list:
    filename = os.path.basename(f)
    arc_csv = arc_path + filename
    data = pandas.read_csv(f, index_col=0)
    ind = data.apply(lambda x: not pandas.isnull(x.values).any(), axis=1)
    # ^ provides an index of all rows with no missing data
    data[ind].to_csv(arc_csv) # writes the revised data to csv

However, this may become slow if you are working with large data sets.

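Edit - added index_col=0 as an argument to pandas.read_csv() to prevent an extra index column from being added to the output. This uses the first column in the CSV as the existing index. If you have a reason not to use the first column as the index, replace 0 with the number of another column.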
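
For the large-file case mentioned above, a streaming filter with the standard csv module is one possible alternative. This is a minimal sketch, not part of the original answer: the function name `drop_incomplete_rows` is hypothetical, and it assumes the header row defines the expected number of columns and that a blank or whitespace-only field means the value wasn't recorded. Unlike a check for empty strings alone, it also drops rows that have fewer fields than the header, which is the situation the "missing data for column" error describes.

```python
import csv


def drop_incomplete_rows(src_path, dst_path):
    """Copy src_path to dst_path, keeping only complete rows.

    Returns the number of rows that were dropped.
    """
    dropped = 0
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:
            return 0  # empty input file: nothing to copy
        writer.writerow(header)
        for row in reader:
            # Keep the row only if it has one field per header column
            # and no field is blank (whitespace-only counts as blank).
            if len(row) == len(header) and all(field.strip() for field in row):
                writer.writerow(row)
            else:
                dropped += 1
    return dropped
```

In the loop from the question, `drop_incomplete_rows(f, arc_csv)` would take the place of the reader/writer block. Rows are processed one at a time, so memory use stays flat regardless of file size.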

python

csv

postgresql

psycopg2