How to parse a CSV file with different line elements without using an external library?

Question

I'm trying to parse a CSV file in Python; after the first line, the number of elements per line grows from 6 to 7.

CSV example:

Title,Name,Job,Email,Address,ID
Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567
Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567

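I need a way to format the output and present it as a clean table.

As I understand it, the problem with my code is that from the second line onward the number of CSV elements grows from 6 to 7, so it throws the following error.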

print(stringFormat.format(item.split(',')[0], item.split(',')[1], item.split(',')[2],
                          item.split(',')[3], item.split(',')[4], item.split(',')[5],))
IndexError: list index out of range

My code:

stringFormat = "{:>10} {:>10} {:>10} {:>10} {:>10}  {:>10}"

with open("the_file", 'r') as file:
     for item in file.readlines():
            print(stringFormat.format(item.split(',')[0], item.split(',')[1],
                                      item.split(',')[2], item.split(',')[3],
                                      item.split(',')[4], item.split(',')[5],
                                      item.split(',')[6]))

Answer 1

You could try something like this. The for loop runs over the length of the split line, so it works with rows of varying length.

stringFormats = ["{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}"]

with open("the_file", 'r') as file:
    for item in file.readlines():
        s_item = item.split(',')
        f_item = ''
        for x in range(len(s_item)):
            f_item += stringFormats[x].format(s_item[x])
        print(f_item)

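Of course, you need at least as many format strings as the longest row has fields. If you never need a different format per column, you can change stringFormat back to a single string and apply it to every field instead of indexing into a list.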

stringFormat = "{:>10}"

with open("the_file", 'r') as file:
    for item in file.readlines():
        s_item = item.split(',')
        f_item = ''
        for a_field in s_item:
            f_item += stringFormat.format(a_field)
        print(f_item)
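
One thing to watch in the loops above: each line read from the file still ends in a newline, which becomes part of the last field and is padded along with it. Below is a minimal sketch of the same single-format idea that strips the newline first. The function name format_fields and its width parameter are invented here for illustration and are not part of the original answer. It still splits on every comma, so a quoted field such as "FirstName, LastName" is still cut in two.

```python
def format_fields(line, width=10):
    """Right-align every comma-separated field of one line to `width` characters."""
    # Drop the trailing newline so it does not end up inside the last field.
    fields = line.rstrip("\n").split(",")
    # One shared format for all fields, however many there are,
    # joined with a single space.
    return " ".join("{:>{w}}".format(field, w=width) for field in fields)


print(format_fields("Title,Name,Job\n"))
print(format_fields('Eng.,"FirstName, LastName",Engineer\n', width=12))
```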

Answer 2

You can do this with a very simple for loop, as shown below. I added print statements to show the effect.

# 'r' is not needed, it is the default value if omitted
with open("file_name") as infile:
    result = []
    # split the read() into a list of lines
    # I prefer this over readlines() as this removes the EOL character
    # automagically (I mean the `\n` char) 
    for line in infile.read().splitlines():
        # check if line is empty (stripping all spaces)
        if len(line.strip()) == 0: 
            continue
        # another way would be to check for ',' characters
        if ',' not in line:
            continue
        # set some helper variables
        line_result = []
        found_quote = False
        element = ""
        # iterate over the line by character
        for c in line:
            # toggle the found_quote if quote found
            if c == '"':
                found_quote = not found_quote
                continue
            if c == ",":
                if found_quote:
                    element += c
                else:
                    # append the element to the line_result and reset element
                    line_result.append(element)
                    element = ""
            else:
                # append c to the element
                element += c
        # append leftover element to the line_result
        line_result.append(element)
        
        # append the line_result to the final result
        result.append(line_result)
        print(len(line_result), line_result)


print('------------------------------------------------------------')
stringFormat = "{:>10} {:>20} {:>20} {:>20} {:>20}  {:>10}"

for line in result:
    print(stringFormat.format(*line))

Output

6 ['Title', 'Name', 'Job', 'Email', 'Address', 'ID']
6 ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com', 'ACME Company', '1234567']
6 ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com', 'ACME Company', '1234567']
------------------------------------------------------------
     Title                 Name                  Job                Email              Address          ID
      Eng.  FirstName, LastName             Engineer    email@company.com         ACME Company     1234567
      Eng.  FirstName, LastName             Engineer    email@company.com         ACME Company     1234567
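
The column widths in stringFormat were picked by hand. As a rough sketch of how they could be derived from the data instead, assuming the rows are lists of strings like the parsed result above and every row has the same number of fields (zip would silently drop extra fields otherwise). The helper name render_table is made up for this example.

```python
def render_table(rows):
    """Return the rows as right-aligned text lines, one width per column."""
    # The widest cell in each column decides that column's width.
    widths = [max(len(cell) for cell in column) for column in zip(*rows)]
    lines = []
    for row in rows:
        cells = ["{:>{w}}".format(cell, w=w) for cell, w in zip(row, widths)]
        # Two spaces between columns.
        lines.append("  ".join(cells))
    return lines


# Sample rows in the same shape as the parsed result.
rows = [
    ["Title", "Name", "Job"],
    ["Eng.", "FirstName, LastName", "Engineer"],
]
for line in render_table(rows):
    print(line)
```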

Some adjustments after our conversation

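A note on sorting a list of lists: it compares the first elements of the inner lists with each other. If they match, it compares the second elements, and so on. So you may want to move the ID column to the second position in the result list, since it appears to be the unique identifier (UID).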
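
To illustrate that comparison rule with a few made-up rows (the values are invented for this example): a plain sort orders by the first element and only looks at the second on a tie, while the key argument sorts by a chosen column without rearranging the rows.

```python
rows = [
    ["Eng.", "1234569", "B"],
    ["Dr.", "1234568", "C"],
    ["Eng.", "1234567", "A"],
]

# Default ordering: compare element 0 first, then element 1 on a tie.
by_position = sorted(rows)

# Alternative: sort on column 1 (the ID here) and leave the column order alone.
# The IDs are compared as strings, which works because they have equal length.
by_id = sorted(rows, key=lambda row: row[1])

print(by_position)
print(by_id)
```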

with open("file_name") as infile:
    lines = infile.read().splitlines()

# set the header and remove it from lines.
header = lines.pop(0).split(',')
# rearrange the header to put the last element (date) first
# -1 gets the last element (eg, count from end)
header.insert(0, header.pop(-1))

# store the header length as this will speed up the process for longer files
# otherwise you would have to call len(header) in each iteration of the loop
header_len = len(header)

result = []
for line in lines:
    if ',' not in line:
        continue
    # split the line once here, so we don't have to split it a million
    # times in the rest of the loop
    split_line = line.split(',')
    if len(split_line) > header_len:
        # note, you can remove the strip('"') if you want to keep the quotation marks
        # also note that .pop() removes the element "in place", which is why I
        # use .pop(1) twice. first time it gets firstname, second time it gets lastname
        split_line.insert(1, f"{split_line.pop(1)},{split_line.pop(1)}".strip('"'))
    # move the date element to the start
    split_line.insert(0, split_line.pop(-1))
    # do some slicing on the date element to turn it into YYYYMMDD as this allows for
    # proper sorting without any hassle. I'm assuming the date you provided is in the format
    # MM/DD/YYYY. You can easily move the order around if it's DD/MM/YYYY
    # Also, pad day/month with leading zero's using f"{string:>02}"
    split_line[0] = f"{split_line[0].split('/')[2]}{split_line[0].split('/')[0]:>02}{split_line[0].split('/')[1]:>02}"
    result.append(split_line)

# sort it. Since the date is in numeric format, and the first element, it sorts 
# properly automagically
result.sort()

# if you want, you can re-format the date again with some string slicing;
# since the date string is now a fixed-width YYYYMMDD this is easy to do.
# it has to happen after the sort() above, so it cannot go inside the initial loop.
# note: this writes the date as DD/MM/YYYY; swap the first two slices
# if you want the original MM/DD/YYYY order back
for line in result:
    line[0] = f"{line[0][6:]}/{line[0][4:6]}/{line[0][0:4]}"

# insert the header
result.insert(0, header)


stringFormat = "{:>10} {:>25} {:>20} {:>20} {:>20} {:>10} {:>10}"
for line in result:
    print(stringFormat.format(*line))


# write it as a CSV file with ; used as separator instead
with open("output.csv", "w") as outfile:
    for line in result:
        outfile.write(";".join(line) + "\n")
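
The date handling can also be kept out of the rows entirely by sorting with a key function, so nothing needs converting back afterwards. This is only a sketch under the same assumption as the code above (dates written as MM/DD/YYYY); date_key is a hypothetical helper name and the sample rows are invented.

```python
def date_key(date_text):
    """Turn 'M/D/YYYY' or 'MM/DD/YYYY' into a sortable 'YYYYMMDD' string."""
    month, day, year = date_text.split("/")
    # zfill pads single-digit months and days with a leading zero.
    return year + month.zfill(2) + day.zfill(2)


# Invented sample rows with the date in the first column.
rows = [
    ["12/1/2020", "B"],
    ["3/15/2021", "C"],
    ["2/28/2020", "A"],
]
# The rows keep their original date text; only the sort order uses the key.
rows.sort(key=lambda row: date_key(row[0]))
print(rows)
```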
