How to parse a CSV file with different line elements without using an external library?
I am trying to parse a CSV file in Python; the number of elements per line increases from 6 to 7 after the first line.
CSV sample:
Title,Name,Job,Email,Address,ID
Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567
Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567
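I need a way to format the output and present it as a clean table.
As I understand it, the problem with my code is that from the second line onward the number of CSV elements increases from 6 to 7 (the quoted name contains a comma), so it throws the following error.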
print(stringFormat.format(item.split(',')[0], item.split(',')[1], item.split(',')[2],
item.split(',')[3], item.split(',')[4], item.split(',')[5],))
IndexError: list index out of range
My code:
stringFormat = "{:>10} {:>10} {:>10} {:>10} {:>10} {:>10}"
with open("the_file", 'r') as file:
    for item in file.readlines():
        print(stringFormat.format(item.split(',')[0], item.split(',')[1],
                                  item.split(',')[2], item.split(',')[3],
                                  item.split(',')[4], item.split(',')[5],
                                  item.split(',')[6]))
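You can try something like this. The for loop runs over the length of the split item, so it works with lines of variable length.
stringFormats = ["{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}", "{:>10}"]
with open("the_file", 'r') as file:
    for item in file.readlines():
        # drop the trailing newline so print() does not emit a blank line
        s_item = item.rstrip('\n').split(',')
        f_item = ''
        for x in range(len(s_item)):
            f_item += stringFormats[x].format(s_item[x])
        print(f_item)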
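Of course, you need at least as many format strings as the longest line has fields. If you never need a different format per column, you can change stringFormat back to a single string and loop over the fields instead.
stringFormat = "{:>10}"
with open("the_file", 'r') as file:
    for item in file.readlines():
        # drop the trailing newline so print() does not emit a blank line
        s_item = item.rstrip('\n').split(',')
        f_item = ''
        for a_field in s_item:
            f_item += stringFormat.format(a_field)
        print(f_item)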
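If the column widths are not known in advance, one option is to measure them before printing. The following is only a sketch, assuming the rows have already been split into lists of strings; the helper name `format_table` and its `padding` parameter are made up for this illustration.

```python
def format_table(rows, padding=2):
    """Right-align every column to the width of its longest cell."""
    # The column count comes from the longest row, so shorter rows are fine.
    n_cols = max(len(row) for row in rows)
    widths = [0] * n_cols
    for row in rows:
        for i, cell in enumerate(row):
            widths[i] = max(widths[i], len(cell))
    lines = []
    for row in rows:
        cells = [cell.rjust(widths[i] + padding) for i, cell in enumerate(row)]
        lines.append("".join(cells))
    return lines


rows = [
    ["Title", "Name", "ID"],
    ["Eng.", "FirstName, LastName", "1234567"],
    ["Dr.", "A, B"],  # a shorter row is handled as well
]
for line in format_table(rows):
    print(line)
```

This trades a second pass over the data for not having to hard-code a width such as `{:>10}`, which truncates nothing but misaligns as soon as a field is longer than the chosen width.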
You can do this with a very simple for loop, as shown below. I added print statements to show the effect.
# 'r' is not needed, it is the default value if omitted
with open("file_name") as infile:
    result = []
    # split the read() into a list of lines
    # I prefer this over readlines() as this removes the EOL character
    # automagically (I mean the `\n` char)
    for line in infile.read().splitlines():
        # check if line is empty (stripping all spaces)
        if len(line.strip()) == 0:
            continue
        # another way would be to check for ',' characters
        if ',' not in line:
            continue
        # set some helper variables
        line_result = []
        found_quote = False
        element = ""
        # iterate over the line by character
        for c in line:
            # toggle the found_quote if quote found
            if c == '"':
                found_quote = not found_quote
                continue
            if c == ",":
                if found_quote:
                    element += c
                else:
                    # append the element to the line_result and reset element
                    line_result.append(element)
                    element = ""
            else:
                # append c to the element
                element += c
        # append leftover element to the line_result
        line_result.append(element)
        # append the line_result to the final result
        result.append(line_result)
        print(len(line_result), line_result)
print('------------------------------------------------------------')
stringFormat = "{:>10} {:>20} {:>20} {:>20} {:>20} {:>10}"
for line in result:
    print(stringFormat.format(*line))
Output
6 ['Title', 'Name', 'Job', 'Email', 'Address', 'ID']
6 ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com', 'ACME Company', '1234567']
6 ['Eng.', 'FirstName, LastName', 'Engineer', 'email@company.com', 'ACME Company', '1234567']
------------------------------------------------------------
     Title                 Name                  Job                Email              Address         ID
      Eng.  FirstName, LastName             Engineer    email@company.com         ACME Company    1234567
      Eng.  FirstName, LastName             Engineer    email@company.com         ACME Company    1234567
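The same character-by-character idea can be wrapped in a small function so it is easy to reuse and test. This is a sketch rather than a full CSV parser: the name `split_csv_line` and its parameters are chosen for this example, and it assumes quotes are only used to wrap whole fields (no escaped `""` inside a field, no fields spanning several lines).

```python
def split_csv_line(line, delimiter=",", quote='"'):
    """Split one CSV line, keeping delimiters that sit inside quotes."""
    fields = []
    current = []
    in_quotes = False
    for ch in line:
        if ch == quote:
            # entering or leaving a quoted section; the quote itself is dropped
            in_quotes = not in_quotes
        elif ch == delimiter and not in_quotes:
            # an unquoted delimiter ends the current field
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    # whatever is left after the last delimiter is the final field
    fields.append("".join(current))
    return fields


sample = 'Eng.,"FirstName, LastName",Engineer,email@company.com,ACME Company,1234567'
print(split_csv_line(sample))
```

With the sample line from the question this yields six fields, because the comma inside the quoted name is kept as part of the field instead of splitting it.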
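Some adjustments after the chat
A note on sorting a list of lists: sort() compares the first elements of the inner lists with each other. If they are equal, it compares the second elements, and so on. You may therefore want to move the ID column to the second position in the result list, since it appears to be the unique identifier (UID).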
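The comparison rule described above can be seen in a tiny example. The rows here are invented for illustration; the second call shows that a `key` function is an alternative to physically moving a column to the front.

```python
rows = [
    ["Eng.", "B", "2"],
    ["Dr.", "Z", "3"],
    ["Eng.", "A", "1"],
]

# Default sort: compare element 0, then element 1 on ties, and so on.
by_default = sorted(rows)

# Sorting on one chosen column (here the last one) with a key function.
by_last = sorted(rows, key=lambda row: row[-1])

print(by_default)
print(by_last)
```

Keep in mind that the fields are strings, so they are compared as text: "10" sorts before "9" unless the key converts the value to a number first.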
with open("file_name") as infile:
    lines = infile.read().splitlines()
# set the header and remove it from lines.
header = lines.pop(0).split(',')
# rearrange the header to put the last element (date) first
# -1 gets the last element (eg, count from end)
header.insert(0, header.pop(-1))
# store the header length as this will speed up the process for longer files
# otherwise you would have to call len(header) in each iteration of the loop
header_len = len(header)
result = []
for line in lines:
    if ',' not in line:
        continue
    # split the line once here, so we don't have to split it a million
    # times in the rest of the loop
    split_line = line.split(',')
    if len(split_line) > header_len:
        # note, you can remove the strip('"') if you want to keep the quotation marks
        # also note that .pop() removes the element "in place", which is why I
        # use .pop(1) twice. first time it gets firstname, second time it gets lastname
        split_line.insert(1, f"{split_line.pop(1)},{split_line.pop(1)}".strip('"'))
    # move the date element to the start
    split_line.insert(0, split_line.pop(-1))
    # do some slicing on the date element to turn it into YYYYMMDD as this allows for
    # proper sorting without any hassle. I'm assuming the date you provided is in the format
    # MM/DD/YYYY. You can easily move the order around if it's DD/MM/YYYY
    # Also, pad day/month with leading zero's using f"{string:>02}"
    split_line[0] = f"{split_line[0].split('/')[2]}{split_line[0].split('/')[0]:>02}{split_line[0].split('/')[1]:>02}"
    result.append(split_line)
# sort it. Since the date is in numeric format, and the first element, it sorts
# properly automagically
result.sort()
# if you want you can re-format the date again. you can do so with some list slicing
# since the date string is now properly formatted this is very easy to do
# because the sort() above happens outside the initial loop, we cannot do it inside said loop
for line in result:
    line[0] = f"{line[0][6:]}/{line[0][4:6]}/{line[0][0:4]}"
# insert the header
result.insert(0, header)
stringFormat = "{:>10} {:>25} {:>20} {:>20} {:>20} {:>10} {:>10}"
for line in result:
    print(stringFormat.format(*line))
# write it as a CSV file with ; used as separator instead
with open("output.csv", "w") as outfile:
    for line in result:
        outfile.write(";".join(line) + "\n")
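As a possible variation, the date does not have to be rewritten to YYYYMMDD and back just to sort on it; a key function can do the conversion on the fly. The sketch below makes the same assumption as the code above (dates in MM/DD/YYYY form), places the date in the last column, and uses invented sample rows; `date_key` is a name chosen for this example.

```python
def date_key(text):
    """Turn 'MM/DD/YYYY' into a sortable (year, month, day) tuple of ints."""
    month, day, year = text.split("/")
    return int(year), int(month), int(day)


rows = [
    ["Eng.", "12/1/2021"],
    ["Eng.", "3/15/2020"],
    ["Eng.", "3/2/2020"],
]
# sort in place on the date column without modifying the stored strings
rows.sort(key=lambda row: date_key(row[-1]))
print(rows)
```

Because the key compares integers, no zero padding is needed and the original date strings stay untouched in the output.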