How to keep CSV headers on the chunk files after script split?
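I need help modifying this script so that the output chunk files include the headers. The script takes some input to determine how many rows go into each file when splitting. The output files do not contain the headers from the original file. I'm looking for advice on how to implement this.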
import csv
import os
import sys

os_path = os.path
csv_writer = csv.writer
sys_exit = sys.exit

if __name__ == '__main__':
    try:
        chunk_size = int(input('Input number of rows of one chunk file: '))
    except ValueError:
        print('Number of rows must be integer. Close.')
        sys_exit()

    file_path = input('Input path to .tsv file for splitting on chunks: ')
    if (
        not os_path.isfile(file_path) or
        not file_path.endswith('.tsv')
    ):
        print('You must input path to .tsv file for splitting.')
        sys_exit()

    file_name = os_path.splitext(file_path)[0]
    with open(file_path, 'r', newline='', encoding='utf-8') as tsv_file:
        chunk_file = None
        writer = None
        counter = 1
        reader = csv.reader(tsv_file, delimiter='\t', quotechar='\'')

        for index, chunk in enumerate(reader):
            if index % chunk_size == 0:
                if chunk_file is not None:
                    chunk_file.close()

                chunk_name = '{0}_{1}.tsv'.format(file_name, counter)
                chunk_file = open(chunk_name, 'w', newline='', encoding='utf-8')
                counter += 1
                writer = csv_writer(chunk_file, delimiter='\t', quotechar='\'')
                print('File "{}" complete.'.format(chunk_name))

            writer.writerow(chunk)
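You can do it by manually reading the header row when you open the input file, and then writing it at the beginning of each output file. See the ADDED comments in the code below: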
...
    with open(file_path, 'r', newline='', encoding='utf-8') as tsv_file:
        chunk_file = None
        writer = None
        counter = 1
        reader = csv.reader(tsv_file, delimiter='\t', quotechar="'")
        header = next(reader)  # Read and save header row. (ADDED)

        for index, chunk in enumerate(reader):
            if index % chunk_size == 0:
                if chunk_file is not None:
                    chunk_file.close()

                chunk_name = '{0}_{1}.tsv'.format(file_name, counter)
                chunk_file = open(chunk_name, 'w', newline='', encoding='utf-8')
                writer = csv_writer(chunk_file, delimiter='\t', quotechar="'")
                writer.writerow(header)  # ADDED.
                print('File "{}" complete.'.format(chunk_name))
                counter += 1

            writer.writerow(chunk)
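Note: using single-quote characters for quoting means your output files do not follow the CSV standard, RFC 4180, which specifies double quotes for quoted fields.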
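For reference, the same idea can be packaged as a function. This is only a sketch, not part of the original script: the name `split_with_header` and its parameters are made up for illustration, and it swaps the `index % chunk_size` bookkeeping for `itertools.islice`, which pulls up to `chunk_size` rows at a time from the reader. It also opens each chunk in a `with` block so that the last chunk file gets closed too.

```python
import csv
import itertools
import os


def split_with_header(file_path, chunk_size, delimiter='\t', quotechar="'"):
    """Split file_path into chunk files of at most chunk_size data rows.

    Every chunk file starts with the header row of the input file.
    Returns the list of chunk file paths that were written.
    (Hypothetical helper, named here for illustration only.)
    """
    if chunk_size < 1:
        raise ValueError('chunk_size must be a positive integer')

    base_name, extension = os.path.splitext(file_path)
    written = []

    with open(file_path, 'r', newline='', encoding='utf-8') as source:
        reader = csv.reader(source, delimiter=delimiter, quotechar=quotechar)
        header = next(reader, None)  # None means the input file is empty.
        if header is None:
            return written

        for counter in itertools.count(1):
            # Take the next chunk_size rows; an empty list means we are done.
            rows = list(itertools.islice(reader, chunk_size))
            if not rows:
                break

            chunk_name = '{0}_{1}{2}'.format(base_name, counter, extension)
            with open(chunk_name, 'w', newline='', encoding='utf-8') as chunk_file:
                writer = csv.writer(chunk_file, delimiter=delimiter,
                                    quotechar=quotechar)
                writer.writerow(header)  # Header first, then the data rows.
                writer.writerows(rows)
            written.append(chunk_name)

    return written
```

Calling `split_with_header('data.tsv', 1000)` (with a hypothetical `data.tsv`) would write `data_1.tsv`, `data_2.tsv`, and so on next to the input file. Note that each chunk is held in memory as a list before it is written, so a very large `chunk_size` uses more memory than the row-by-row loop above.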