Python: function to iterate over a file or another row generator
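I have a number of functions that need either the path to a CSV file or a source that can be turned into a generator of CSV rows, such as standard input or a list (which I use for tests).
I wrote this function: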
import csv
import io

def process_doc_rows ( rows_source ):
    if isinstance ( rows_source, str ):
        # Open the file with the csv reader
        with open ( rows_source ) as csvf:
            yield from process_doc_rows ( csvf )
        return
    elif isinstance ( rows_source, io.TextIOBase ):
        # This includes stdin
        rows_source = csv.reader ( rows_source, delimiter = "\t" )
    yield from rows_source
Which I can call like this:
import sys

def process_rows ( rows_src ):
    l = 0
    for row in process_doc_rows ( rows_src ):
        print ( "L:%d, Name:%s, Surname:%s" % ( l, row [ 0 ], row [ 1 ] ) )
        l += 1

process_rows ( "/path/to/file.tsv" )
process_rows ( [ ["John", "Smith"], ["Karl", "Marx"], ["Emmanuel", "Kant"] ] )
process_rows ( sys.stdin )
Now, my question is: am I reinventing the wheel? Is there any utility around that does the same and that I don't have to rewrite?
EDIT
Thanks, I have rewritten the function above in a simpler way. The old version was:
import csv
import io
import types
from collections.abc import Iterable, Iterator

def process_doc_rows ( rows_source ):
    if ( type ( rows_source ) is str ):
        # Open the file with the csv reader
        with open ( rows_source ) as csvf:
            for row in process_doc_rows ( csvf ):
                yield row
        return
    elif isinstance ( rows_source, io.TextIOWrapper ):
        # This includes stdin
        rows_source = csv.reader ( rows_source, delimiter = "\t" )
    elif isinstance ( rows_source, Iterable ) or isinstance ( rows_source, Iterator ):
        rows_source = (e for e in rows_source)
    elif isinstance ( rows_source, types.GeneratorType ):
        raise TypeError ( "This function wants a file or a CSV-like generator" )
    # We must do this in order to avoid the mix of return and yield
    for row in rows_source:
        yield row
Is there any utility around that does the same and that I don't have to rewrite?
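Not really. APIs generally try to be more precisely defined than "a filename, or a file object, or an iterator", though I would expect the APIs that do work this way to handle it by hand as well.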
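Your system is overcomplicated though (and wrong): iterables, iterators and generators are all iterable, so you can simply leave them for the final loop (which is itself unnecessary, since yield from takes care of that). For instance, in the old version the Iterable check comes before the GeneratorType check, so a generator never reaches the branch that raises TypeError.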
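To illustrate the point about the final loop, here is a minimal sketch showing that a single yield from handles a list, a plain iterator and a generator alike, and that a generator already counts as an Iterable. The name passthrough and the sample rows are made up for this example.

```python
import types
from collections.abc import Iterable, Iterator


def passthrough(source):
    # Hypothetical helper: yield from delegates to any iterable,
    # so no per-type branching is needed.
    yield from source


rows = [["John", "Smith"], ["Karl", "Marx"]]

from_list = list(passthrough(rows))                   # a list
from_iterator = list(passthrough(iter(rows)))         # a plain iterator
from_generator = list(passthrough(r for r in rows))   # a generator

gen = (r for r in rows)
# A generator is an Iterator, and every Iterator is an Iterable, so an
# isinstance(..., Iterable) branch placed first also catches generators.
gen_is_generator = isinstance(gen, types.GeneratorType)
gen_is_iterator = isinstance(gen, Iterator)
gen_is_iterable = isinstance(gen, Iterable)
```

All three calls produce the same rows, which is why the type checks for iterables, iterators and generators can simply be dropped.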
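There is also a complication you are not handling: not all file-like objects are TextIOWrappers, StringIO for example. I would suggest either checking for TextIOBase or looking for a read method (although that is not what csv.reader actually looks for, since it only iterates over its input, and a read check would also let raw IO objects through).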
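The difference between these checks can be seen with in-memory streams. The following is a small sketch using only the standard library; the sample data and variable names are invented for the illustration.

```python
import csv
import io

buf = io.StringIO("John\tSmith\nKarl\tMarx\n")

# StringIO is a text stream, but it is not a TextIOWrapper.
is_wrapper = isinstance(buf, io.TextIOWrapper)
is_text_base = isinstance(buf, io.TextIOBase)

# Binary streams have a read method too, so a hasattr check alone
# would let them through even though csv.reader needs text.
binary = io.BytesIO(b"John\tSmith\n")
binary_has_read = hasattr(binary, "read")
binary_is_text = isinstance(binary, io.TextIOBase)

# csv.reader only needs an iterable that yields strings, so both a
# text stream and a plain list of lines are accepted.
rows_from_stream = list(csv.reader(buf, delimiter="\t"))
rows_from_lines = list(
    csv.reader(["John\tSmith", "Karl\tMarx"], delimiter="\t")
)
```

Checking for TextIOBase accepts the StringIO and rejects the binary stream, which is the behaviour the function needs.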
import csv
import io

def get_rows(rows_source):
    """ Obtains a rows iterator from the input:
    * a string is assumed to be a file path, and read as a TSV file
    * a file-like object is parsed as TSV
    * other iterables are returned as-is
    """
    if isinstance(rows_source, str):
        with open(rows_source, encoding="utf-8") as csvf:
            # Recurse with the open file so it is parsed as TSV
            yield from get_rows(csvf)
        return
    elif isinstance(rows_source, io.TextIOBase):
        rows_source = csv.reader(rows_source, delimiter="\t")
    yield from rows_source
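As a quick check of the three cases listed in the docstring, the sketch below runs a lightly adapted copy of get_rows against a list, an in-memory stream and a temporary file. The copy recurses into itself and passes newline="" to open, as the csv module documentation recommends; the sample data and the temporary file are assumptions made for the demo.

```python
import csv
import io
import os
import tempfile


def get_rows(rows_source):
    # Adapted copy, repeated here so the snippet runs on its own.
    if isinstance(rows_source, str):
        with open(rows_source, encoding="utf-8", newline="") as csvf:
            yield from get_rows(csvf)
        return
    elif isinstance(rows_source, io.TextIOBase):
        rows_source = csv.reader(rows_source, delimiter="\t")
    yield from rows_source


expected = [["John", "Smith"], ["Karl", "Marx"]]
tsv_text = "John\tSmith\nKarl\tMarx\n"

# 1. An in-memory list, as used in tests: passed through unchanged.
from_list = list(get_rows(expected))

# 2. A file-like object (StringIO stands in for sys.stdin here).
from_stream = list(get_rows(io.StringIO(tsv_text)))

# 3. A path to a TSV file (a temporary file created for the demo).
with tempfile.NamedTemporaryFile(
    "w", suffix=".tsv", delete=False, encoding="utf-8", newline=""
) as tmp:
    tmp.write(tsv_text)
try:
    from_path = list(get_rows(tmp.name))
finally:
    os.remove(tmp.name)
```

Because get_rows is a generator, the file in the third case stays open only while the rows are being consumed.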
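If your function has to check the type of its argument before it can proceed, it is doing too much. Your function should focus on iterating over an iterable and processing each of its elements.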
import csv

def process_rows(rows):
    for row in rows:
        process_row(row)  # process_row stands for whatever per-row work you need

with open("foo.csv") as f:
    process_rows(csv.reader(f))
An additional function can take care of splitting each line into fields:
def process_csv_rows(rows):
    process_rows(csv.reader(rows))

with open("foo.csv") as f:
    process_csv_rows(f)
Another can even open the file for you:
def process_csv_file(filename):
    with open(filename) as f:
        process_csv_rows(f)

process_csv_file("foo.csv")
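Applied to the three call sites in the question, this layering might look like the sketch below. It is only an illustration under some assumptions: the input is tab-separated, process_row is a hypothetical per-row step that formats two fields, and the functions return their results so they are easy to check.

```python
import csv
import io


def process_row(row):
    # Hypothetical per-row step: format the first two fields.
    return "Name:%s, Surname:%s" % (row[0], row[1])


def process_rows(rows):
    # Core logic: works on any iterable of already-split rows.
    return [process_row(row) for row in rows]


def process_tsv_lines(lines):
    # Thin adapter: split tab-separated lines, then delegate.
    return process_rows(csv.reader(lines, delimiter="\t"))


def process_tsv_file(filename):
    # Thin adapter: open the file, then delegate.
    with open(filename, encoding="utf-8", newline="") as f:
        return process_tsv_lines(f)


# Test data goes straight to the core function...
from_list = process_rows([["John", "Smith"], ["Karl", "Marx"]])

# ...while a stream (sys.stdin in the question, StringIO here) goes
# through the line adapter.
from_stream = process_tsv_lines(io.StringIO("John\tSmith\nKarl\tMarx\n"))
```

The caller picks the entry point that matches what it has, so none of the functions needs an isinstance check.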