Python: function to iterate over a file or another row generator

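I have a number of functions that need either the path to a CSV file or a source that can be turned into a generator of CSV rows, such as standard input or a list (used for testing).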

I wrote this function:

import csv
import io

def process_doc_rows ( rows_source ):
    if isinstance ( rows_source, str ):
        # A string is taken to be a path: open the file and recurse on it
        with open ( rows_source ) as csvf:
            yield from process_doc_rows ( csvf )
        return

    elif isinstance ( rows_source, io.TextIOBase ):
        # This includes stdin
        rows_source = csv.reader ( rows_source, delimiter = "\t" )

    # Anything else is assumed to already be an iterable of rows
    yield from rows_source

which I can call like this:

import sys

def process_rows ( rows_src ):
  # enumerate supplies the row counter
  for l, row in enumerate ( process_doc_rows ( rows_src ) ):
    print ( "L:%d, Name:%s, Surname:%s" % ( l, row [ 0 ], row [ 1 ] ) )

process_rows ( "/path/to/file.tsv" )
process_rows ( [ ["John", "Smith"], ["Karl", "Marx"], ["Emmanuel", "Kant"] ] )
process_rows ( sys.stdin )

Now, my question is: am I reinventing the wheel? Is there a utility that does the same thing, so that I don't have to write it myself?

Edit

Thanks, I have rewritten the function above in a simpler way. The old version was:

def process_doc_rows ( rows_source ):
    if ( type ( rows_source ) is str ):
        # Open the file with the csv reader
        with open ( rows_source ) as csvf:
            for row in process_doc_rows ( csvf ):
                yield row
            return
    
    elif isinstance ( rows_source, io.TextIOWrapper ):
        # This includes stdin
        rows_source = csv.reader ( rows_source, delimiter = "\t" )
    
    elif isinstance ( rows_source, Iterable ) or isinstance ( rows_source, Iterator ):
        rows_source = (e for e in rows_source)
    
    elif isinstance ( rows_source, types.GeneratorType ):
        raise TypeError ( "This function wants a file or a CSV-like generator" )
    
    # We must do this in order to avoid the mix of return and yield
    for row in rows_source:
        yield row

Is there any utility around that does the same and that I don't have to rewrite?

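Not necessarily. APIs usually try to be more precisely defined than "filename or file object or iterator", but I would expect the ones that do work that way to handle it by hand as well.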

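Your system is overcomplicated, though (and incorrect): iterables, iterators and generators are all iterable, so you can simply let them fall through to the final loop (which is itself unnecessary, since yield from takes care of that).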
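To illustrate the point about iterables, here is a minimal sketch showing that a single yield from handles a list, an iterator and a generator alike, with no type checks. The name `passthrough` is hypothetical and used only for this illustration.

```python
def passthrough(rows_source):
    # `yield from` accepts any iterable, so no isinstance checks are needed.
    yield from rows_source


rows = [["John", "Smith"], ["Karl", "Marx"]]

from_list = list(passthrough(rows))                  # a list (plain iterable)
from_iterator = list(passthrough(iter(rows)))        # an explicit iterator
from_generator = list(passthrough(r for r in rows))  # a generator expression

# All three sources produce the same rows.
print(from_list == from_iterator == from_generator)
```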

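There is also a complication you don't handle: not all file-like objects are TextIOWrappers, StringIO being one example. I would suggest either checking for TextIOBase or looking for a read method (although that is not what csv.reader actually looks for, since it only needs an iterable that yields strings, and a read check would also let raw IO objects through).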

import csv
import io


def get_rows(rows_source):
    """ Obtains a rows iterator from the input:

    * a string is assumed to be a file path, and read as a TSV file
    * a file-like object is parsed as TSV
    * other iterables are returned as-is
    """
    if isinstance(rows_source, str):
        # newline="" is what the csv module recommends for files it reads
        with open(rows_source, encoding="utf-8", newline="") as csvf:
            yield from get_rows(csvf)
        return
    elif isinstance(rows_source, io.TextIOBase):
        rows_source = csv.reader(rows_source, delimiter="\t")

    yield from rows_source
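The difference between the two isinstance checks and the read-method check can be seen in a short sketch. It uses in-memory streams as stand-ins for real files; the variable names are only for illustration.

```python
import csv
import io

text_stream = io.StringIO("John\tSmith\n")
binary_stream = io.BytesIO(b"John\tSmith\n")

# StringIO is a text stream, but it is not a TextIOWrapper.
is_wrapper = isinstance(text_stream, io.TextIOWrapper)
is_text_base = isinstance(text_stream, io.TextIOBase)

# A binary stream is not a TextIOBase, yet it does have a read method,
# so a duck-typed "has read" check would let it through.
binary_is_text = isinstance(binary_stream, io.TextIOBase)
binary_has_read = hasattr(binary_stream, "read")

# csv.reader happily parses the StringIO, since it only iterates over lines.
parsed = list(csv.reader(text_stream, delimiter="\t"))

print(is_wrapper, is_text_base, binary_is_text, binary_has_read, parsed)
```

Checking for TextIOBase therefore accepts StringIO while still rejecting binary streams, which a read-method check would not.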

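If your function has to check the type of its argument before it can proceed, it is doing too much work. Your function should focus on iterating over an iterable and processing each of its elements.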

import csv


def process_rows(rows):
    for row in rows:
        # process_row stands for whatever handles a single row
        process_row(row)


with open("foo.csv", newline="") as f:
    process_rows(csv.reader(f))

An additional function can take care of splitting each line into fields:

def process_csv_rows(rows):
    process_rows(csv.reader(rows))


with open("foo.csv") as f:
    process_csv_rows(f)

Another can even open the file for you:

def process_csv_file(filename):
    with open(filename) as f:
        process_csv_rows(f)


process_csv_file("foo.csv")
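With this layering, the test case from the question (a plain list of rows) needs no special handling, and a stream such as sys.stdin goes through the CSV-aware wrapper. The following is a rough sketch under some assumptions: `process_row` is a hypothetical handler (the answer leaves it undefined), `process_tsv_lines` is an illustrative name, the tab delimiter is taken from the question, and the functions return their results here only so that the outcome is easy to inspect.

```python
import csv
import io


def process_row(row):
    # Hypothetical per-row handler, not defined in the original answer.
    return "Name:%s, Surname:%s" % (row[0], row[1])


def process_rows(rows):
    # Works on any iterable of rows that are already split into fields.
    return [process_row(row) for row in rows]


def process_tsv_lines(lines):
    # Splits tab-separated lines into fields, then delegates.
    return process_rows(csv.reader(lines, delimiter="\t"))


# A list of rows for tests goes straight to process_rows.
from_list = process_rows([["John", "Smith"], ["Karl", "Marx"]])

# Any iterable of text lines (an open file, sys.stdin, a StringIO) is parsed first.
from_stream = process_tsv_lines(io.StringIO("John\tSmith\nKarl\tMarx\n"))

print(from_list == from_stream)
```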