I have multiple CSV files. Each CSV file contains multiple tables with multiple headers. How do I get the table whose header contains a given specific value?

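I have multiple CSV files (4000) in a folder. Each CSV file has data like the following. The data length, the number of header rows, and the number of headers per row may differ from file to file. There are multiple tables and headers, and every table starts with the same column "a". I want to get the table whose header contains "apple", together with its values.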

Input


a   b   c   d   e   f   g   h   i           
1   2   3   4   5   6   7   8   9           
a   b1  c1  d1  e1  f1  g1                  
1   2   3   4   5   6   7                   
a   b2  c2  d2  e2  f2  g2  h2  i2  k2  l2  
3   5   6   7   3   4   5   6   7   7   0   
a   b3  d3  e3  g23 t4  apple   r4  w2  r5  t6  
1   2   3   4   5   6   7   8   9   1   1   2
1   2   3   4   5   6   7   8   9   10  1   2
1   2   3   4   5   6   7   8   9   11  1   2
1   2   3   4   5   6   7   8   9   12  1   2
1   2   3   4   5   6   7   8   9   13  1   2
1   2   3   4   5   6   7   8   9   14  1   2
1   2   3   4   5   6   7   8   9   15  1   2
1   2   3   4   5   6   7   8   9   16  1   2
1   2   3   4   5   6   7   8   9   17  1   2
1   2   3   4   5   6   7   8   9   18  1   2
a   b   c   d   e   f   g   h   i           
1   2   3   4   5   6   7   8   9           

Final output

a   b3  d3  e3  g23 t4  apple   r4  w2  r5  t6
1   2   3   4   5   6   7   8   9   1   1   2
1   2   3   4   5   6   7   8   9   10  1   2
1   2   3   4   5   6   7   8   9   11  1   2
1   2   3   4   5   6   7   8   9   12  1   2
1   2   3   4   5   6   7   8   9   13  1   2
1   2   3   4   5   6   7   8   9   14  1   2
1   2   3   4   5   6   7   8   9   15  1   2
1   2   3   4   5   6   7   8   9   16  1   2
1   2   3   4   5   6   7   8   9   17  1   2
1   2   3   4   5   6   7   8   9   18  1   2

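OK, from what I understand, you have to manually iterate over every line in each file until you find a row whose first column is just a and which also contains a column apple. From there you know those are the right headers, so you start storing that row and the value rows after it in some way. The next time you see a row whose first column is just a, you know you've reached a new set of headers.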

pandas probably can't do this directly, so you'll have to do some manual string handling.

buffer = ''
with open('filename') as f:
    found_apple = False
    for row in f:
        # if a row starts with 'a,' it's a header row
        has_a = row.startswith('a,')
        if found_apple:
            # if the row is a header row, we're done with the table and should wrap up
            if has_a:
                break
            # else it's a row that should be part of our output, so store it in a buffer
            buffer += row # row will already have the \n
        elif not has_a:
            # we aren't ready to look at values, and this row isn't a header row, so skip it
            continue
        elif 'apple' in row:
            # you might have to tweak this if there are headers that *contain* 'apple' but aren't the header you're looking for
            # we've found the start of the table we want, we're ready to start storing the value rows
            found_apple = True
            buffer += row

# buffer will be the table you want, as a string
# example:

# """a,apple
# 1,2"""

# if that's all you need, you can simply output buffer

# if you wanted to do other pandas stuff with that table, you can now pass buffer to pandas
import pandas as pd
from io import StringIO
table = pd.read_csv(StringIO(buffer))  # read_csv accepts any file-like object
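
If the substring check is too loose (for example a header named pineapple would also match), a stricter variant can split each row into cells and compare them exactly. This is only a sketch: the function name `extract_table` and its parameters are made up for illustration, and it assumes the files are comma-delimited with the key column literally named a.

```python
import csv
from io import StringIO

def extract_table(lines, key="a", wanted="apple", delimiter=","):
    """Return the rows of the first table whose header has a cell equal to `wanted`."""
    table = []
    collecting = False
    for cells in csv.reader(lines, delimiter=delimiter):
        cells = [c.strip() for c in cells]
        if not cells:
            continue  # ignore blank lines
        # a header row is any row whose first cell is exactly `key`
        is_header = cells[0] == key
        if collecting:
            if is_header:
                break  # the next table starts here, so we're done
            table.append(cells)
        elif is_header and wanted in cells:
            # list membership is an exact cell match, not a substring match
            collecting = True
            table.append(cells)
    return table

sample = StringIO("a,b,c\n1,2,3\na,pineapple,apple\n4,5,6\n7,8,9\na,b\n1,2\n")
rows = extract_table(sample)
```

Here `rows[0]` is the header row and the remaining entries are the value rows. If your files are tab-separated, pass `delimiter="\t"`.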

Let me know if anything is unclear.

Edit: to loop over every file in a directory, just wrap the with block in another loop:

import os
buffer = ''
for filename in os.listdir():
    if not os.path.isfile(filename):
        continue
    with open(filename) as f:
        ...
    if buffer:  # a non-empty buffer means we found the table; compare by value, not with `is`
        break

# buffer will be the table you want, as a string
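
Note that the loop above stops at the first file that yields a table. If you want one table per file across the whole folder, something like the following might work. It is a rough sketch: the names `read_apple_table` and `collect_tables`, the `*.csv` pattern, and the `'a,'` header prefix are all assumptions on my part, not something taken from your data.

```python
from pathlib import Path

def read_apple_table(path, header_prefix="a,", wanted="apple"):
    """Return the first table in `path` whose header line contains `wanted` ('' if none)."""
    buffer = ""
    found = False
    with open(path) as f:
        for row in f:
            is_header = row.startswith(header_prefix)
            if found:
                if is_header:
                    break  # reached the next table
                buffer += row  # row already ends with \n
            elif is_header and wanted in row:
                found = True
                buffer += row
    return buffer

def collect_tables(folder, pattern="*.csv"):
    """Map each matching file name to its table text, skipping files with no match."""
    tables = {}
    for path in sorted(Path(folder).glob(pattern)):
        text = read_apple_table(path)
        if text:
            tables[path.name] = text
    return tables
```

Each value in the returned dict is a string, so you can hand it to pandas the same way as before, through `StringIO`.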