I have multiple CSV files. Each CSV file contains multiple tables with multiple headers. How do I get the table whose header contains a given specific value?

Question

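I have multiple CSV files (4000) in a folder. Each CSV file has data like the sample below. The data length, the number of header rows, and the number of headers per row can differ from file to file. There are multiple tables and headers, and every table starts with the same column "a". I want to get the table whose header contains "apple", together with its values.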

Input


a   b   c   d   e   f   g   h   i           
1   2   3   4   5   6   7   8   9           
a   b1  c1  d1  e1  f1  g1                  
1   2   3   4   5   6   7                   
a   b2  c2  d2  e2  f2  g2  h2  i2  k2  l2  
3   5   6   7   3   4   5   6   7   7   0   
a   b3  d3  e3  g23 t4  apple   r4  w2  r5  t6  
1   2   3   4   5   6   7   8   9   1   1   2
1   2   3   4   5   6   7   8   9   10  1   2
1   2   3   4   5   6   7   8   9   11  1   2
1   2   3   4   5   6   7   8   9   12  1   2
1   2   3   4   5   6   7   8   9   13  1   2
1   2   3   4   5   6   7   8   9   14  1   2
1   2   3   4   5   6   7   8   9   15  1   2
1   2   3   4   5   6   7   8   9   16  1   2
1   2   3   4   5   6   7   8   9   17  1   2
1   2   3   4   5   6   7   8   9   18  1   2
a   b   c   d   e   f   g   h   i           
1   2   3   4   5   6   7   8   9

Final output

a   b3  d3  e3  g23 t4  apple   r4  w2  r5  t6
1   2   3   4   5   6   7   8   9   1   1   2
1   2   3   4   5   6   7   8   9   10  1   2
1   2   3   4   5   6   7   8   9   11  1   2
1   2   3   4   5   6   7   8   9   12  1   2
1   2   3   4   5   6   7   8   9   13  1   2
1   2   3   4   5   6   7   8   9   14  1   2
1   2   3   4   5   6   7   8   9   15  1   2
1   2   3   4   5   6   7   8   9   16  1   2
1   2   3   4   5   6   7   8   9   17  1   2
1   2   3   4   5   6   7   8   9   18  1   2

Answer 1

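OK, from the information I have, you'll have to iterate manually over every row in each file until you find a row whose first column is just a and that also contains a column apple. From there you know those are the right headers, so you start storing that row and the value rows after it. The next time you see a row whose first column is just a, you know you've reached a new set of headers and the table you want is complete.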

pandas likely can't do this directly, so you'll have to do some manual string processing first.

buffer = ''
with open('filename') as f:
    found_apple = False
    for row in f:
        # if a row starts with 'a,' it's a header row
        has_a = row.startswith('a,')
        if found_apple:
            # if the row is a header row, we're done with the table and should wrap up
            if has_a:
                break
            # else it's a row that should be part of our output, so store it in a buffer
            buffer += row  # row will already have the \n
        elif not has_a:
            # we aren't ready to look at values, and this row isn't a header row, so skip it
            continue
        elif 'apple' in row:
            # you might have to tweak this if there are headers that *contain* 'apple'
            # but aren't the header you're looking for
            # we've found the start of the table we want, we're ready to start storing the value rows
            found_apple = True
            buffer += row

# buffer will be the table you want, as a string
# example:
# """a,apple
# 1,2"""
# if that's all you need, you can simply output buffer

# if you wanted to do other pandas stuff with that table, you can now pass buffer to pandas
import pandas as pd
from io import StringIO
table = pd.read_csv(StringIO(buffer))
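The substring check on apple would also match a header such as pineapple. Here is a minimal sketch of an exact-cell match using the standard csv module. It assumes the files are comma-separated and that no value row has a as its first cell. The function name extract_table and its parameters are made up for illustration.

```python
import csv
from io import StringIO


def extract_table(lines, target="apple", first_col="a"):
    """Return the rows of the first table whose header has a cell equal to target."""
    table = []
    collecting = False
    for fields in csv.reader(lines):
        if not fields:
            continue  # skip blank lines
        cells = [c.strip() for c in fields]
        is_header = cells[0] == first_col
        if collecting:
            if is_header:
                break  # the next table starts here, so we are done
            table.append(cells)
        elif is_header and target in cells:
            # list membership compares whole cells, so 'pineapple' does not match
            collecting = True
            table.append(cells)
    return table


# Small in-memory sample standing in for one file (hypothetical data).
sample = StringIO(
    "a,b,c\n"
    "1,2,3\n"
    "a,pineapple,d\n"
    "4,5,6\n"
    "a,b3,apple\n"
    "7,8,9\n"
    "1,2,3\n"
    "a,b,c\n"
    "0,0,0\n"
)
rows = extract_table(sample)
print(rows)
```

Any iterable of lines works as input, so an open file object can be passed in place of the StringIO sample.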

Let me know if anything is unclear.

Edit: to loop over every file in a directory, just wrap the with block in another loop:

import os

buffer = ''
for filename in os.listdir():
    if not os.path.isfile(filename):
        continue
    with open(filename) as f:
        ...
    # stop at the first file that produced a table (compare with !=, not "is not")
    if buffer != '':
        break

# buffer will be the table you want, as a string
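The loop above stops at the first file that yields a table. If you'd rather collect the matching table from every file, something like the following sketch might work. The helper names table_from_file and tables_from_folder are hypothetical, the .csv extension filter is an assumption, and the demo folder is a throwaway created only so the example runs on its own.

```python
import os
import tempfile


def table_from_file(path, target="apple"):
    """Return the first table in path whose header line mentions target, as a string."""
    buffer = ""
    found = False
    with open(path) as f:
        for row in f:
            has_a = row.startswith("a,")  # header rows start with 'a,'
            if found:
                if has_a:
                    break  # next header reached, table is complete
                buffer += row
            elif has_a and target in row:
                found = True
                buffer += row
    return buffer


def tables_from_folder(folder, target="apple"):
    """Map each CSV file name in folder to its matching table; files without one are skipped."""
    results = {}
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)  # listdir returns bare names, not full paths
        if not os.path.isfile(path) or not name.endswith(".csv"):
            continue
        table = table_from_file(path, target)
        if table:
            results[name] = table
    return results


# Demo on a temporary folder with two small files (made-up data).
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "one.csv"), "w") as f:
        f.write("a,b\n1,2\na,apple\n3,4\na,c\n5,6\n")
    with open(os.path.join(folder, "two.csv"), "w") as f:
        f.write("a,b\n1,2\n")
    results = tables_from_folder(folder)

print(results)
```

Joining the folder and file name matters as soon as the folder is not the current working directory, because os.listdir only returns the names.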


python

parsing

datatables

pandas