I have multiple CSV files. Each CSV file contains multiple tables with multiple headers. How do I get the table whose header contains a specific value?
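I have multiple CSV files (4000) in one folder. Each CSV file contains data like the sample below. The data length, the number of header rows, and the number of columns in each header may differ from file to file. There are multiple tables and headers, and every table starts with the same column "a". I want to get the table whose header contains "apple", together with its values.
Input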
a b c d e f g h i
1 2 3 4 5 6 7 8 9
a b1 c1 d1 e1 f1 g1
1 2 3 4 5 6 7
a b2 c2 d2 e2 f2 g2 h2 i2 k2 l2
3 5 6 7 3 4 5 6 7 7 0
a b3 d3 e3 g23 t4 apple r4 w2 r5 t6
1 2 3 4 5 6 7 8 9 1 1 2
1 2 3 4 5 6 7 8 9 10 1 2
1 2 3 4 5 6 7 8 9 11 1 2
1 2 3 4 5 6 7 8 9 12 1 2
1 2 3 4 5 6 7 8 9 13 1 2
1 2 3 4 5 6 7 8 9 14 1 2
1 2 3 4 5 6 7 8 9 15 1 2
1 2 3 4 5 6 7 8 9 16 1 2
1 2 3 4 5 6 7 8 9 17 1 2
1 2 3 4 5 6 7 8 9 18 1 2
a b c d e f g h i
1 2 3 4 5 6 7 8 9
Final output
a b3 d3 e3 g23 t4 apple r4 w2 r5 t6
1 2 3 4 5 6 7 8 9 1 1 2
1 2 3 4 5 6 7 8 9 10 1 2
1 2 3 4 5 6 7 8 9 11 1 2
1 2 3 4 5 6 7 8 9 12 1 2
1 2 3 4 5 6 7 8 9 13 1 2
1 2 3 4 5 6 7 8 9 14 1 2
1 2 3 4 5 6 7 8 9 15 1 2
1 2 3 4 5 6 7 8 9 16 1 2
1 2 3 4 5 6 7 8 9 17 1 2
1 2 3 4 5 6 7 8 9 18 1 2
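OK, from the information I have, you'll need to iterate manually over every row in each file until you find a row whose first column is just a and that also contains a column apple. From there you know those are the right headers, so you start storing that row and the value rows after it. The next time you see a row whose first column is just a, you know you've reached a new set of headers.
pandas probably can't do this directly, so you'll have to do some manual string processing.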
buffer = ''
with open('filename') as f:
    found_apple = False
    for row in f:
        # if a row starts with 'a,' it's a header row
        has_a = row.startswith('a,')
        if found_apple:
            # if the row is a header row, we're done with the table and should wrap up
            if has_a:
                break
            # else it's a row that should be part of our output, so store it in a buffer
            buffer += row  # row will already have the \n
        elif not has_a:
            # we aren't ready to look at values, and this row isn't a header row, so skip it
            continue
        elif 'apple' in row:
            # you might have to tweak this if there are headers that *contain* 'apple' but aren't the header you're looking for
            # we've found the start of the table we want, we're ready to start storing the value rows
            found_apple = True
            buffer += row
# buffer will be the table you want, as a string
# example:
# """a,apple
# 1,2"""
# if that's all you need, you can simply output buffer
# if you wanted to do other pandas stuff with that table, you can now pass buffer to pandas
import pandas as pd
from io import StringIO
table = pd.read_csv(StringIO(buffer))
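The substring check 'apple' in row also matches a header such as pineapple. Here is a minimal sketch of a stricter variant that compares whole header cells, assuming the files are comma-separated as in the code above. The function name extract_table and its parameters are hypothetical, chosen only for illustration:

```python
import csv
from io import StringIO


def extract_table(lines, target="apple", first_col="a"):
    """Return the rows (header first) of the first table whose header
    has a cell exactly equal to target."""
    table = []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        is_header = row[0] == first_col
        if table:
            if is_header:
                break  # a new table starts here, so ours is complete
            table.append(row)
        elif is_header and target in row:
            # list membership compares whole cells, so 'pineapple' does not match
            table.append(row)
    return table


# Small in-memory sample; a real call would pass an open file object instead
sample = StringIO("a,b,c\n1,2,3\na,pineapple\n4,5\na,x,apple\n6,7,8\n9,10,11\na,b\n0,0\n")
rows = extract_table(sample)
```

The result is a list of rows rather than a string, so if the header and the value rows have the same number of columns it could be passed to pandas as pd.DataFrame(rows[1:], columns=rows[0]).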
Let me know if anything is unclear.
Edit: to loop over every file in a directory, just wrap the with block in another loop:
import os
buffer = ''
for filename in os.listdir():
    if not os.path.isfile(filename):
        continue
    with open(filename) as f:
        ...
    # stop at the first file that produced a table
    if buffer != '':
        break
# buffer will be the table you want, as a string
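The loop above stops at the first file that yields a table. If a table is wanted from each of the files, one possible variation is to collect the results per file. This is only a sketch under assumptions: the helper names table_from_file and tables_from_folder, the *.csv pattern, and the demo file names are all hypothetical, and the header test is the same 'a,' prefix and substring check used above.

```python
import tempfile
from pathlib import Path


def table_from_file(path, target="apple", header_prefix="a,"):
    """Return the first table in path whose header line contains target, as a string."""
    buffer = ""
    with open(path) as f:
        for row in f:
            is_header = row.startswith(header_prefix)
            if buffer:
                if is_header:
                    break  # next table reached, stop reading
                buffer += row  # value row of the table we want
            elif is_header and target in row:
                buffer += row  # header row of the table we want
    return buffer


def tables_from_folder(folder, target="apple"):
    """Map each CSV file name in folder to its matching table, skipping files without one."""
    found = {}
    for path in sorted(Path(folder).glob("*.csv")):
        table = table_from_file(path, target)
        if table:
            found[path.name] = table
    return found


# Demo on two throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "one.csv").write_text("a,b\n1,2\na,apple\n3,4\n5,6\na,c\n7,8\n")
    Path(folder, "two.csv").write_text("a,b\n1,2\n")
    result = tables_from_folder(folder)
```

Each value in the returned dictionary is a string in the same form as buffer, so it can be written out directly or passed to pandas through StringIO.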