Operating on a file with a very specific format
I have been trying to write the following function:
def track(filepath, n1, n2)
The function is meant to operate on a file with the following format:
-BORDER-
text
-BORDER-
text
-BORDER-
text
-BORDER-
How can I make the function operate on this file path, or more precisely on the text between each pair of borders?
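To retrieve the text from your file, you can do the following:
with open("/your/path/to/file", 'r') as f:
    # Keep every line that does not contain the border marker.
    text_list = [line for line in f if 'BORDER' not in line]
text_list will then contain all the text lines you are looking for. If needed, you can use .strip() to remove the trailing newline from each line.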
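As a minimal sketch of the .strip() variant, the snippet below runs the same comprehension over an in-memory stand-in for the file. The sample contents and the use of io.StringIO are my own assumptions for illustration. It also compares the whole stripped line against the marker, so that a line of ordinary text that merely contains the word BORDER is not thrown away, which the substring test above would do.

```python
import io

# Stand-in for an open file (assumed sample data), so the sketch runs without touching disk.
sample = io.StringIO("-BORDER-\ntext\n-BORDER-\nmore text\n-BORDER-\n")

# Keep every non-border line and drop its surrounding whitespace, including the newline.
text_list = [line.strip() for line in sample if line.strip() != "-BORDER-"]
print(text_list)
```

Note that this gives one flat list of lines; it does not tell you which block a line came from.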
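Write a generator that counts the border lines it has seen so far, and use groupby to separate the blocks:
from itertools import groupby
from operator import itemgetter

BORDER = '-BORDER-'  # border marker, as in the question's file

def count_border(lines, border):
    # Tag each text line with the number of borders seen before it.
    cnt = 0
    for line in lines:
        if line.strip() == border:
            cnt += 1
        else:
            yield cnt, line

with open('file') as lines:
    # Lines that share a border count belong to the same block.
    # itemgetter(0) replaces the tuple-unpacking lambda, which is Python 2 only.
    for _, block in groupby(count_border(lines, BORDER), key=itemgetter(0)):
        block = [line for _, line in block]
        print(block)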
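To make the tagging step easier to see, here is a small sketch that feeds the same kind of generator an in-memory list instead of a file. The sample lines are assumed, and the double border in the middle is there on purpose to show that the count simply skips a number.

```python
from itertools import groupby
from operator import itemgetter

def count_border(lines, border):
    # Yield (borders_seen_so_far, line) for every non-border line.
    cnt = 0
    for line in lines:
        if line.strip() == border:
            cnt += 1
        else:
            yield cnt, line

# Assumed sample data; note the two consecutive borders before "c".
lines = ["-BORDER-\n", "a\n", "b\n", "-BORDER-\n", "-BORDER-\n", "c\n", "-BORDER-\n"]

pairs = list(count_border(lines, "-BORDER-"))
print(pairs)

# Group on the count (first element of each pair) to rebuild the blocks.
blocks = [[line.strip() for _, line in block]
          for _, block in groupby(pairs, key=itemgetter(0))]
print(blocks)
```

The pairs carry the counts 1, 1 and 3, so "a" and "b" end up in one block and "c" in another; the empty block between the two adjacent borders does not appear at all.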
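The following approach reads in your file and gives you a list of the non-border lines for each block:
from itertools import groupby

with open('input.txt') as f_input:
    # k is True for runs of text lines and False for runs of border lines.
    for k, g in groupby(f_input, lambda x: not x.startswith('-BORDER-')):
        if k:
            print([line.strip() for line in g])
So if your input file is: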
-BORDER-
text
-BORDER-
text
-BORDER-
this is some text
with words
on different lines
-BORDER-
it will display the following output:
['text']
['text']
['this is some text', 'with words', 'on different lines']
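This works by reading your file line by line and using Python's groupby function to group consecutive lines that give the same result for a given test. In this case the test is whether the line starts with -BORDER-. For each run of consecutive lines with the same test result, groupby returns a pair: k is the result of the test and g is the group of matching lines. Because the test is negated, a result of True means the lines in the group do not start with -BORDER-.
Next, as each of your lines ends with a newline, a list comprehension is used to strip it from each line returned.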
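If it helps to see what groupby hands back before the border groups are filtered out, here is a small sketch over an in-memory list. The sample lines and the helper name is_text are assumptions of mine, not part of the code above.

```python
from itertools import groupby

# Assumed sample data standing in for the lines of the file.
lines = ["-BORDER-\n", "a\n", "b\n", "-BORDER-\n", "c\n", "-BORDER-\n"]

def is_text(line):
    # Same test as the lambda above: True for anything that is not a border line.
    return not line.startswith("-BORDER-")

# Materialise every run, border runs included, to show the alternation of keys.
runs = [(k, [line.strip() for line in g]) for k, g in groupby(lines, is_text)]
for k, group in runs:
    print(k, group)
```

The keys alternate False, True, False, True, False, which is why keeping only the groups where k is True leaves exactly the text blocks. Two borders in a row would fall into a single False run, so no empty block is produced for them.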
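If you want to count the words (assuming they are separated by whitespace), you could do the following:
from itertools import groupby

with open('input.txt') as f_input:
    for k, g in groupby(f_input, lambda x: not x.startswith('-BORDER-')):
        if k:
            # g is a one-shot iterator, so store the lines before using them twice.
            lines = list(g)
            word_count = sum(len(line.split()) for line in lines)
            print("{} words in {}".format(word_count, lines))
Giving you: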
1 words in ['text\n']
1 words in ['text\n']
9 words in ['this is some text\n', 'with words \n', 'on different lines\n']
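To connect this back to the question, the same idea can be wrapped in a function that takes the file path. This is only a sketch: read_blocks and its border parameter are hypothetical names of mine, and since the question does not say what n1 and n2 are for, they are left out. The temporary file is just there to make the example runnable.

```python
import os
import tempfile
from itertools import groupby

def read_blocks(filepath, border="-BORDER-"):
    """Return a list of blocks; each block is a list of stripped text lines."""
    with open(filepath) as f_input:
        return [
            [line.strip() for line in group]
            for is_text, group in groupby(
                f_input, lambda line: not line.startswith(border))
            if is_text  # skip the runs of border lines
        ]

# Write assumed sample data to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("-BORDER-\ntext\n-BORDER-\nthis is some text\nwith words\n-BORDER-\n")
    path = tmp.name

blocks = read_blocks(path)
os.remove(path)
print(blocks)
```

A function like track(filepath, n1, n2) could then call read_blocks(filepath) and work on the returned blocks by index, or loop over them.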