Extracting text between two strings/titles

Question

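I have a list of titles, and I need to extract the text between them. However, the titles do not appear in a fixed order (sometimes title 1 shows up where title 3 would be, and so on). How should I handle the extraction in this case?

Example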

Biography

text

text

Place of Birth

Text

Text

Life Style

text

text

Marriage

Text

Text

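If all the titles appeared in a fixed order, I could use the code below, but in my case the order is not fixed and keeps changing from one input file to the next.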

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    copy = False
    for line in infile:
        if line.strip() == "Biography":
            copy = True
        elif line.strip() == "Place of Birth":
            copy = False
        elif copy:
            outfile.write(line)

Answer 1

Provided all the titles are known in advance, it is enough to change this line of your original code:

elif line.strip() == "Place of Birth":

to this:

elif line.strip() in ["Place of Birth", "Life Style", "Marriage", ...]:
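Put together, the change could look like the sketch below. It is only an illustration: the function name extract_section, the TITLES set, and the sample text are assumptions, and unlike the original loop it drops blank lines and returns a list instead of writing to a file.

```python
def extract_section(lines, wanted, all_titles):
    """Return the non-empty lines that follow `wanted` up to the next known title."""
    collected = []
    copy = False
    for line in lines:
        stripped = line.strip()
        if stripped == wanted:
            copy = True            # start copying after the wanted title
        elif stripped in all_titles:
            copy = False           # any other known title ends the section
        elif copy and stripped:
            collected.append(stripped)
    return collected


# Hypothetical title set and input; the titles may come in any order.
TITLES = {"Biography", "Place of Birth", "Life Style", "Marriage"}
sample = """Life Style
text a
Biography
text b

text c
Marriage
text d
""".splitlines()

print(extract_section(sample, "Biography", TITLES))
```

Because the stop condition checks membership in the full title set, it does not matter which title happens to follow the one you want.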

Answer 2

This assumes that every title starts with an uppercase letter and that the body lines do not.

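with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        line = line.strip()
        if not line:
            continue  # skip blank lines; line[0] would raise IndexError
        # A line starting with an uppercase letter is treated as a title,
        # so only the remaining lines are copied.
        copy = not line[0].isupper()

        if copy:
            outfile.write(line + "\n")  # strip() removed the newline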

Answer 3

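If you want to extract the data for only some of the titles and skip the rest, set copy to True for the titles you want and set copy to False for every other title (make sure you match all of the titles).

Example: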

if title in titles_to_save:      # titles whose data you want to keep
    copy = True
elif title in titles_to_skip:    # every other title
    copy = False
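As a rough, self-contained sketch of this idea, the loop below groups the text under each wanted title in a dictionary. The names collect_sections, keep_titles and skip_titles, as well as the sample input, are assumptions made for illustration; blank lines are dropped.

```python
def collect_sections(lines, keep_titles, skip_titles):
    """Map each wanted title to the non-empty lines that follow it."""
    sections = {}
    current = None                      # title currently being copied, if any
    for line in lines:
        stripped = line.strip()
        if stripped in keep_titles:
            current = stripped          # copy = True for this title
            sections.setdefault(current, [])
        elif stripped in skip_titles:
            current = None              # copy = False
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


# Hypothetical input with the titles in an arbitrary order.
sample = """Marriage
text m
Biography
text b1

text b2
Place of Birth
text p
Life Style
text l
""".splitlines()

result = collect_sections(
    sample,
    keep_titles={"Biography", "Life Style"},
    skip_titles={"Place of Birth", "Marriage"},
)
print(result)
```

Any title missing from both sets would be treated as ordinary text, which is why every title needs to be matched by one of them.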

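To save the data with the titles as column names and the text under each title as records, you can first store each column and its data as one row inside another list, then transpose that list with list(zip(*lst)), where lst is your list. You can then use numpy to create an array from this list and save the data to a csv file with , as the delimiter.

Sample code: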

import numpy
lst = []
with open('path/to/input') as infile:
    copy = False
    for line in infile:
        if line.strip() in ["Biography"]:
            copy = True
            lst.append([line.strip()])
        elif line.strip() in ["Place of Birth"]:
            copy = False
        elif copy:
            lst[-1].append(line.strip())

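# Transpose so that each title becomes a column header
# (zip stops at the shortest column)
lst = list(zip(*lst))
n = numpy.array(lst)
# The function is savetxt; fmt="%s" is needed for string data
numpy.savetxt("foo.csv", n, delimiter=",", fmt="%s")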
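One caveat with the transpose step: zip(*lst) stops at the shortest column, so lines are lost when sections have different lengths. A possible alternative, sketched here with only the standard library (itertools.zip_longest and the csv module instead of numpy), pads the shorter columns with empty strings. The helper name columns_to_rows and the sample columns are assumptions, and the output goes to an in-memory buffer rather than foo.csv to keep the example self-contained.

```python
import csv
import io
from itertools import zip_longest


def columns_to_rows(columns):
    """Transpose a list of columns, padding short columns with empty strings."""
    return [list(row) for row in zip_longest(*columns, fillvalue="")]


# Hypothetical columns: the first item of each is the title.
columns = [
    ["Biography", "text 1", "text 2", "text 3"],
    ["Marriage", "text 4"],
]
rows = columns_to_rows(columns)

# Write to an in-memory buffer; an open file object works the same way.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
print(buffer.getvalue())
```

The csv module also quotes fields that contain commas, which matters when the extracted text is free-form prose.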

Tags: python, text, extraction