Print first paragraph in Python

Question

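I have a text file of a book, and I need to print the first paragraph of each section. I thought that if I found the text between \n\n and \n, I would have my answer. Here is my code, but it doesn't work. Can you tell me where I went wrong?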

lines = [line.rstrip('\n') for line in open('G:\aa.txt')]

check = -1
first = 0
last = 0

for i in range(len(lines)):
    if lines[i] == "": 
            if lines[i+1]=="":
                check = 1
                first = i +2
    if i+2< len(lines):
        if lines[i+2] == "" and check == 1:
            last = i+2
while (first < last):
    print(lines[first])
    first = first + 1

I also found some code on Stack Overflow and tried it, but it just printed an empty array.

f = open("G:\aa.txt").readlines()
flag=False
for line in f:
        if line.startswith('\n\n'):
            flag=False
        if flag:
            print(line)
        elif line.strip().endswith('\n'):
            flag=True

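I've shared a sample section of the book below.

I

THE LAY OF THE LAND

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.

II

WILD ANIMAL TEMPERAMENT & INDIVIDUALITY

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The output should look like this:

There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.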

Answer 1

Go through the code you found line by line.

f = open("G:\aa.txt").readlines()
flag=False
for line in f:
        if line.startswith('\n\n'):
            flag=False
        if flag:
            print(line)
        elif line.strip().endswith('\n'):
            flag=True

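It never seems to set the flag variable to True: each line returned by readlines() contains at most one newline, at its end, so startswith('\n\n') cannot match, and strip() removes that trailing newline, so endswith('\n') is always False.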
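
A quick way to check this is to run both conditions over a small list. This is only a minimal sketch: the list below is a made-up stand-in for what readlines() would return, not the actual book.

```python
# Hypothetical stand-in for the list returned by open(...).readlines()
lines = ["TITLE\n", "\n", "\n", "First paragraph.\n", "\n"]

# Every item ends at its first newline, so none can start with two of them
starts_with_double_newline = [line.startswith("\n\n") for line in lines]

# strip() removes the trailing "\n", so this test can never succeed either
ends_with_newline_after_strip = [line.strip().endswith("\n") for line in lines]

print(any(starts_with_double_newline))     # False
print(any(ends_with_newline_after_strip))  # False
```

Since neither condition is ever true, the flag stays False and nothing is printed.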

It would also be more helpful for everyone if you could share a sample from the book.

Answer 2

If you want to group the sections, you can use itertools.groupby with empty lines as the delimiter:

from itertools import groupby
with open("in.txt") as f:
    for k, sec in groupby(f,key=lambda x: bool(x.strip())):
        if k:
            print(list(sec))
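
To see what that grouping produces without needing a file, here is a small sketch of the same idea on an in-memory list. The sample lines and the name blocks are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample standing in for the lines of a file
lines = ["TITLE\n", "\n", "\n", "First paragraph.\n", "\n", "Second paragraph.\n"]

# Consecutive lines with the same key (blank vs. non-blank) form one group;
# keep only the non-blank groups
blocks = [
    list(group)
    for is_text, group in groupby(lines, key=lambda x: bool(x.strip()))
    if is_text
]

print(blocks)
# [['TITLE\n'], ['First paragraph.\n'], ['Second paragraph.\n']]
```

Each run of blank lines acts as a single separator, so the title and every paragraph end up in their own list.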

With a bit more itertools foo, we can get the sections using the uppercase titles as delimiters:

from itertools import groupby, takewhile

with open("in.txt") as f:
    grps = groupby(f,key=lambda x: x.isupper())
    for k, sec in grps:
        # if we hit a title line
        if k: 
            # pull all paragraphs
            v = next(grps)[1]
            # skip two empty lines after title
            next(v,""), next(v,"")

            # take all lines up to next empty line/second paragraph
            print(list(takewhile(lambda x: bool(x.strip()), v)))

Which would give you:

['There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.\n']
['What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.']

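Each section starts with an all-uppercase title, so once we hit one we know that two empty lines follow, then the first paragraph, and then the pattern repeats.

Breaking it down using a loop: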

from itertools import groupby  
def parse_sec(bk):
    with open(bk) as f:
        grps = groupby(f, key=lambda x: bool(x.isupper()))
        for k, sec in grps:
            if k:
                print("First paragraph from section titled :{}".format(next(sec).rstrip()))
                v = next(grps)[1]
                next(v, ""),next(v,"")
                for line in v:
                    if not line.strip():
                        break
                    print(line)

Your text:

In [11]: cat -E in.txt

THE LAY OF THE LAND$
$
$
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.$
$
Of all the kinds of interest attaching to the study of the world's wild animals, there are none that surpass the study of their minds, their morals, and the acts that they perform as the results of their mental processes.$
$
$
WILD ANIMAL TEMPERAMENT & INDIVIDUALITY$
$
$
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The dollar signs mark the line endings, and the output is:

In [12]: parse_sec("in.txt")
First paragraph from section titled :THE LAY OF THE LAND
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.

First paragraph from section titled :WILD ANIMAL TEMPERAMENT & INDIVIDUALITY
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

Answer 3

This should work, as long as there are no paragraphs in all caps:

f = open('file.txt')

for line in f:
    line = line.strip()
    if line:
        for c in line:
            if c < 'A' or c > 'Z': # check for non-uppercase chars
                break
        else:        # means the line is made of all caps i.e. I, II, etc, meaning new section
            f.readline()  # discard chapter headers and empty lines
            f.readline()
            f.readline()
            print(f.readline().rstrip()) # print first paragraph

f.close()

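If you also want the last paragraph, you can keep track of the last line seen that contained lowercase characters. Then, as soon as you find an all-uppercase line (I, II, etc.), which indicates a new section, print that most recent line, since it will be the last paragraph of the previous section.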
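
That idea could be sketched roughly as follows. This is an illustration rather than tested code for the real book: the function name first_and_last_paragraphs and the sample lines are hypothetical, and it assumes each paragraph sits on a single line, each section opens with a line made only of the letters A-Z, and every paragraph contains at least one lowercase letter.

```python
def first_and_last_paragraphs(lines):
    """Yield one (first, last) paragraph pair per section."""
    first = last = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if all("A" <= c <= "Z" for c in line):
            # Section marker (I, II, ...): emit the previous section, if any
            if first is not None:
                yield first, last
            first = last = None
        elif any(c.islower() for c in line):
            # Paragraph line: remember the first one, keep updating the last
            if first is None:
                first = line
            last = line
        # Anything else (e.g. an all-caps title with spaces) is ignored
    if first is not None:
        yield first, last  # the final section has no marker after it


# Made-up sample in the same layout as the book excerpt
sample = [
    "I\n", "\n", "THE LAY OF THE LAND\n", "\n",
    "First para of one.\n", "\n", "Last para of one.\n", "\n",
    "II\n", "\n", "WILD ANIMAL TEMPERAMENT & INDIVIDUALITY\n", "\n",
    "Only para of two.\n",
]

pairs = list(first_and_last_paragraphs(sample))
print(pairs)
# [('First para of one.', 'Last para of one.'), ('Only para of two.', 'Only para of two.')]
```

A section with a single paragraph reports that paragraph as both first and last.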

Answer 4

There's always regex....

import re
with open("in.txt", "r") as fi:
    data = fi.read()
paras = re.findall(r"""
                   [IVXLCDM]+\n\n   # Line of Roman numeral characters
                   [^a-z]+\n\n      # Line without lower case characters
                   (.*?)\n          # First paragraph line
                   """, data, re.VERBOSE)
print("\n\n".join(paras))

Answer 5

TXR solution

$ txr firstpar.txr data
There is a vast field of fascinating human interest, lying only just outside our doors, which as yet has been but little explored. It is the Field of Animal Intelligence.
What I am trying to do here is, find the uppercase lines, and put them all in an array. Then, using the index method, I will find the first and last paragraphs of each section by comparing the indexes of these elements of this array I created.

The code in firstpar.txr:

@(repeat)
@num

@title

@firstpar
@  (require (and (< (length num) 5)
                 [some title chr-isupper]
                 (not [some title chr-islower])))
@  (do (put-line firstpar))
@(end)

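Basically, we search the input for a match of a three-element, multi-line pattern that binds the num, title, and firstpar variables. This pattern on its own can match in the wrong places, so we add some constraining heuristics with a require assertion. The section number must be a short line, and the title line must contain some uppercase letters and no lowercase ones. This expression is written in TXR Lisp.

If we get a match that satisfies this constraint, we output the string captured in the firstpar variable.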
