Is there a way to manipulate numbered paragraphs in Python to remove certain paragraphs which do not fall in order?

Question

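I have a string of text containing numbered paragraphs running from '1.' to '221.'. However, some paragraphs are out of sequence and I want to remove them. The data looks like this: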

text = """1. Shares of Paras Defence and Space Technologies gained 2.85 times. 
2. The company, engaged in manufacturing and testing of defence and space engineering products. 
"3. Its stock ended at Rs 499 versus issue price of Rs 175 per share.
42. On July 23, Zomato NSE 0.00 % Ltd. listed on the Indian stock exchanges.
43. That was exactly a week after the food-delivery and restaurant discovery platform's initial public offering went live. 
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 
5. It surpassed the previous record of Salasar Technologies’ IPO. 
14. NBFCs are betting big time on the IPO. 
6. Paras Defence is one of the few players having an edge in defence deals."""

From the text above, I want to remove the paragraphs that are out of sequence, i.e. '42.', '43.' and '14.'.

Desired output:

relevant_text = '1. Shares of Paras Defence and Space Technologies gained 2.85 times. 
2. The company, engaged in manufacturing and testing of defence and space engineering products. 
3. Its stock ended at Rs 499 versus issue price of Rs 175 per share. 
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 
5. It surpassed the previous record of Salasar Technologies’ IPO. 
6. Paras Defence is one of the few players having an edge in defence deals.'

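I tried matching a pattern but don't know how to proceed. I'm also not sure the regex pattern is correct, because it matches '1.', '2.' and so on, but not '3.'. This is what I came up with: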

import re

text_sequence = []

pattern = re.compile(r'(\s|["])[0-9]{1,3}\.\s')
matches = pattern.finditer(text)


for match in matches:
  for r in range(1, 999):
    if str(r) in match.group():
      text_sequence.append(match.span())
      text_sequence.append(match.group())

print(text_sequence)

Is there a way to get the desired output?

P.S.: The matches I get from this code are duplicated.

Answer 1

You can do it like this:

text = """1. Shares of Paras Defence and Space Technologies gained 2.85 times. 
2. The company, engaged in manufacturing and testing of defence and space engineering products. 
3. Its stock ended at Rs 499 versus issue price of Rs 175 per share.
42. On July 23, Zomato NSE 0.00 % Ltd. listed on the Indian stock exchanges.
43. That was exactly a week after the food-delivery and restaurant discovery platform's initial public offering went live. 
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 
5. It surpassed the previous record of Salasar Technologies’ IPO. 
14. NBFCs are betting big time on the IPO. 
6. Paras Defence is one of the few players having an edge in defence deals."""
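lines = text.split("\n")
output = ""
i = 0  # number of the last paragraph kept so far
for line in lines:
    # Keep the line only if it starts with the next expected number
    if line.startswith("{}. ".format(i + 1)):
        output += line + "\n"
        i += 1

print(output)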

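This works provided you remove the stray " at the start of line 3. If you can guarantee that a paragraph number followed by a dot never appears elsewhere in the string, you could also consider using "in" instead of "startswith".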
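
If editing the input by hand is not an option, a variation on the loop above can tolerate stray leading characters such as the " before 3. The following is only a minimal sketch and not part of the original answer: the helper name keep_in_sequence, the made-up sample string, and the choice to skip any leading non-word characters are all assumptions made for illustration.

```python
import re


def keep_in_sequence(text):
    """Keep only the lines whose leading number continues the sequence 1, 2, 3, ..."""
    # Optional non-word characters (such as a stray quote), then the number,
    # a dot, and whitespace. This prefix handling is an assumption.
    leading_number = re.compile(r'^\W*(\d+)\.\s')
    expected = 1
    kept = []
    for line in text.splitlines():
        match = leading_number.match(line)
        if match and int(match.group(1)) == expected:
            # Drop the stray prefix so the kept line starts at its number.
            kept.append(line[match.start(1):].rstrip())
            expected += 1
    return "\n".join(kept)


# Hypothetical sample, shaped like the data in the question.
sample = '1. First.\n2. Second.\n"3. Third.\n42. Unrelated.\n4. Fourth.\n14. Unrelated.\n5. Fifth.'
print(keep_in_sequence(sample))
```

Like the loop above, this stops keeping lines once an expected number is missing from the text entirely, so it assumes every number in the sequence appears at least once.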

Answer 2

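If you can match all of these bullet points with a single pattern such as

(?s)((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)

(see this regex demo), and assuming the points you want to keep appear in ascending order, the problem can be solved with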

import re
pattern = r"((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)"
text = "1. Shares of Paras Defence and Space Technologies gained 2.85 times. 2. The company, engaged in manufacturing and testing of defence and space engineering products. \"3. Its stock ended at Rs 499 versus issue price of Rs 175 per share. 42. On July 23, Zomato NSE 0.00 % Ltd. listed on the Indian stock exchanges. 43. That was exactly a week after the food-delivery and restaurant discovery platform's initial public offering went live. 4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore. 5. It surpassed the previous record of Salasar Technologies’ IPO. 14. NBFCs are betting big time on the IPO. 6. Paras Defence is one of the few players having an edge in defence deals."
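result = []
idx = 1  # next expected paragraph number
for sent, num in re.findall(pattern, text, re.S):
    # Keep the paragraph only if its number is the next one in the sequence
    if int(num) == idx:
        result.append(sent)
        idx += 1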

print("\n".join(result))

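See this Python demo. The regex matches:

((\d+)\. .*?) - Group 1: one or more digits (captured into Group 2), a dot, a space, and then any zero or more characters, as few as possible
[^\w!?.…]* - zero or more characters other than word characters and sentence-final punctuation
(?=\d+\. |\Z) - a positive lookahead that requires either one or more digits followed by a dot and a space, or (|) the end of the string (\Z), immediately to the right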
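
To see what the two capture groups hold, the pattern can be run on a short string. This is only an illustrative sketch: the sample text below is made up for the purpose and is not data from the question.

```python
import re

pattern = r"((\d+)\. .*?)[^\w!?.…]*(?=\d+\. |\Z)"

# Hypothetical sample with a stray quote before "2." and an out-of-sequence "7."
sample = '1. First point. "2. Second point. 7. Stray point.'

# With two capture groups, re.findall returns (group 1, group 2) tuples.
matches = re.findall(pattern, sample, re.S)
for sent, num in matches:
    print(repr(num), repr(sent))
```

Each tuple pairs the full paragraph text with its number as a string, which is why the loop in the answer converts num with int() before comparing it to idx. The stray quote is consumed by the [^\w!?.…]* part, so it does not end up in either group.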

Output:

1. Shares of Paras Defence and Space Technologies gained 2.85 times.
2. The company, engaged in manufacturing and testing of defence and space engineering products.
3. Its stock ended at Rs 499 versus issue price of Rs 175 per share.
4. Paras Defence’s IPO, which closed on September 23, had generated bids worth Rs 38,021 crore.
5. It surpassed the previous record of Salasar Technologies’ IPO.
6. Paras Defence is one of the few players having an edge in defence deals.

Note: this can be adapted to the case where the bullet points are not in ascending order if you sort the matches by num first.

python

regex

nlp