Complex regular expression getting less than expected

Question

I am fiddling with a regular expression in Python 2.7 to capture numbered footnotes in a text. My text, converted from a PDF, looks like this:

test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim 
Participation, in which it decided that the victims “may, through their legal 

1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

 8/117 
representatives, participate in the present appeal proceedings for the purpose of 
presenting their views and concerns in respect of their personal interests in the issues 
on appeal”.3

8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the 
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The 
Prosecutor filed a confidential redacted version of the Document in Support of the 
Appeal on 22 March 2013, and a public redacted version of the Document in Support 
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of 
the Appeal, the Prosecutor’s entire third ground of appeal was redacted. 

"""

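Note that the numbered paragraphs are regular content of my text; they are prefixed by a number and a dot (like "5."). Ideally, I would like to get something like: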

[(1, "The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. "), (2, "A more detailed procedural history is set out in Annex 2 of this judgment.")]

My Python code to get the footnotes is:

regex = ur"""
(\r?\n)(?P<num>\d+)(?!\.) #first line
(?P<text>(?:\s(.|\r?\n)+?\s?(?:\n\n|\Z))) #following lines
"""
result = re.findall(regex, test_str, re.U|re.VERBOSE | re.X |re.MULTILINE)

This gives me:

[(u'\n', u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n\n', u'.')]

i.e. only the first footnote, whereas I need both, of course.

Any ideas are welcome!

Answer 1

You can use this regex to split the data into two parts, the first being the number and the second the paragraph text:

(?s)(\d+)\n +(.*?)\s*(?=\d+\n)

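Explanation:

(?s) --> makes the dot match newlines, which we need here
(\d+) --> matches one or more digits and puts them in group 1
\n + --> matches a newline; the " +" then consumes the leading spaces, which should not go into the second capture group
(.*?) --> this group captures the intended text and puts it in group 2
\s* --> consumes any trailing whitespace that should not go into the captured text
(?=\d+\n) --> lookahead that marks the point where capturing the intended text stops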
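
The same pattern can also be written in verbose mode, so that each piece carries its own comment. This is only a sketch: the name FOOTNOTE_RE and the short sample string are made up for illustration. Because re.VERBOSE ignores literal whitespace inside the pattern, the space before the + is written as [ ].

```python
import re

# Hypothetical name; the pattern is the one-liner above, spread out with comments.
FOOTNOTE_RE = re.compile(r"""
    (\d+)       # group 1: the footnote number
    \n[ ]+      # newline, then the leading spaces (kept out of group 2)
    (.*?)       # group 2: the footnote text, as short as possible
    \s*         # trailing whitespace, kept out of group 2
    (?=\d+\n)   # stop just before the next run of digits followed by a newline
""", re.DOTALL | re.VERBOSE)

# Made-up sample in the same layout as the PDF text.
sample = u"1\n first note \n2\n second note \n3\n"
pairs = FOOTNOTE_RE.findall(sample)
print(pairs)
```

re.DOTALL here plays the role of the inline (?s) flag in the one-line version.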


Here is a modified version of your code:

import re

test_str = u"""
7. On 6 March 2013, the Appeals Chamber filed the Decision on Victim 
Participation, in which it decided that the victims “may, through their legal 

1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A

 8/117 
representatives, participate in the present appeal proceedings for the purpose of 
presenting their views and concerns in respect of their personal interests in the issues 
on appeal”.
3

8. On 19 March 2013, the Prosecutor filed, confidentially, ex parte, available to the 
Prosecutor and Mr Ngudjolo only, the Document in Support of the Appeal. The 
Prosecutor filed a confidential redacted version of the Document in Support of the 
Appeal on 22 March 2013, and a public redacted version of the Document in Support 
of the Appeal on 3 April 2013. In the redacted version of the Document in Support of 
the Appeal, the Prosecutor’s entire third ground of appeal was redacted. 

"""

result = re.findall(r'(?s)(\d+)\n +(.*?)\s*(?=\d+\n)', test_str)

print(result)

It gives the following output, as you expected:

[('1', 'The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1.'), ('2', 'A more detailed procedural history is set out in Annex 2 of this judgment. \nICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A\n\n 8/117 \nrepresentatives, participate in the present appeal proceedings for the purpose of \npresenting their views and concerns in respect of their personal interests in the issues \non appeal”.')]

Answer 2

I believe this regex, (^\d+(?!\.).*?)(?=^\s*\d), works as you describe.


Python demo:

>>> import re
>>> print ''.join(re.findall(r'(^\d+(?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S))
1
 The full citation, including the ICC registration reference of all designations and abbreviations used in 
this judgment are included in Annex 1. 
2
 A more detailed procedural history is set out in Annex 2 of this judgment. 
ICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A
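
One caveat worth checking against real data: \d+(?!\.) can backtrack. For a paragraph number with two or more digits, such as "10.", the engine gives up the last digit and matches "1", because "1" is followed by "0" rather than by a dot. A possible tightening, offered as a sketch and not as part of the answer above, is to forbid a following digit as well, with (?![\d.]). The sample text below is made up.

```python
import re

# Made-up sample: one footnote followed by a numbered paragraph "10.".
sample = u"1\n a footnote \n10. A numbered paragraph\n"

# Lookahead from the answer: "10." is wrongly picked up as footnote "1".
loose = re.findall(r'^\d+(?!\.)', sample, flags=re.M)

# Stricter sketch: the number may be followed by neither a digit nor a dot.
strict = re.findall(r'^\d+(?![\d.])', sample, flags=re.M)

print(loose)
print(strict)
```

With the sample text from the question the difference does not show, since its paragraph numbers (7 and 8) have a single digit.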

If you want to capture the footnote number separately from the text:

>>> re.findall(r'^(\d+)((?!\.).*?)(?=^\s*\d)', test_str, flags=re.M|re.S)
[(u'1', u'\n The full citation, including the ICC registration reference of all designations and abbreviations used in \nthis judgment are included in Annex 1. \n'), (u'2', u'\n A more detailed procedural history is set out in Annex 2 of this judgment. \nICC-01/04-02/12-271-Corr  07-04-2015  7/117  EK  A\n')]
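
To get closer to the (1, "...") shape asked for in the question, the tuples can be post-processed. A minimal sketch, assuming that line breaks inside a footnote are only PDF wrapping and can be collapsed to single spaces; to_footnotes is a hypothetical helper name, and the input is an abbreviated sample modeled on the result above.

```python
import re

def to_footnotes(matches):
    """Turn (number, raw_text) string pairs into (int, cleaned_text) pairs."""
    cleaned = []
    for num, text in matches:
        # Collapse every run of whitespace (including newlines) to one space,
        # then drop the leading and trailing space.
        cleaned.append((int(num), re.sub(r'\s+', ' ', text).strip()))
    return cleaned

# Abbreviated sample in the shape returned by re.findall above.
raw = [(u'1', u'\n The full citation \nthis judgment are included in Annex 1. \n'),
       (u'2', u'\n A more detailed procedural history. \n')]
footnotes = to_footnotes(raw)
print(footnotes)
```

This does not remove the page footer line ("ICC-01/04-02/12-271-Corr ...") that ends up in footnote 2; filtering that out would need a rule specific to the document's footer format.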


python

regex

text-mining