Python3 and regex: how to remove lines of numbers?

Question

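I have a long text file converted from a PDF, and I want to remove instances of certain things, for example page numbers that appear on a line by themselves, possibly surrounded by spaces. I wrote a regex that works on short strings, e.g.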

import re

news1 = 'Hello done.\n4\nNext paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news1)
print(m)
Hello done. Next paragraph.

But when I try this on a more complex string, it fails, e.g.

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
m = re.sub('\n *[0-9] *\n', ' ', news)
print(m)
1   
  Hello done.    44 
  Next paragraph.

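How can I make this work across the whole file? Should I read the file and process it line by line, rather than trying to edit the whole string at once?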

I also tried using a period to match anything, but that doesn't catch the initial '1' in the more complex string. So I suppose I could use two regexes.

m = re.sub('. *[0-9] *.', '', news)
print(m)
1   
  Hello done. 


  Next paragraph.

Any ideas?

Answer 1

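I'd suggest going line by line, unless you have some specific reason to slurp it all in as one string. Then it just takes a few regexes to clean each line up, like: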

#not sure how the pages are numbered, but perhaps...
text = re.sub(r"^\s*\d+\s*$", "", text)

#chuck a line in to strip out stuff in all caps of at least 3 letters
text = re.sub(r"[A-Z]{3,}", "", text)

#concatenate multiple whitespace to 1 space, handy to clean up the data
text = re.sub(r"\s+", " ", text)

#trim the start and end of the line
text = text.strip()
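
Applied to the sample string from the question, the line-by-line idea could be sketched roughly as follows. The helper name `drop_number_lines` is made up for illustration, and the sketch assumes a page number is any line that contains nothing but digits and whitespace:

```python
import re

# A line that holds only digits, optionally padded with whitespace.
PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")

def drop_number_lines(text):
    """Return text without digit-only lines (hypothetical helper)."""
    kept = [line for line in text.splitlines() if not PAGE_NUMBER.match(line)]
    return "\n".join(kept)

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'
print(repr(drop_number_lines(news)))
# prints '  Hello done. \n  Next paragraph.'
```

Because each line is tested separately, neighbouring number lines such as ' 4 ' and '  44 ' no longer compete for the same newline, which is what tripped up the '\n *[0-9] *\n' pattern.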

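That's just one strategy, but it's the route I'd take. Once your business side comes up with "OH OH! Can you also replace any mention of 'Cat' with 'Dog'?", I think it's also easier to maintain, and easier to troubleshoot/log your changes. You could even try using re.subn to track the changes...?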
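
A rough sketch of the re.subn idea, using the sample string from the question. Note that this variant runs on the whole string with re.MULTILINE rather than line by line; that is an assumption made here to keep the example short, not a requirement of the approach:

```python
import re

news = '1   \n  Hello done. \n 4 \n  44 \n  Next paragraph.'

# With re.MULTILINE, ^ anchors at the start of every line, so each
# number line is matched on its own, together with its trailing newline.
cleaned, count = re.subn(r"^[ \t]*\d+[ \t]*\n", "", news, flags=re.MULTILINE)

print(count)          # prints 3
print(repr(cleaned))  # prints '  Hello done. \n  Next paragraph.'
```

re.subn returns the new string together with the number of substitutions made, so the count can go straight into a log. As written, the pattern needs a trailing newline, so a number on the very last line of the file would be left in place.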

python

regex

python-3.4