Regular Expression to extract Named Entities from text just based on capitalization

Question

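I would like a regular expression in Python that extracts runs of one or more consecutive words that start with a capital letter, unless the word is the first word of a sentence. I know this is not a robust or consistent approach, but it will solve my problem, because I do not want to use any statistical method (e.g., from NLTK or StanfordNER).

Examples: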

extract('His name is John Wayne.')

should return ['John Wayne'].

extract('He is The President of Neverland.')

should return ['The President', 'Neverland'], because they are capitalized words that do not appear at the beginning of a sentence.

Another example:

extract('He came home. Although late, it was nice to have Patrick there.')

should return ['Patrick'], because 'He' and 'Although' appear at the beginning of a sentence.

It should also strip punctuation: for example, 'He was John, who came' should return 'John' rather than 'John,'.

Answer 1

You can use this expression for the task:

(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)
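
As a minimal sketch, the expression could be wrapped in the `extract` function from the question like this. It assumes Python's standard `re` module; the function body and the constant name `NAMED_ENTITY_RE` are illustrative and not part of the original answer.

```python
import re

# The pattern from the answer, compiled once for reuse.
NAMED_ENTITY_RE = re.compile(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)')


def extract(text):
    """Return runs of capitalized words that do not start a sentence."""
    # With exactly one capture group, findall returns the text of that group.
    return NAMED_ENTITY_RE.findall(text)


print(extract('His name is John Wayne.'))
print(extract('He is The President of Neverland.'))
print(extract('He came home. Although late, it was nice to have Patrick there.'))
print(extract('He was John, who came'))
```

Note that the lookbehind only treats a period followed by exactly one whitespace character as a sentence boundary, so a capitalized word after `?`, `!`, or a period followed by two spaces would still be reported.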

Regex breakdown:
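
One way to read the expression piece by piece is to write it with `re.VERBOSE`, which lets each part carry a comment. This is a sketch of the same pattern, assuming Python's `re` module; the name `NAMED_ENTITY_VERBOSE` is made up for illustration.

```python
import re

NAMED_ENTITY_VERBOSE = re.compile(r"""
    (?<!\.\s)        # not directly preceded by a period plus one whitespace character
    (?!^)            # not at the very start of the string
    \b               # begin at a word boundary
    (                # capture group 1: the whole candidate entity
      [A-Z]\w*       #   a word beginning with an ASCII uppercase letter
      (?:            #   followed by zero or more ...
        \s+[A-Z]\w*  #   ... whitespace-separated capitalized words
      )*
    )
""", re.VERBOSE)

print(NAMED_ENTITY_VERBOSE.findall('He is The President of Neverland.'))
```

Because `[A-Z]` covers only ASCII uppercase letters, names that start with an accented capital would be missed by this pattern.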
