How can personal and place names be extracted from text using C#?

Is there a C# algorithm that can extract personal and place names from text?

For example, given the following text:

St. Mark died at Alexandria, in Egypt.  He was martyred, I think.
However, that has nothing to do with my legend.  About the founding of
the city of Venice--

(from Mark Twain's "The Innocents Abroad")

...is there a way to extract:

St. Mark
Alexandria (or better yet, "Alexandria, Egypt")
Venice

?

I know there is no way to reach 100% accuracy (every place and personal name captured, with no false positives added), but even 80% accuracy could be very valuable.

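I know each word could be checked against something like an encyclopedia, but there must be a better way. Also, how would an algorithm know to combine "St." and "Mark", and to treat "Alexandria, in Egypt" as "Alexandria, Egypt"?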

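You are best off using an API that performs this kind of entity matching, because what you are asking for can be very complex and requires some degree of semantic text analysis backed by a large database. I suggest looking at APIs such as: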

OpenCalais - English Semantic Metadata: Entity/Fact/Event Definitions and Descriptions web-service

Calais supports a rich set of semantic metadata, including entities, events and facts.

Alchemy API - Entity Extraction API

AlchemyAPI is capable of identifying people, companies, organizations, cities, geographic features, and other typed entities within your HTML, text, or web-based content. We employ sophisticated statistical algorithms and natural language processing technology to analyze your information, extracting the semantic richness embedded within.
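
For comparison, here is roughly what a hand-rolled heuristic looks like. This is only a minimal sketch, not what any of the services above do internally: the class name, the list of honorifics, and the stop list of sentence starters are all made up for illustration. It joins a title such as "St." to the capitalized word after it and collapses "X, in Y" to "X, Y".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class NaiveNameExtractor
{
    // Optional honorific ("St.", "Mr.", ...), then one or more capitalized
    // words, then an optional ", in Place" tail.
    private static readonly Regex Candidate = new Regex(
        @"(?:\b(?:St|Mr|Mrs|Ms|Dr)\.\s+)?\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*(?:,\s+in\s+[A-Z][a-z]+)?");

    // Assumed stop list: words that are capitalized only because they
    // start a sentence. A real system would need far more than this.
    private static readonly HashSet<string> SentenceStarters =
        new HashSet<string> { "He", "She", "It", "The", "However", "About" };

    public static List<string> Extract(string text)
    {
        var names = new List<string>();
        foreach (Match m in Candidate.Matches(text))
        {
            // Collapse "Alexandria, in Egypt" to "Alexandria, Egypt".
            string name = Regex.Replace(m.Value, @",\s+in\s+", ", ");

            // Skip sentence starters and duplicates.
            if (SentenceStarters.Contains(name) || names.Contains(name))
                continue;

            names.Add(name);
        }
        return names;
    }
}
```

Calling NaiveNameExtractor.Extract on the Twain passage from the question yields "St. Mark", "Alexandria, Egypt" and "Venice", but only because the stop list happens to cover that passage. It cannot tell a person from a place, and it will happily return any capitalized phrase, which is exactly why a trained NER model or a hosted API is the more practical route.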

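I notice the links provided here are a bit dated. The Stanford Natural Language Processing (NLP) library (https://nlp.stanford.edu/software/) is a still-active project (and is free [correction: it is GPL-licensed, so free to use, though shipping it inside closed-source software requires a separate commercial license]). You can try their Named Entity Recognition (NER) in an online demo. It even has a .NET wrapper (http://sergey-tihon.github.io/Stanford.NLP.NET/StanfordNER.html).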
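
Whichever tagger you end up with, you still have to turn its output into a list of names. As far as I know, Stanford NER can emit its results as inline XML (the Java API calls this classifyWithInlineXML), so here is a minimal sketch of reading that style of output in plain C#. The class name is hypothetical, and the sample string in the usage note is hand-written to show the format, not real tagger output.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; it does not call Stanford NER.
public static class InlineXmlEntities
{
    // Matches <TYPE>text</TYPE>, where the closing tag must repeat the
    // opening tag name (enforced by the \k<type> backreference).
    private static readonly Regex Tagged = new Regex(
        @"<(?<type>[A-Z]+)>(?<text>.*?)</\k<type>>",
        RegexOptions.Singleline);

    // Returns (entity type, entity text) pairs in document order.
    public static List<KeyValuePair<string, string>> Parse(string tagged)
    {
        var entities = new List<KeyValuePair<string, string>>();
        foreach (Match m in Tagged.Matches(tagged))
        {
            entities.Add(new KeyValuePair<string, string>(
                m.Groups["type"].Value, m.Groups["text"].Value));
        }
        return entities;
    }
}
```

For a hand-written input such as "<PERSON>Mark</PERSON> died at <LOCATION>Alexandria</LOCATION>, in <LOCATION>Egypt</LOCATION>.", Parse returns three pairs, one PERSON and two LOCATION. Merging adjacent locations into "Alexandria, Egypt" would be a separate post-processing step on top of this.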

Microsoft also offers many similar algorithms through Azure Cognitive Services. You will probably be most interested in Entity Linking (https://azure.microsoft.com/en-us/services/cognitive-services/entity-linking-intelligence-service/).

Hope this helps future viewers.