Parsing city of origin / destination city from a string
I have a pandas dataframe where one column is a bunch of strings with certain travel details. My goal is to parse each string to extract the city of origin and destination city (I would like to eventually have two new columns titled 'origin' and 'destination').
The data:
df_col = [
'new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
This should result in:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
What I have tried so far:
A variety of NLTK methods, but what gets me closest is using the nltk.pos_tag method to tag each word in the string. The result is a list of tuples with each word and its associated tag. Here's an example...
[('Fly', 'NNP'), ('to', 'TO'), ('Australia', 'NNP'), ('&', 'CC'), ('New', 'NNP'), ('Zealand', 'NNP'), ('from', 'IN'), ('Paris', 'NNP'), ('from', 'IN'), ('€422', 'NNP'), ('return', 'NN'), ('including', 'VBG'), ('2', 'CD'), ('checked', 'VBD'), ('bags', 'NNS'), ('!', '.')]
I am stuck at this stage and am unsure how best to implement this. Can anyone point me in the right direction? Thanks
TL;DR
It looks nearly impossible at first glance, unless you have access to some API that contains pretty sophisticated components.
In long
At second glance, it seems like you're asking for magic to solve a natural language problem. But let's break it down and scope it to something that is buildable.
First, to identify countries and cities, you need data that enumerates them, so let's try: https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json
And at the top of the search results, we find https://datahub.io/core/world-cities that leads to the world-cities.json file. Now we load them into sets of countries and cities.
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
Now, given the data, let's try to build COMPONENT ONE:
- Task: Detect whether any substring in a text matches a city/country.
- Tool: https://github.com/vi3k6i5/flashtext (fast string search/match)
- Metric: No. of correctly identified cities/countries in a string
Let's put it all together.
import requests
import json
from flashtext import KeywordProcessor
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries = set([city['country'] for city in cities_json])
cities = set([city['name'] for city in cities_json])
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Hey, what happened?!
Doing due diligence, the first hunch is that "new york" is not in the data,
>>> "New York" in cities
False
What?! #$%^&* For sanity's sake, we check these:
>>> len(countries)
244
>>> len(cities)
21940
Yes, you cannot trust a single data source, so let's try to fetch them all.
From https://www.google.com/search?q=list+of+countries+and+cities+in+the+world+json, you find another link https://github.com/dr5hn/countries-states-cities-database .
import requests
import json
cities_url = "https://pkgstore.datahub.io/core/world-cities/world-cities_json/data/5b3dd46ad10990bca47b04b4739a02ba/world-cities_json.json"
cities1_json = json.loads(requests.get(cities_url).content.decode('utf8'))
countries1 = set([city['country'] for city in cities1_json])
cities1 = set([city['name'] for city in cities1_json])
dr5hn_cities_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/cities.json"
dr5hn_countries_url = "https://raw.githubusercontent.com/dr5hn/countries-states-cities-database/master/countries.json"
cities2_json = json.loads(requests.get(dr5hn_cities_url).content.decode('utf8'))
countries2_json = json.loads(requests.get(dr5hn_countries_url).content.decode('utf8'))
countries2 = set([c['name'] for c in countries2_json])
cities2 = set([c['name'] for c in cities2_json])
countries = countries2.union(countries1)
cities = cities2.union(cities1)
Since we're neurotic, we do sanity checks.
>>> len(countries)
282
>>> len(cities)
127793
Woah, that's a lot more cities than before.
Let's retry the flashtext code.
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
keyword_processor.extract_keywords(texts[0])
[out]:
['York', 'Venice', 'Italy']
Seriously?! No New York?! $%^&*
Okay, for more sanity checks, let's look for "york" in the list of cities.
>>> [c for c in cities if 'york' in c.lower()]
['Yorklyn',
'West York',
'West New York',
'Yorktown Heights',
'East Riding of Yorkshire',
'Yorke Peninsula',
'Yorke Hill',
'Yorktown',
'Jefferson Valley-Yorktown',
'New York Mills',
'City of York',
'Yorkville',
'Yorkton',
'New York County',
'East York',
'East New York',
'York Castle',
'York County',
'Yorketown',
'New York City',
'York Beach',
'Yorkshire',
'North Yorkshire',
'Yorkeys Knob',
'York',
'York Town',
'York Harbor',
'North York']
Eureka! It's because it's called "New York City" and not "New York"!
You: What kind of prank is this?!
Linguist: Welcome to the world of natural language processing, where natural language is a social construct, subject to communal and dialectal variation.
You: Cut the crap, tell me how to solve it.
NLP practitioner (a real one who works on noisy user-generated text): You just have to add to the list. But before that, check your metric against the list you already have.
For every text in your sample "test set", you should provide some truth labels to make sure you can "measure your metric".
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris'))]
# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0
for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)
    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)
    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
Actually, it doesn't look that bad. We get an accuracy of 90%:
>>> true_positives / total_truth
0.9
But I %^&*(-ing want 100% extraction!!
Alright, alright, look at the "only" error the approach above made; it's simply that "New York" isn't in the list of cities.
You: Why don't we just add "New York" to the list of cities, i.e.
keyword_processor.add_keyword('New York')
print(texts[0])
print(keyword_processor.extract_keywords(texts[0]))
[out]:
['New York', 'Venice', 'Italy']
You: See, I did it!!! Now I deserve a beer.
Linguist: How about 'I live in Marawi'?
>>> keyword_processor.extract_keywords('I live in Marawi')
[]
NLP practitioner (chiming in): How about 'I live in Jeju'?
>>> keyword_processor.extract_keywords('I live in Jeju')
[]
A Raymond Hettinger fan (from afar): "There must be a better way!"
Yes, what if we just try something silly, like adding the city keywords that end with "City" to our keyword_processor?
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
            print(c[:-5])
It works!
Now let's retry our regression test examples:
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
texts_labels = [('new york to venice, italy for usd271', ('New York', 'Venice', 'Italy')),
('return flights from brussels to bangkok with etihad from €407', ('Brussels', 'Bangkok')),
('from los angeles to guadalajara, mexico for usd191', ('Los Angeles', 'Guadalajara')),
('fly to australia new zealand from paris from €422 return including 2 checked bags', ('Australia', 'New Zealand', 'Paris')),
('I live in Florida', ('Florida',)),
('I live in Marawi', ('Marawi',)),
('I live in jeju', ('Jeju',))]
# No. of correctly extracted terms.
true_positives = 0
false_positives = 0
total_truth = 0
for text, label in texts_labels:
    extracted = keyword_processor.extract_keywords(text)
    # We're making some assumptions here that the order of
    # extracted and the truth must be the same.
    true_positives += sum(1 for e, l in zip_longest(extracted, label) if e == l)
    false_positives += sum(1 for e, l in zip_longest(extracted, label) if e != l)
    total_truth += len(label)
    # Just visualization candies.
    print(text)
    print(extracted)
    print(label)
    print()
[out]:
new york to venice, italy for usd271
['New York', 'Venice', 'Italy']
('New York', 'Venice', 'Italy')
return flights from brussels to bangkok with etihad from €407
['Brussels', 'Bangkok']
('Brussels', 'Bangkok')
from los angeles to guadalajara, mexico for usd191
['Los Angeles', 'Guadalajara', 'Mexico']
('Los Angeles', 'Guadalajara')
fly to australia new zealand from paris from €422 return including 2 checked bags
['Australia', 'New Zealand', 'Paris']
('Australia', 'New Zealand', 'Paris')
I live in Florida
['Florida']
('Florida',)
I live in Marawi
['Marawi']
('Marawi',)
I live in jeju
['Jeju']
('Jeju',)
100% Yay, NLP-bunga!!!
But seriously, this is only the tip of the problem's iceberg. What happens if you have a sentence like this:
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
['Adam', 'Bangkok', 'Singapore', 'China']
Why is Adam extracted as a city?!
Then you do some more neurotic checks:
>>> 'Adam' in cities
True
Congratulations, you've jumped into yet another NLP rabbit hole: polysemy, where the same word has different meanings. In this case, Adam most probably refers to a person in the sentence but coincidentally is also the name of a city (according to the data you've pulled).
I see what you did there... Even if we ignore this polysemy nonsense, you're still not giving me the desired output:
[in]:
['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags'
]
[out]:
Origin: New York, USA; Destination: Venice, Italy
Origin: Brussels, BEL; Destination: Bangkok, Thailand
Origin: Los Angeles, USA; Destination: Guadalajara, Mexico
Origin: Paris, France; Destination: Australia / New Zealand (this is a complicated case given two countries)
Linguist: Even assuming that the preposition before a city (e.g. from, to) gives you the "origin" / "destination" label, how are you going to handle the case of "multi-leg" flights, e.g.
>>> keyword_processor.extract_keywords('Adam flew to Bangkok from Singapore and then to China')
What's the desired output of this sentence:
> Adam flew to Bangkok from Singapore and then to China
Perhaps something like this? What's the spec? How (un)structured is your input text?
> Origin: Singapore
> Departure: Bangkok
> Departure: China
Let's try to build COMPONENT TWO to detect prepositions.
Let's assume what you have above and try some modifications to the same flashtext approach. What if we add to and from to the list?
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    print(extracted)
    print()
[out]:
new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']
return flights from brussels to bangkok with etihad from €407
['from', 'Brussels', 'to', 'Bangkok', 'from']
from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
fly to australia new zealand from paris from €422 return including 2 checked bags
['to', 'Australia', 'New Zealand', 'from', 'Paris', 'from']
Hey, that rule of using to/from is pretty crappy:
- What if the "from" is referring to the price of the ticket?
- What if there's no "to/from" preceding the country/city?
Okay, let's work with the output above and see what we can do about problem 1. Maybe check whether the term after the from is a city; if it isn't, remove the to/from?
from itertools import zip_longest
from flashtext import KeywordProcessor
keyword_processor = KeywordProcessor(case_sensitive=False)
keyword_processor.add_keywords_from_list(sorted(countries))
keyword_processor.add_keywords_from_list(sorted(cities))
for c in cities:
    if 'city' in c.lower() and c.endswith('City') and c[:-5] not in cities:
        if c[:-5].strip():
            keyword_processor.add_keyword(c[:-5])
keyword_processor.add_keyword('to')
keyword_processor.add_keyword('from')
texts = ['new york to venice, italy for usd271',
'return flights from brussels to bangkok with etihad from €407',
'from los angeles to guadalajara, mexico for usd191',
'fly to australia new zealand from paris from €422 return including 2 checked bags']
for text in texts:
    extracted = keyword_processor.extract_keywords(text)
    print(text)
    new_extracted = []
    extracted_next = extracted[1:]
    for e_i, e_iplus1 in zip_longest(extracted, extracted_next):
        if e_i == 'from' and e_iplus1 not in cities and e_iplus1 not in countries:
            print(e_i, e_iplus1)
            continue
        elif e_i == 'from' and e_iplus1 is None:  # last word in the list.
            continue
        else:
            new_extracted.append(e_i)
    print(new_extracted)
    print()
That seems to do the trick, removing the from that doesn't precede a city/country.
[out]:
new york to venice, italy for usd271
['New York', 'to', 'Venice', 'Italy']
return flights from brussels to bangkok with etihad from €407
from None
['from', 'Brussels', 'to', 'Bangkok']
from los angeles to guadalajara, mexico for usd191
['from', 'Los Angeles', 'to', 'Guadalajara', 'Mexico']
fly to australia new zealand from paris from €422 return including 2 checked bags
from None
['to', 'Australia', 'New Zealand', 'from', 'Paris']
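Given cleaned sequences like the ones above, a naive pairing rule could turn them into origin/destination labels. This is just a sketch under the same "preposition governs the next place" assumption the Linguist is about to poke holes in; it ignores multi-leg itineraries and places with no preceding preposition:

```python
# Hypothetical labeller: the place right after 'from' is an origin,
# the place right after 'to' is a destination.
def label_places(extracted):
    """Map a flashtext-style token sequence to origin/destination lists."""
    labels = {'origin': [], 'destination': []}
    for i, token in enumerate(extracted[:-1]):
        nxt = extracted[i + 1]
        if token == 'from' and nxt not in ('to', 'from'):
            labels['origin'].append(nxt)
        elif token == 'to' and nxt not in ('to', 'from'):
            labels['destination'].append(nxt)
    return labels

print(label_places(['from', 'Brussels', 'to', 'Bangkok']))
# → {'origin': ['Brussels'], 'destination': ['Bangkok']}
```

Note how ['New York', 'to', 'Venice', 'Italy'] would leave 'New York' unlabelled, since it has no preceding 'from'.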
But "from New York" is still not solved!!
Linguist: Think about it carefully: should ambiguity be resolved by making an informed decision that makes the ambiguous phrase obvious? If so, what is the "information" in that informed decision? Should it first follow a certain template to detect the information before filling in the ambiguity?
You: I'm running out of patience with you... You're leading me around in circles. Where's that AI that can understand human language that I keep hearing about from the news and from Google and Facebook and all?!
You: Everything you've given me is rule-based. Where's the AI?
NLP practitioner: Didn't you want 100%? Writing "business logics" or rule-based systems is the only way to really achieve that "100%" on a given dataset, without any preset dataset that can be used for "training an AI".
You: What do you mean by "training an AI"? Why can't I just use Google or Facebook or Amazon or Microsoft or even IBM's AI?
NLP practitioner: Let me introduce you to:
- https://learning.oreilly.com/library/view/data-science-from/9781492041122/
- https://allennlp.org/tutorials
- https://www.aclweb.org/anthology/
Welcome to the world of computational linguistics and NLP!
In short
Yes, there's really no ready-made magical solution, and if you want to use an "AI" or machine learning algorithm, most probably you'll need a lot more training data, like the texts_labels pairs shown in the example above.
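To close the loop on the original pandas question: however you end up implementing the extraction, wiring its results into the two requested columns is the easy part. A minimal sketch, where the (origin, destination) pairs are hand-written stand-ins for the output of a real pipeline built from the components above:

```python
import pandas as pd

# Hand-written pairs standing in for a real extraction pipeline's output.
pairs = [('New York', 'Venice'), ('Brussels', 'Bangkok')]

df = pd.DataFrame({'text': ['new york to venice, italy for usd271',
                            'return flights from brussels to bangkok with etihad from €407']})
# zip(*pairs) transposes the list of row tuples into two column tuples.
df['origin'], df['destination'] = zip(*pairs)
print(df[['origin', 'destination']])
```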