Categorizing words in paragraphs into groups and assigning weights to them based on listed order

Question

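I have a paragraph containing the names of up to 18 different industries. The names are separated by semicolons. The order in which they appear also matters for determining their magnitude, so that order has to be assigned to each name as a weight. The list can be divided into 3 categories:

1. Industries reporting growth.
2. Industries reporting contraction.
3. Industries reporting no change.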

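Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.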

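The paragraph above is a sample. What is the best way to split the text into the 3 categories (2 in this case) and assign values to the names based on the order in which they are listed? There is a pattern in the text: each list of names starts after a ":" and ends with a ".". Sometimes the industries reporting contraction are listed first, followed by those reporting growth. How can I handle that when automating this?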

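The values assigned depend on the number of industries in each category. Industries reporting growth get positive values, running from the number of industries in that category down to 1. Industries reporting no change take 0 as the default value, and industries reporting contraction get negative values whose magnitude runs from that category's count down to 1 (so the last one is -1). The categories are then put together and sorted to get a single list in descending order (+ve, 0, -ve). I am still at an early stage of learning to program, so please bear with me. Even a suggestion for a strategy would help me a long way.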

Answer 1

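Here is code that works for the sample you provided, but I can't guarantee it will work for every sample you have (especially since you didn't provide one with a "no change" list). The main idea is to use regular expressions (import re) to look specifically for the terms 'growth', 'no change', and 'contraction', and then grab the list of industries that follows each one. Next, each of the three categories is run through a list comprehension to attach the relevant score, so that every list entry becomes a (name, value) tuple. Finally, the three categories are combined into one list, sorted by value (index 1 of each tuple), and printed. Note that this approach will not work if the exact word 'growth' is not used, for example if 'increase' appears in its place.

Code: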

import re

sample = 'Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.'

# Find the growth industries (raw strings keep '\.' a regex escape, not a string escape)
growth_pattern = r'growth.*?:(.*?)\.'
growths = re.findall(growth_pattern, sample)
growths = growths[0].strip().split(';') if len(growths) == 1 else []

# Find the no change industries
nochange_pattern = r'no change.*?:(.*?)\.'
nochanges = re.findall(nochange_pattern, sample)
nochanges = nochanges[0].strip().split(';') if len(nochanges) == 1 else []

# Find the contraction industries
contraction_pattern = r'contraction.*?:(.*?)\.'
contractions = re.findall(contraction_pattern, sample)
contractions = contractions[0].strip().split(';') if len(contractions) == 1 else []

# Give numbers to each of the industries.
# Only a leading "and " is removed, so names that contain "and " elsewhere stay intact.
growths = [(re.sub(r'^and\s+', '', g.strip()), len(growths)-i) for i, g in enumerate(growths)]
nochanges = [(re.sub(r'^and\s+', '', nc.strip()), 0) for nc in nochanges]
contractions = [(re.sub(r'^and\s+', '', c.strip()), -(len(contractions)-i)) for i, c in enumerate(contractions)]

#Print them out to check (commented out for now)
#print('growths:'+str(growths))
#print('nochanges:'+str(nochanges))
#print('contractions:'+str(contractions))

#Combine them all together, sort by value, and print out
all_together = growths+nochanges+contractions
all_together = sorted(all_together,key=lambda x: -x[1])
print(all_together)

Output:

[('Plastics & Rubber Products', 12), ('Miscellaneous Manufacturing', 11), ('Apparel, Leather & Allied Products', 10), ('Paper Products', 9), ('Chemical Products', 8), ('Transportation Equipment', 7), ('Food, Beverage & Tobacco Products', 6), ('Machinery', 5), ('Petroleum & Coal Products', 4), ('Primary Metals', 3), ('Fabricated Metal Products', 2), ('Computer & Electronic Products', 1), ('Printing & Related Support Activities', -1), ('Electrical Equipment, Appliances & Components', -2), ('Furniture & Related Products', -3), ('Wood Products', -4), ('Nonmetallic Mineral Products', -5)]
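
If the wording varies or the contraction list comes first, one possible variation is to look at every "intro: names." chunk and classify it by the keywords found in its intro, rather than searching for each category separately. This is only a rough sketch: the function name rank_industries, the KEYWORDS synonym lists, and the example text are all made up for illustration, and it still assumes that no industry name contains a period or a colon.

```python
import re

# Hypothetical synonym lists; extend them to match the wording of your reports.
KEYWORDS = {
    'no change': ('no change', 'unchanged'),
    'growth': ('growth', 'increase', 'expansion'),
    'contraction': ('contraction', 'decrease', 'decline'),
}
# Categories are checked in this fixed order.
CATEGORY_ORDER = ('no change', 'growth', 'contraction')
SIGN = {'growth': 1, 'no change': 0, 'contraction': -1}


def classify(intro):
    """Return the first category with a keyword in the intro, or None."""
    intro = intro.lower()
    for category in CATEGORY_ORDER:
        if any(keyword in intro for keyword in KEYWORDS[category]):
            return category
    return None


def rank_industries(text):
    """Return (name, value) tuples sorted from highest to lowest value."""
    scored = []
    # Each list is assumed to look like "<intro>: <name>; <name>; and <name>."
    for intro, names in re.findall(r'([^.:]*):([^.]*)\.', text):
        category = classify(intro)
        if category is None:
            continue  # a colon-delimited chunk that matches no category
        items = [re.sub(r'^and\s+', '', n.strip())
                 for n in names.split(';') if n.strip()]
        count = len(items)
        for i, name in enumerate(items):
            # First listed gets the largest magnitude, last listed gets 1.
            scored.append((name, SIGN[category] * (count - i)))
    return sorted(scored, key=lambda pair: -pair[1])


# Made-up text with the contraction list first and different wording.
example = ('The two industries reporting a decrease are: Wood Products; '
           'and Furniture. Three industries reported an increase in the '
           'following order: Machinery; Paper Products; and Primary Metals. '
           'The industry reporting no change is: Textile Mills.')
print(rank_industries(example))
```

Because each chunk is classified on its own, the order in which the lists appear in the paragraph no longer matters. The keyword matching is a plain substring test, so a short keyword can match inside a longer unrelated word; check the lists against your real data.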

python

string

text-mining

python-2.7