Categorizing words in a paragraph into groups and assigning weights to them based on listed order
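I have a paragraph containing the names of up to 18 different industries. The names are separated by semicolons. The order in which they appear also matters, because it determines their magnitude, so it has to be assigned to each name as a weight. The list can be divided into 3 categories:
1. Industries reporting growth.
2. Industries reporting contraction.
3. Industries reporting no change.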
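Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.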
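The paragraph above is an example. What is the best way to split the text into the 3 categories (2 in this case) and assign values to the names based on the order in which they are listed? There is a pattern in the text: each list of names starts after ':' and ends with '.'.
Sometimes the industries reporting contraction are listed first, followed by those reporting growth. How can I handle that when automating this?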
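The assigned values depend on the number of industries in each category. Industries reporting growth get positive values, from n (the number of industries in that category) down to 1 in listed order. Industries with no change get 0 by default, and industries reporting contraction get negative values, with magnitudes again running from n down to 1 (that is, values from -n to -1). The categories are then combined and sorted in descending order (+ve, 0, -ve). I am still at an early stage of programming, so please bear with me. Even a suggestion for a strategy would go a long way.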
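Here is code that works for the example you provided, but I can't guarantee it will work for every example you have (especially since you didn't provide one with no change). The main idea is to use regular expressions (import re) to look specifically for the terms 'growth', 'no change' and 'contraction', and then grab the list of industries that follows each one. Next, each of the three categories goes through a list comprehension to get the relevant score, so that each list entry becomes a tuple of (industry, value). Finally, the three categories are combined into one list, sorted by value (index 1 of each tuple), and printed. Note that this will not work if the exact word 'growth' is not used, for example if 'increase' appears in its place.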
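To see the keyword-anchored regular expression on its own before reading the full code, here is a minimal sketch. The two-sentence text and the placeholder names A to E are made up for illustration, and the contraction sentence is deliberately placed first to show that the listing order of the categories should not matter.

```python
import re

# Hypothetical, shortened text; contraction is listed before growth on purpose.
text = ('The two industries reporting contraction are: D; and E. '
        'Three reported growth in the following order: A; B; and C.')

# The keyword anchors the match; the lazy group captures everything
# between the next colon and the next period.
growth = re.findall(r'growth.*?:(.*?)\.', text)
contraction = re.findall(r'contraction.*?:(.*?)\.', text)

print(growth)       # [' A; B; and C']
print(contraction)  # [' D; and E']
```

Because each pattern searches for its own keyword, the result does not depend on which sentence comes first.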
Code:
import re
sample = 'Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics & Rubber Products; Miscellaneous Manufacturing; Apparel, Leather & Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage & Tobacco Products; Machinery; Petroleum & Coal Products; Primary Metals; Fabricated Metal Products; and Computer & Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture & Related Products; Electrical Equipment, Appliances & Components; and Printing & Related Support Activities.'
# Find the growth industries (raw strings keep the regex escapes intact)
growth_pattern = r'growth.*?:(.*?)\.'
growths = re.findall(growth_pattern, sample)
growths = growths[0].strip().split(';') if len(growths) == 1 else []
# Find the no change industries
nochange_pattern = r'no change.*?:(.*?)\.'
nochanges = re.findall(nochange_pattern, sample)
nochanges = nochanges[0].strip().split(';') if len(nochanges) == 1 else []
# Find the contraction industries
contraction_pattern = r'contraction.*?:(.*?)\.'
contractions = re.findall(contraction_pattern, sample)
contractions = contractions[0].strip().split(';') if len(contractions) == 1 else []
# Give numbers to each of the industries.
# Only a leading 'and ' is removed, so names that contain 'and' stay intact.
growths = [(re.sub(r'^and\s+', '', g.strip()), len(growths) - i) for i, g in enumerate(growths)]
nochanges = [(re.sub(r'^and\s+', '', nc.strip()), 0) for nc in nochanges]
contractions = [(re.sub(r'^and\s+', '', c.strip()), -(len(contractions) - i)) for i, c in enumerate(contractions)]
#Print them out to check (commented out for now)
#print('growths:'+str(growths))
#print('nochanges:'+str(nochanges))
#print('contractions:'+str(contractions))
#Combine them all together, sort by value, and print out
all_together = growths+nochanges+contractions
all_together = sorted(all_together,key=lambda x: -x[1])
print(all_together)
Output:
[('Plastics & Rubber Products', 12), ('Miscellaneous Manufacturing', 11), ('Apparel, Leather & Allied Products', 10), ('Paper Products', 9), ('Chemical Products', 8), ('Transportation Equipment', 7), ('Food, Beverage & Tobacco Products', 6), ('Machinery', 5), ('Petroleum & Coal Products', 4), ('Primary Metals', 3), ('Fabricated Metal Products', 2), ('Computer & Electronic Products', 1), ('Printing & Related Support Activities', -1), ('Electrical Equipment, Appliances & Components', -2), ('Furniture & Related Products', -3), ('Wood Products', -4), ('Nonmetallic Mineral Products', -5)]
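As noted, the approach depends on the exact keywords. One possible way to loosen that is sketched below, under some assumptions: the synonym lists in KEYWORDS are guesses and would need to match the wording of your actual reports, the helper names extract and score are made up for this sketch, and the industry names in the example text are partly invented. The pattern also requires the keyword and the colon to sit in the same sentence, which is stricter than the code above.

```python
import re

# Assumed synonym lists; extend them to match the wording of your reports.
KEYWORDS = {
    'growth': r'growth|increase|expansion',
    'nochange': r'no change|unchanged',
    'contraction': r'contraction|decrease|decline',
}

def extract(text, keyword_regex):
    """Return the names listed after the first sentence containing a keyword."""
    # [^:.]* keeps the keyword and the colon inside the same sentence.
    pattern = '(?:' + keyword_regex + r')[^:.]*:(.*?)\.'
    match = re.search(pattern, text, re.IGNORECASE)
    if match is None:
        return []
    names = [n.strip() for n in match.group(1).split(';')]
    # Drop only a leading 'and', so names containing 'and' stay intact.
    return [re.sub(r'^and\s+', '', n) for n in names if n]

def score(text):
    """Return (name, value) pairs sorted from highest to lowest value."""
    growths = extract(text, KEYWORDS['growth'])
    nochanges = extract(text, KEYWORDS['nochange'])
    contractions = extract(text, KEYWORDS['contraction'])
    scored = [(g, len(growths) - i) for i, g in enumerate(growths)]
    scored += [(n, 0) for n in nochanges]
    scored += [(c, -(len(contractions) - i)) for i, c in enumerate(contractions)]
    return sorted(scored, key=lambda pair: -pair[1])

# Hypothetical text: contraction listed first, synonyms instead of exact terms.
example = ('The two industries reporting a decrease are: Wood Products; and Hand Tools. '
           'Three reported an increase in the following order: Machinery; '
           'Paper Products; and Sand and Gravel.')
ranked = score(example)
print(ranked)
```

A category that is not mentioned simply yields an empty list, so the same function covers paragraphs with two or three categories.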