Using a dictionary as regex in Python

Question

I have a Python question and would appreciate some help.

Let's start with the important part. Here is my current code:

import re #for regex
import numpy as np #for matrix

f1 = open('file-to-analyze.txt','r') #file to analyze

#convert files of words into arrays. 
#These words are used to be matched against in the "file-to-analyze"
math = open('sample_math.txt','r')
matharray = list(math.read().split())
math.close()

logic = open('sample_logic.txt','r')
logicarray = list(logic.read().split())
logic.close()

priv = open ('sample_priv.txt','r')
privarray = list(priv.read().split())
priv.close()

... Read in 5 more files and make associated arrays

#convert arrays into dictionaries
math_dict = dict()
math_dict.update(dict.fromkeys(matharray,0))

logic_dict = dict()
logic_dict.update(dict.fromkeys(logicarray,1))

...Make more dictionaries from the arrays (8 total dictionaries - the same number as there are arrays)

#create big dictionary of all keys
word_set = dict(math_dict.items() + logic_dict.items() + priv_dict.items() ... )

statelist = list()

for line in f1:
     for word in word_set:
         for m in re.finditer(word, line):
            print word.value()

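The goal of the program is to take a large text file and analyze it. Essentially, I want the program to loop through the text file, match the words that appear in the Python dictionaries, associate each one with a category, and keep track of the result in a list.

For example, say I am parsing the file and run across the word "ADD". ADD is listed under the "math", or "0", category of words. The program should then record in a list that it ran across a 0 category, and continue parsing the file. Essentially this generates a large list that looks like [0,4,6,7,4,3,4,1,2,7,1,2,2,2,4...], with each number corresponding to a word of a particular state or category as described above. For ease of understanding, we will call this large list 'statelist'.

As you can see from my code, so far I can take the file to analyze as input, read the text files containing the word lists into arrays, and from there store them in dictionaries with the correct corresponding list value (a numeric value from 1 - 7). However, I am having trouble with the analysis portion.

As you can tell from my code, I am trying to go line by line through the text file and run a regex for each of the dictionary words. This is done with a loop and a regex over an additional, 9th dictionary that is more or less a "super" dictionary to help simplify the parsing.

However, I am having trouble matching all the words in the file and, once a word is found, mapping it to the dictionary value rather than the key. That is, when the program runs across "ADD", it should add 0 to the list, because ADD is part of the 0 or "math" category.

Could someone help me figure out how to write this script? I would really appreciate it! Sorry for the long post, but the code needs a lot of explanation so that you know what is going on. Thank you so much for your help!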

Answer 1

The simplest change to your existing code is to keep track of both the word and the category in the loop:

for line in f1:
    for word, category in word_set.iteritems():
        for m in re.finditer(word, line):
            print word, category
            statelist.append(category)
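
The loop above uses Python 2 syntax (iteritems and the print statement) and runs one regex search per word per line, so within a line the categories are appended in dictionary order rather than in the order the words appear. As a rough sketch of an alternative, and not something taken from the question, the words can be joined into a single compiled pattern and each match looked up in the dictionary. This version assumes Python 3, assumes whole-word matching is wanted (the \b anchors), and uses made-up sample words and lines in place of the real files; build_pattern and categorize are hypothetical helper names.

```python
import re

# Hypothetical sample data standing in for the word-list files in the question.
categories = {
    "ADD": 0, "SUB": 0,   # "math" words
    "AND": 1, "OR": 1,    # "logic" words
}

def build_pattern(words):
    """Compile one alternation that matches any of the words as a whole word."""
    # re.escape keeps each word literal even if it contains regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    return re.compile(r"\b(?:" + alternation + r")\b")

def categorize(lines, word_to_category):
    """Return the category of every matched word, in order of appearance."""
    if not word_to_category:
        return []  # an empty alternation would match empty strings
    pattern = build_pattern(word_to_category)
    statelist = []
    for line in lines:
        for m in pattern.finditer(line):
            # m.group(0) is the matched word; the dictionary gives its category.
            statelist.append(word_to_category[m.group(0)])
    return statelist

# Assumed sample input; "ADDX" is not matched because of the \b anchors.
sample_lines = ["ADD r1 AND r2", "OR r3 SUB r4 ADDX"]
statelist = categorize(sample_lines, categories)
print(statelist)
```

If substring matches are actually wanted, as in the original loop, the \b anchors can be dropped, though overlapping words would then need more care.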
