如何删除 nltk 中斜杠前的 POS 标签?
How can I remove POS tags before slashes in nltk?
这是我项目的一部分,我需要像这样表示短语检测后的输出 - (a,x,b),其中 a、x、b 是短语。我构建了代码并得到了这样的输出:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
我想让它像以前的表示一样,这意味着我必须删除 'CLAUSE'、'NP'、'VP'、'VBD'、'NNP'等标签。
怎么做?
我试过的
首先将其写入文本文件,标记化并使用 list.remove('word')
。但这根本没有帮助。
我再澄清一点。
我的输入
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
输出将是
[Jack,loved,Peter], [Jack,stayed,in London]
输出只是根据大括号,没有标签。
我正在尝试这样的事情:
import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'
tmp = re.split(r'[()/ ]', tmp)
#Use 're.split()' to split by character that was not a letter.
>>> ['', 'CLAUSE', '', 'NP', 'Jack', 'NNP', '', '', 'VP', 'loved', 'VBD', '', '', 'NP', 'Peter', 'NNP', '', '']
result = (tmp[4], tmp[9], tmp[14])
>>> ('Jack', 'loved', 'Peter')
这是你想要的吗?
编辑:
我应该考虑清楚:(.
import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'
tmp = re.sub(r'[()]', '', tmp)
>>> 'CLAUSE NP Jack/NNP VP loved/VBD NP Peter/NNP'
result = re.findall(r'[a-zA-Z]*/', tmp)
>>> ['Jack/', 'loved/', 'Peter/']
#Now create a generator.
gen = (i[:-1] for i in result)
tuple(gen)
>>> ('Jack', 'loved', 'Peter')
>>> import re
>>> clause = "(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))"
>>> pattern = r'\w+:?(?=\/)'
>>> re.findall(pattern, clause)
['Jack', 'loved', 'Peter']
已编辑
对于多个子句:
>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP school/NN) (VP is/VBZ situated/VBN) (NP in/IN London/NNP)) (CLAUSE (NP The/DT color/NN of/IN the/DT sky/NN) (VP is/VBZ) (NP pink/NN))"""
>>> [re.findall(pattern, clause) for clause in clauses.split(' (CLAUSE ')]
[['school', 'is', 'situated', 'in', 'London'], ['The', 'color', 'of', 'the', 'sky', 'is', 'pink']]
如果子句用换行符分隔:
>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
... (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
... (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))"""
>>> [re.findall(pattern, clause) for clause in clauses.split('\n')]
[['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
将输出连接成一个字符串:
>>> " ".join(['Jack', 'loved', 'Peter'])
'Jack loved Peter'
>>> clauses = [['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
>>> [" ".join(cl) for cl in clauses]
['Jack loved Peter', 'Jack stayed in London', 'Tom is in Kolkata']
既然你标记了这个 nltk
,让我们使用 NLTK 的树解析器来处理你的树。我们将阅读每棵树,然后简单地打印出树叶。完成。
>>> text ="(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())
['Jack', 'stayed', 'in', 'London']
lambda 形式拆分每个 word/tag
对并丢弃标签,只保留单词。
多棵树
我知道,您会问我如何处理整个文件的此类树,其中一些需要不止一行。那是 NLTK 的 BracketParseCorpusReader
的工作,但它希望终端采用 (POS word)
的形式而不是 word/POS
。我不会那样做,因为更容易欺骗 Tree.fromstring()
读取所有树,就好像它们是一棵树的分支一样:
allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT "+ allmytext + " )" # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:
print(tree.leaves())
如您所见,唯一的区别是我们在文件内容周围添加了 "(ROOT "
和 " )"
,并使用 for 循环生成输出。循环为我们提供了顶级节点的子节点,即实际的树。
当输出是这些时:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
很明显它们是树。所以只需使用 Tree.leaves()。
这是完整的代码:
def leaves(self):
"""
Return the leaves of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.leaves()
['the', 'dog', 'chased', 'the', 'cat']
:return: a list containing this tree's leaves.
The order reflects the order of the
leaves in the tree's hierarchical structure.
:rtype: list
"""
leaves = []
for child in self:
if isinstance(child, Tree):
leaves.extend(child.leaves())
else:
leaves.append(child)
return leaves
这是我项目的一部分,我需要像这样表示短语检测后的输出 - (a,x,b),其中 a、x、b 是短语。我构建了代码并得到了这样的输出:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
我想让它像以前的表示一样,这意味着我必须删除 'CLAUSE'、'NP'、'VP'、'VBD'、'NNP'等标签。
怎么做?
我试过的
首先将其写入文本文件,标记化并使用 list.remove('word')
。但这根本没有帮助。
我再澄清一点。
我的输入
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
输出将是
[Jack,loved,Peter], [Jack,stayed,in London] 输出只是根据大括号,没有标签。
我正在尝试这样的事情:
import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'
tmp = re.split(r'[()/ ]', tmp)
#Use 're.split()' to split by character that was not a letter.
>>> ['', 'CLAUSE', '', 'NP', 'Jack', 'NNP', '', '', 'VP', 'loved', 'VBD', '', '', 'NP', 'Peter', 'NNP', '', '']
result = (tmp[4], tmp[9], tmp[14])
>>> ('Jack', 'loved', 'Peter')
这是你想要的吗?
编辑:
我应该考虑清楚:(.
import re
tmp = '(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))'
tmp = re.sub(r'[()]', '', tmp)
>>> 'CLAUSE NP Jack/NNP VP loved/VBD NP Peter/NNP'
result = re.findall(r'[a-zA-Z]*/', tmp)
>>> ['Jack/', 'loved/', 'Peter/']
#Now create a generator.
gen = (i[:-1] for i in result)
tuple(gen)
>>> ('Jack', 'loved', 'Peter')
>>> import re
>>> clause = "(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))"
>>> pattern = r'\w+:?(?=\/)'
>>> re.findall(pattern, clause)
['Jack', 'loved', 'Peter']
已编辑
对于多个子句:
>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP school/NN) (VP is/VBZ situated/VBN) (NP in/IN London/NNP)) (CLAUSE (NP The/DT color/NN of/IN the/DT sky/NN) (VP is/VBZ) (NP pink/NN))"""
>>> [re.findall(pattern, clause) for clause in clauses.split(' (CLAUSE ')]
[['school', 'is', 'situated', 'in', 'London'], ['The', 'color', 'of', 'the', 'sky', 'is', 'pink']]
如果子句用换行符分隔:
>>> import re
>>> pattern = r'\w+:?(?=\/)'
>>> clauses = """(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
... (CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
... (CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))"""
>>> [re.findall(pattern, clause) for clause in clauses.split('\n')]
[['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
将输出连接成一个字符串:
>>> " ".join(['Jack', 'loved', 'Peter'])
'Jack loved Peter'
>>> clauses = [['Jack', 'loved', 'Peter'], ['Jack', 'stayed', 'in', 'London'], ['Tom', 'is', 'in', 'Kolkata']]
>>> [" ".join(cl) for cl in clauses]
['Jack loved Peter', 'Jack stayed in London', 'Tom is in Kolkata']
既然你标记了这个 nltk
,让我们使用 NLTK 的树解析器来处理你的树。我们将阅读每棵树,然后简单地打印出树叶。完成。
>>> text ="(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))"
>>> tree = nltk.Tree.fromstring(text, read_leaf=lambda x: x.split("/")[0])
>>> print(tree.leaves())
['Jack', 'stayed', 'in', 'London']
lambda 形式拆分每个 word/tag
对并丢弃标签,只保留单词。
多棵树
我知道,您会问我如何处理整个文件的此类树,其中一些需要不止一行。那是 NLTK 的 BracketParseCorpusReader
的工作,但它希望终端采用 (POS word)
的形式而不是 word/POS
。我不会那样做,因为更容易欺骗 Tree.fromstring()
读取所有树,就好像它们是一棵树的分支一样:
allmytext = """
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
"""
wrapped = "(ROOT "+ allmytext + " )" # Add a "root" node at the top
trees = nltk.Tree.fromstring(wrapped, read_leaf=lambda x: x.split("/")[0])
for tree in trees:
print(tree.leaves())
如您所见,唯一的区别是我们在文件内容周围添加了 "(ROOT "
和 " )"
,并使用 for 循环生成输出。循环为我们提供了顶级节点的子节点,即实际的树。
当输出是这些时:
(CLAUSE (NP Jack/NNP) (VP loved/VBD) (NP Peter/NNP))
(CLAUSE (NP Jack/NNP) (VP stayed/VBD) (NP in/IN London/NNP))
(CLAUSE (NP Tom/NNP) (VP is/VBZ) (NP in/IN Kolkata/NNP))
很明显它们是树。所以只需使用 Tree.leaves()。
这是完整的代码:
def leaves(self):
"""
Return the leaves of the tree.
>>> t = Tree.fromstring("(S (NP (D the) (N dog)) (VP (V chased) (NP (D the) (N cat))))")
>>> t.leaves()
['the', 'dog', 'chased', 'the', 'cat']
:return: a list containing this tree's leaves.
The order reflects the order of the
leaves in the tree's hierarchical structure.
:rtype: list
"""
leaves = []
for child in self:
if isinstance(child, Tree):
leaves.extend(child.leaves())
else:
leaves.append(child)
return leaves