How to tokenize text with NLTK in Python

I have text like this:

Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() 
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist

I tokenized this text in Python with word_tokenize, and the output is:

Exception
org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid
cause
'org.hibernate.exception.SQLGrammarException
could
extract
ResultSet'
Caused
java.sql.SQLSyntaxErrorException
ORA-00942
table
view
exist

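But as you can see, the second line of the output contains several words joined together by dots. How can I split them into separate words?

I am using this Python code: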

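>>> from nltk.tokenize import word_tokenize
>>> f = open('001.txt')
>>> # `stopwords` is a collection of words to drop, defined elsewhere
>>> text = [w for w in word_tokenize(f.read()) if w not in stopwords]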

In fact, I want all the words to be separated like this:

Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
'org
hibernate
exception
SQLGrammarException
could
extract
ResultSet'
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist

f = "Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() \
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet' \
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist'"
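s = ''
# Replace every dot with a space, then split on single spaces.
f_list = f.replace('.', ' ').split(' ')
for item in f_list:
    # Put each piece on its own line (each line keeps a leading space).
    s = s + ' ' + item + '\n'

print(s)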

Output:

 Exception
 in
 org
 baharan
 dominant
 dao
 core
 nonPlanAllocation
 INonPlanAllocationRepository
 getAllGrid()
 with
 cause
 =
 'org
 hibernate
 exception
 SQLGrammarException:
 could
 not
 extract
 ResultSet'
 Caused
 by:
 java
 sql
 SQLSyntaxErrorException:
 ORA-00942:
 table
 or
 view
 does
 not
 exist'
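The same idea can be written more compactly with the standard-library re module. The following is only a sketch, not the approach used above: the helper name split_on_dots_and_spaces is made up for illustration, and the sample string is a shortened copy of the text from the question. Like the code above, it leaves punctuation such as `()`, `=` and `:` attached to the tokens.

```python
import re

# Shortened copy of the sample text from the question.
text = (
    "Exception in org.baharan.dominant.dao.core.nonPlanAllocation."
    "INonPlanAllocationRepository.getAllGrid() "
    "with cause = 'org.hibernate.exception.SQLGrammarException: "
    "could not extract ResultSet'"
)


def split_on_dots_and_spaces(s):
    """Split on runs of dots and/or whitespace, dropping empty pieces."""
    return [piece for piece in re.split(r"[.\s]+", s) if piece]


tokens = split_on_dots_and_spaces(text)
print("\n".join(tokens))
```

Splitting on a character class handles repeated spaces and newlines in one pass, so no empty strings end up in the result.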

I found a simple way: use RegexpTokenizer from nltk.tokenize, like this:

>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')

After removing stop words, the output is as follows:

Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
org
hibernate
exception
SQLGrammarException
could
extract
ResultSet
Caused
java
sql
SQLSyntaxErrorException
ORA
00942
table
view
exist
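
For a pattern without capturing groups, RegexpTokenizer should behave much like re.findall, so the effect of the pattern can be tried out with the standard library alone. The sketch below makes a few assumptions: STOPWORDS is a tiny made-up list (NLTK's English stop-word list is much longer), the tokenize helper is hypothetical, and the sample is only the last line of the text. Note that `\w` does not match a hyphen, so the pattern `\w+` splits a code such as ORA-00942 into ORA and 00942. A pattern like `\w+(?:-\w+)*` is one possible way to keep it whole.

```python
import re

# Last line of the sample text from the question.
text = (
    "Caused by: java.sql.SQLSyntaxErrorException: "
    "ORA-00942: table or view does not exist"
)

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"by", "or", "does", "not", "in", "with"}


def tokenize(s, pattern=r"\w+"):
    """Return every match of `pattern` in `s` that is not a stop word."""
    return [t for t in re.findall(pattern, s) if t.lower() not in STOPWORDS]


# Plain word characters only: the hyphenated error code is split in two.
plain = tokenize(text)

# Allow hyphen-joined runs of word characters: the error code stays whole.
hyphen_aware = tokenize(text, r"\w+(?:-\w+)*")

print(plain)
print(hyphen_aware)
```

Both patterns still split the dotted Java names, because a dot is neither a word character nor a hyphen.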