How to tokenize text with NLTK in Python
I have a text like this:
Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid()
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist
I tokenized this text in Python with word_tokenize, and the output is:
Exception
org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid
cause
'org.hibernate.exception.SQLGrammarException
could
extract
ResultSet'
Caused
java.sql.SQLSyntaxErrorException
ORA-00942
table
view
exist
But as you can see, the second line of the output contains several words joined by dots. How can I split them into separate words?
I use this Python code:
>>> from nltk.tokenize import word_tokenize
>>> f = open('001.txt')
>>> text = [w for w in word_tokenize(f.read()) if w not in stopwords]
In fact, I want every word to be separated like this:
Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
'org
hibernate
exception
SQLGrammarException
could
extract
ResultSet'
Caused
java
sql
SQLSyntaxErrorException
ORA-00942
table
view
exist
f = "Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid() \
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet' \
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist'"
s = ''
f_list = f.replace('.', ' ').split(' ')
for item in f_list:
    # print(item)
    s = s + ' ' + ''.join(item) + '\n'
print(s)
Output:
Exception
in
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid()
with
cause
=
'org
hibernate
exception
SQLGrammarException:
could
not
extract
ResultSet'
Caused
by:
java
sql
SQLSyntaxErrorException:
ORA-00942:
table
or
view
does
not
exist'
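The replace-then-split idea above can also be written as a single regular-expression split. This is only a minimal sketch using the standard library `re` module; the variable names are illustrative, and like the code above it leaves punctuation such as `()`, `:` and quotes attached to the tokens.

```python
import re

text = (
    "Exception in org.baharan.dominant.dao.core.nonPlanAllocation."
    "INonPlanAllocationRepository.getAllGrid() "
    "with cause = 'org.hibernate.exception.SQLGrammarException: "
    "could not extract ResultSet'"
)

# Split on any run of dots and/or whitespace in one pass,
# then drop empty strings that can appear at the edges.
tokens = [t for t in re.split(r"[.\s]+", text) if t]

print("\n".join(tokens))
```

Splitting on `\s` rather than a literal space means the same line would also cope with newlines if the text is read from a file.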
I found a simple way to do it, using RegexpTokenizer from nltk.tokenize like this:
>>> from nltk.tokenize import RegexpTokenizer
>>> tokenizer = RegexpTokenizer(r'\w+')
After removing the stop words, the output is as follows:
Exception
org
baharan
dominant
dao
core
nonPlanAllocation
INonPlanAllocationRepository
getAllGrid
cause
org
hibernate
exception
SQLGrammarException
could
extract
ResultSet
Caused
java
sql
SQLSyntaxErrorException
ORA
00942
table
view
exist
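One caveat: `\w` does not match a hyphen, so a plain `\w+` pattern splits `ORA-00942` into `ORA` and `00942`. The sketch below shows one possible pattern that keeps such codes whole. It uses `re.findall` from the standard library as a stand-in so that it runs without NLTK; as far as I know, RegexpTokenizer in its default (non-gap) mode behaves much the same way. The pattern, the `STOPWORDS` set and the variable names are my own assumptions, not part of the original code.

```python
import re

text = """Exception in org.baharan.dominant.dao.core.nonPlanAllocation.INonPlanAllocationRepository.getAllGrid()
with cause = 'org.hibernate.exception.SQLGrammarException: could not extract ResultSet'
Caused by: java.sql.SQLSyntaxErrorException: ORA-00942: table or view does not exist"""

# Hypothetical stop-word set for illustration only; in practice it could
# come from a real stop-word list such as the one shipped with NLTK.
STOPWORDS = {"in", "with", "not", "by", "or", "does"}

# \w+ matches a run of word characters; (?:-\w+)* optionally extends the
# match across hyphens, so codes like ORA-00942 stay in one piece.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*")

tokens = [t for t in TOKEN_PATTERN.findall(text) if t.lower() not in STOPWORDS]

print("\n".join(tokens))
```

Because the pattern uses only a non-capturing group, the same pattern string should also be usable as the argument to `RegexpTokenizer(...)`, though I have not verified that against every NLTK version.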