How to tokenize a large text into sentences and words
I am working with NLTK on Portuguese text.
These are my texts:
import nltk
import numpy as np
from nltk.corpus import machado, mac_morpho, floresta, genesis
from nltk.text import Text
ptext1 = Text(machado.words('romance/marm05.txt'), name="Memórias Póstumas de Brás Cubas (1881)")
ptext2 = Text(machado.words('romance/marm08.txt'), name="Dom Casmurro (1899)")
ptext3 = Text(genesis.words('portuguese.txt'), name="Gênesis")
ptext4 = Text(mac_morpho.words('mu94se01.txt'), name="Folha de Sao Paulo (1994)")
For example, I want to split ptext4 into sentences, and then into words:
sentencas = nltk.sent_tokenize(ptext4)
palavras = nltk.word_tokenize(ptext4)
But it doesn't work. The error is: expected string or bytes-like object
I tried this:
sentencas = [row for row in nltk.sent_tokenize(row)]
But the result is not what I expected:
[In]sentencas
[Out] ['Fujimori']
What should I do? I'm new to this.
word_token = list(ptext1)  # ptext1 already holds word tokens, so list() is enough
print(word_token[0:10])  # print the first 10 tokens
# output:
# ['Romance', ',', 'Memórias', 'Póstumas', 'de', 'Brás', 'Cubas', ',', '1880', 'Memórias']
# To get sentences with sent_tokenize, read the file as raw text instead
raw_text = machado.raw('romance/marm05.txt')
print(raw_text[0:100])  # print the first 100 characters of the raw text
# output:
# 'Romance, Memórias Póstumas de Brás Cubas, 1880\n\nMemórias Póstumas de\nBrás Cubas\n\nTexto-fonte:\nObra C'
sent_token = nltk.sent_tokenize(raw_text)
print(sent_token[0:2])  # print the first 2 sentences tokenized from the text
# output:
# ['Romance, Memórias Póstumas de Brás Cubas, 1880\n\nMemórias Póstumas de\nBrás Cubas\n\nTexto-fonte:\nObra Completa, Machado de\nAssis,\nRio\nde Janeiro: Editora Nova Aguilar, 1994.',
#  'Publicado originalmente em\nfolhetins, a partir de março de 1880, na Revista Brasileira.']
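As for why the original calls fail: as far as I can tell, nltk.text.Text wraps a list of tokens rather than a string, while sent_tokenize and word_tokenize expect a str. Here is a minimal sketch of that, using hand-made tokens so it runs without any corpus data. The token list and the name ptext are made up for illustration and are not taken from mac_morpho.

```python
from nltk.text import Text

# Hypothetical hand-made tokens, standing in for mac_morpho.words(...)
tokens = ["Fujimori", "chegou", "ontem", ".", "Ele", "falou", "pouco", "."]
ptext = Text(tokens, name="toy example")

# A Text is a sequence of tokens, not a string, so it is not a valid
# input for sent_tokenize() or word_tokenize().
is_string = isinstance(ptext, str)

# The words are already there: list() simply gives them back.
words = list(ptext)

# Indexing returns one token (a str). Tokenizing a single token in
# isolation can only give back that one word.
first = ptext[0]

print(is_string)
print(words[:4])
print(first)
```

This probably also explains the ['Fujimori'] result in the question: if row held a single token, sent_tokenize(row) had only that one word to return.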
If you only need the list of words from the machado corpus, use the .words() function.
>>> from nltk.corpus import machado
>>> machado.words()
But if you want to process the raw text, e.g.
>>> text = machado.raw('romance/marm08.txt')
>>> print(text)
use this idiom:
>>> from nltk import word_tokenize, sent_tokenize
>>> text = machado.raw('romance/marm08.txt')
>>> tokenized_text = [word_tokenize(sent) for sent in sent_tokenize(text)]
To iterate through tokenized_text, which is a list(list(str)), do the following:
>>> for sent in tokenized_text:
...     for word in sent:
...         print(word)
...     break  # stop after the first sentence
...
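Once you have that nested list, per-sentence statistics and a flat word list are both one-liners. A small sketch, with a hand-written tokenized_text standing in for the real one (the sample sentences are invented, not corpus output):

```python
from itertools import chain

# Hypothetical stand-in for the tokenized_text built from the corpus:
# one inner list of word tokens per sentence.
tokenized_text = [
    ["Fujimori", "chegou", "ontem", "."],
    ["Ele", "falou", "pouco", "."],
]

# Number of tokens in each sentence.
sentence_lengths = [len(sent) for sent in tokenized_text]

# One flat list with every token, in reading order.
all_words = list(chain.from_iterable(tokenized_text))

print(sentence_lengths)
print(all_words[:5])
```

Keeping the nested form as the primary structure means you can always recover the flat word list, while the reverse is not possible without tokenizing again.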
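So, following @qaiser and @alvas, there are two ways to answer this question. Both answers solve the problem, in different ways.
The second answer needs fewer lines of code: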
import numpy as np
from nltk.corpus import machado
import nltk
# To get sentences with sent_tokenize, read the file as raw text
raw_text = machado.raw('romance/marm05.txt')
word_token = nltk.word_tokenize(raw_text)
sent_token = nltk.sent_tokenize(raw_text)
[In]: print(sent_token[0:2])  # print the first 2 sentences tokenized from the text
[Out]: ['Romance, Memórias Póstumas de Brás Cubas, 1880\n\nMemórias Póstumas de\nBrás Cubas\n\nTexto-fonte:\nObra Completa, Machado de\nAssis,\nRio\nde Janeiro: Editora Nova Aguilar, 1994.', 'Publicado originalmente em\nfolhetins, a partir de março de 1880, na Revista Brasileira.']
[In]: print(word_token[0:20])  # print the first 20 words tokenized from the text
[Out]: ['Romance', ',', 'Memórias', 'Póstumas', 'de', 'Brás', 'Cubas', ',', '1880', 'Memórias', 'Póstumas', 'de', 'Brás', 'Cubas', 'Texto-fonte', ':', 'Obra', 'Completa', ',', 'Machado']
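Both answers use machado, while the question asks about ptext4, which comes from mac_morpho. As far as I know, mac_morpho is a tagged corpus that is already split into sentences and exposes them through .sents(), so there is no need to re-tokenize it. A hedged sketch of that idea follows. The helper sentences_and_words and the FakeReader class are hypothetical names added here so the example runs without corpus data.

```python
def sentences_and_words(reader, fileid):
    """Return (sentences, words) for one file of a pre-tokenized corpus.

    `reader` is assumed to expose .sents(fileid) returning one list of
    word tokens per sentence, as NLTK's tagged corpus readers do.
    """
    sentences = [list(sent) for sent in reader.sents(fileid)]
    words = [word for sent in sentences for word in sent]
    return sentences, words


class FakeReader:
    """Tiny hypothetical reader so the sketch runs without corpus data."""

    def sents(self, fileid):
        return [["Fujimori", "chegou", "."], ["Ele", "falou", "."]]


sentencas, palavras = sentences_and_words(FakeReader(), "toy.txt")
print(sentencas[0])
print(palavras[:4])
```

With the real corpus installed (for example via nltk.download('mac_morpho')), the call would be sentences_and_words(mac_morpho, 'mu94se01.txt') after from nltk.corpus import mac_morpho.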