How do I tokenize text data into words and sentences without getting a type error
My end goal is to use an NER model to identify custom entities. Before doing that, I tokenize the text data into words and sentences. I have a folder of text (.txt) files that I open and read into Jupyter using the os library. After reading the text files, I get a type error whenever I try to tokenize them. Please advise what I am doing wrong? My code is below, thanks.
import os
outfile = open('result.txt', 'w')
path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
files = os.listdir(path)
for file in files:
    outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')
outfile.close()
This code runs fine, and whenever I inspect outfile I get the output below
outfile
<_io.TextIOWrapper name='result.txt' mode='w' encoding='cp1252'>
Next, the tokenization.
from nltk.tokenize import sent_tokenize, word_tokenize
sent_tokens = sent_tokenize(outfile)
print(outfile)
word_tokens = word_tokenize(outfile)
print(outfile)
But after running the code above I get an error. See the error below
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-22-62f66183895a> in <module>
1 from nltk.tokenize import sent_tokenize, word_tokenize
----> 2 sent_tokens = sent_tokenize(outfile)
3 print(outfile)
4
5 #word_tokens = word_tokenize(text)
~\AppData\Local\Continuum\anaconda3\envs\nlp_course\lib\site-packages\nltk\tokenize\__init__.py in sent_tokenize(text, language)
93 """
94 tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
---> 95 return tokenizer.tokenize(text)
96
97 # Standard word tokenizer.
TypeError: expected string or bytes-like object
(Answer moved from comments)
You are trying to process the file object rather than the text in the file. After creating the text file, re-open it and read the entire file before tokenizing.
Try this code:
import os
outfile = open('result.txt', 'w')
path = "C:/Users/okeke/Documents/Work flow/IT Text analytics Project/Extract/Dubuque_text-nlp"
files = os.listdir(path)
for file in files:
    with open(path + "/" + file) as f:
        outfile.write(f.read() + '\n')
    #outfile.write(str(os.stat(path + "/" + file).st_size) + '\n')
outfile.close() # done writing
from nltk.tokenize import sent_tokenize, word_tokenize
with open('result.txt') as outfile: # open for read
    alltext = outfile.read() # read entire file
    print(alltext)
    sent_tokens = sent_tokenize(alltext) # process file text. tokenize sentences
    word_tokens = word_tokenize(alltext) # process file text. tokenize words
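Once alltext has been tokenized, you can move toward the stated NER goal. Below is a minimal sketch (not part of the original answer) that runs NLTK's built-in POS tagger and named-entity chunker over the sentence tokens; it assumes the relevant NLTK data packages (punkt, averaged_perceptron_tagger, maxent_ne_chunker, words) are already downloaded, and a custom NER model would eventually replace the ne_chunk step.
import nltk
from nltk.tokenize import word_tokenize

# Continuing from sent_tokens produced above.
for sentence in sent_tokens:
    words = word_tokenize(sentence)   # split each sentence into word tokens
    tagged = nltk.pos_tag(words)      # POS tags are required by the chunker
    tree = nltk.ne_chunk(tagged)      # chunk tree with labels such as PERSON, ORGANIZATION, GPE
    print(tree)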