Using nlp.pipe() with pre-segmented and pre-tokenized text with spaCy
I'm trying to tag and parse text that has already been split into sentences and already been tokenized. For example:
sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]
The fastest way to process batches of text is .pipe(). However, it's not clear to me how to use it with pre-tokenized and pre-segmented text. Performance is key here. I tried the following, but it raised an error:
docs = [nlp.tokenizer.tokens_from_list(sentence) for sentence in sents]
nlp.tagger(docs)
nlp.parser(docs)
Traceback:
Traceback (most recent call last):
  File "C:\Python\Python37\Lib\multiprocessing\pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "C:\Python\projects\PreDicT\predicting-wte\build_id_dictionary.py", line 204, in process_batch
    self.nlp.tagger(docs)
  File "pipes.pyx", line 377, in spacy.pipeline.pipes.Tagger.__call__
  File "pipes.pyx", line 396, in spacy.pipeline.pipes.Tagger.predict
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\model.py", line 169, in __call__
    return self.predict(x)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feed_forward.py", line 40, in predict
    X = layer(X)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\model.py", line 169, in __call__
    return self.predict(x)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\model.py", line 133, in predict
    y, _ = self.begin_update(X, drop=None)
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feature_extracter.py", line 14, in begin_update
    features = [self._get_feats(doc) for doc in docs]
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feature_extracter.py", line 14, in <listcomp>
    features = [self._get_feats(doc) for doc in docs]
  File "C:\Users\bmvroy\.virtualenvs\predicting-wte-YKqW76ba\lib\site-packages\thinc\neural\_classes\feature_extracter.py", line 21, in _get_feats
    arr = doc.doc.to_array(self.attrs)[doc.start : doc.end]
AttributeError: 'list' object has no attribute 'doc'
Just replace the default tokenizer in the pipeline with nlp.tokenizer.tokens_from_list instead of calling it separately:
import spacy
nlp = spacy.load('en')
nlp.tokenizer = nlp.tokenizer.tokens_from_list
for doc in nlp.pipe([['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]):
    for token in doc:
        print(token, token.pos_)
Output:
I PRON
like VERB
cookies NOUN
. PUNCT
Do VERB
you PRON
? PUNCT
In spaCy v3, tokens_from_list no longer exists. Instead, you do this:
from spacy.tokens import Doc

class YourTokenizer:
    def __call__(self, your_doc_object):
        # Build the Doc yourself from whatever your input object holds
        return Doc(
            nlp.vocab,
            words=get_words(your_doc_object),
            spaces=get_spaces(your_doc_object)
        )

nlp.tokenizer = YourTokenizer()
doc = nlp(your_doc_object)
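A concrete variant of this pattern for the question's data (my own sketch, not part of the answer above: it assumes each pre-tokenized sentence can be re-joined with single spaces, and WhitespaceTokenizer is a name I made up):

import spacy
from spacy.tokens import Doc

class WhitespaceTokenizer:
    """Split on single spaces only; the input is assumed to be pre-tokenized."""
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        return Doc(self.vocab, words=words)

nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)

sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]
# Re-join each pre-tokenized sentence so nlp.pipe() receives plain strings.
for doc in nlp.pipe(" ".join(sent) for sent in sents):
    print([(token.text, token.pos_) for token in doc])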
Use a Doc object:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

for sent in sents:
    doc = Doc(nlp.vocab, sent)
    for token in nlp(doc):
        print(token.text, token.pos_)
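To keep the speed advantage of nlp.pipe() that the question asks about, the same idea also works in batch. A minimal sketch (my addition; it assumes a spaCy v3 release that accepts Doc objects in nlp.pipe(), which newer v3 versions do, so the tokenizer is skipped entirely):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
sents = [['I', 'like', 'cookies', '.'], ['Do', 'you', '?']]

# Build one Doc per pre-tokenized sentence, then tag/parse them in batch;
# tokenization is skipped because the inputs are already Doc objects.
docs = [Doc(nlp.vocab, words=sent) for sent in sents]
for doc in nlp.pipe(docs):
    print([(token.text, token.pos_) for token in doc])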