query() of generator: `max_length` being exceeded
Goal: set min_length and max_length in a Hugging Face Transformers generator query.
I've passed 50 and 200 as these parameters. However, my output is much longer than that.
There is no runtime failure.
from transformers import pipeline, set_seed

generator = pipeline('text-generation', model='gpt2')
set_seed(42)

def query(payload, multiple, min_char_len, max_char_len):
    print(min_char_len, max_char_len)
    list_dict = generator(payload, min_length=min_char_len, max_length=max_char_len, num_return_sequences=multiple)
    test = [d['generated_text'].split(payload)[1].strip() for d in list_dict]
    for t in test: print(len(t))
    return test

query('example', 1, 50, 200)
Output:
50 200
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
1015
Explanation:
As Narsil explained in a response to a Hugging Face Transformers GitHub issue:
Models, don't ingest the text one character at a time, but one token
at a time. There are different algorithms to achieve this but
basically "My name is Nicolas" gets transformers into ["my", " name",
" is", " nic", "olas"] for instance, and each of those tokens have a
number.
So when you are generating tokens, they can contain themselves 1 or
more characters (usually several and almost any common word for
instance). That's why you are seeing 1015 instead of your expected 200
(the tokens here have an average of 5 chars)
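To make the token versus character distinction concrete, here is a small sketch in plain Python. The token list is copied from the quoted example and is illustrative only, not real tokenizer output; the 1015 and 200 figures come from the run above. Treating all 200 tokens as generated text is a simplification, so the ratio is approximate.

```python
# Token split copied from the quoted example (illustrative, not actual GPT-2 output).
tokens = ["my", " name", " is", " nic", "olas"]

# The text is just the tokens joined back together.
text = "".join(tokens)
n_tokens = len(tokens)
n_chars = len(text)
print(n_tokens, n_chars)  # 5 tokens cover 18 characters

# Figures from the run above: 1015 characters came back under a 200-token limit.
observed_chars = 1015
token_limit = 200
chars_per_token = observed_chars / token_limit
print(round(chars_per_token, 1))  # about 5 characters per token
```

The same length limit therefore means very different things depending on whether it is counted in tokens or in characters.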
Solution:
How I resolved it:
Rename min_char_len, max_char_len to min_tokens, max_tokens, and reduce their values to roughly 1/4 or 1/5 of the intended character counts, since each token averages about 4 to 5 characters.
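The rename and rescaling could be sketched as follows. `char_budget_to_token_budget` and its `chars_per_token` default of 5 are hypothetical helpers introduced here for illustration; the real ratio depends on the tokenizer and the text. `query` takes the generator as an argument so the sketch does not need to download a model, and it assumes `generated_text` contains the prompt, as in the question.

```python
import math


def char_budget_to_token_budget(min_chars, max_chars, chars_per_token=5.0):
    """Convert a character budget into an approximate token budget.

    chars_per_token is an assumption (about 5 in the run above), not a
    property guaranteed by the library.
    """
    min_tokens = math.ceil(min_chars / chars_per_token)
    max_tokens = math.floor(max_chars / chars_per_token)
    # Keep the upper bound at or above the lower bound.
    return min_tokens, max(min_tokens, max_tokens)


def query(generator, payload, multiple, min_tokens, max_tokens):
    """Same call as in the question, with the limits expressed in tokens."""
    list_dict = generator(
        payload,
        min_length=min_tokens,
        max_length=max_tokens,
        num_return_sequences=multiple,
    )
    # Split once on the prompt and keep only the continuation.
    return [d["generated_text"].split(payload, 1)[1].strip() for d in list_dict]


min_tokens, max_tokens = char_budget_to_token_budget(50, 200)
print(min_tokens, max_tokens)  # 10 40
```

With the pipeline from the question, this would be called as `query(generator, 'example', 1, min_tokens, max_tokens)`. Note that `max_length` counts the prompt's tokens as well as the generated ones, so the continuation can come out slightly shorter than the limit suggests.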