Generator `max_length` being exceeded in query()

Question

Goal: set min_length and max_length in a Hugging Face Transformers generator query.

I passed 50 and 200 for these parameters. However, my output is much longer than that...

There is no runtime failure.

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

def query(payload, multiple, min_char_len, max_char_len):
    print(min_char_len, max_char_len)
    list_dict = generator(payload, min_length=min_char_len, max_length=max_char_len, num_return_sequences=multiple)
    test = [d['generated_text'].split(payload)[1].strip() for d in list_dict]
    for t in test: print(len(t))
    return test

query('example', 1, 50, 200)

Output:

50 200
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
1015

Answer 1

Explanation:

As Narsil explained in a response on a Hugging Face Transformers Git issue:

Models, don't ingest the text one character at a time, but one token at a time. There are different algorithms to achieve this but basically "My name is Nicolas" gets transformed into ["my", " name", " is", " nic", "olas"] for instance, and each of those tokens have a number.

So when you are generating tokens, they can contain themselves 1 or more characters (usually several and almost any common word for instance). That's why you are seeing 1015 instead of your expected 200 (the tokens here have an average of 5 chars)
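To make the quoted point concrete, here is a small sketch that compares the character count and the token count of the example sentence. The token list is the hand-written split from the quote above, not the output of a real tokenizer, so GPT-2 may well split the text differently.

```python
# Token split copied from the quoted example (an illustration only;
# a real tokenizer such as GPT-2's may produce different pieces).
text = "My name is Nicolas"
tokens = ["my", " name", " is", " nic", "olas"]

n_chars = len(text)      # what len() on the generated string measures
n_tokens = len(tokens)   # what min_length / max_length actually count
avg_chars_per_token = n_chars / n_tokens

print(f"{n_chars} characters, {n_tokens} tokens, "
      f"{avg_chars_per_token:.1f} characters per token on average")
```

A limit of 200 on max_length therefore bounds the number of tokens, and the resulting string can be several times longer when measured in characters.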

Solution:

How I resolved it...

Rename min_char_len, max_char_len to min_tokens, max_tokens and simply reduce their values to roughly 1/4 or 1/5 of the character counts you are aiming for.
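A minimal sketch of that fix might look like the following. The helper chars_to_tokens and its default of 5 characters per token are assumptions for illustration (the figure comes from the average mentioned in the quote, and the real ratio depends on the model and the text). The generator is passed in as an argument here so the function does not depend on a module-level pipeline.

```python
def chars_to_tokens(n_chars, avg_chars_per_token=5):
    """Rough character-to-token conversion (hypothetical helper).

    avg_chars_per_token=5 is an assumed average, not a fixed property
    of any tokenizer.
    """
    return max(1, n_chars // avg_chars_per_token)


def query(generator, payload, multiple, min_tokens, max_tokens):
    """Generate `multiple` continuations, with limits given in tokens."""
    results = generator(
        payload,
        min_length=min_tokens,
        max_length=max_tokens,
        num_return_sequences=multiple,
    )
    # Drop the prompt from the front of each generated text.
    return [d["generated_text"][len(payload):].strip() for d in results]


# Usage with the pipeline from the question (downloads the gpt2 model):
# from transformers import pipeline
# generator = pipeline('text-generation', model='gpt2')
# query(generator, 'example', 1, chars_to_tokens(50), chars_to_tokens(200))
```

With these defaults, the 50 and 200 character targets become 10 and 40 tokens. Note that max_length counts the prompt tokens as well as the generated ones, so the character length of the result is still only approximate.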

python-3.x

huggingface-transformers