GridSearchCV: can't pickle function error when trying to pass lambda in parameter

Question

I have searched extensively on Stack Overflow and elsewhere, but I can't seem to find an answer to the following problem.

I am trying to modify a parameter of a function that is itself passed as a parameter inside sklearn's GridSearchCV. More specifically, I want to change the parameters (here preserve_case=False) of the casual_tokenize function that is passed to the tokenizer parameter of CountVectorizer.

Here is the code:

from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from nltk import casual_tokenize

Generating dummy data from 20newsgroups:

categories = ['alt.atheism', 'comp.graphics', 'sci.med', 
              'soc.religion.christian']
twenty_train = fetch_20newsgroups(subset='train',
                               categories=categories,
                               shuffle=True,
                               random_state=42)

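Creating the classification pipeline.
Note that the tokenizer can be modified using a lambda. I would like to know whether there is another way to do this, because the lambda does not work with GridSearchCV.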

text_clf = Pipeline([('vect',
                      CountVectorizer(tokenizer=lambda text:
                                     casual_tokenize(text, 
                                     preserve_case=False))),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB()),
                    ])

text_clf.fit(twenty_train.data, twenty_train.target) # this works fine

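I then want to compare the default tokenizer of CountVectorizer with the one from nltk. Note that I am asking because I want to compare several tokenizers, each of which has specific parameters that need to be set.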

parameters = {'vect':[CountVectorizer(),
                       CountVectorizer(tokenizer=lambda text:
                                       casual_tokenize(text, 
                                       preserve_case=False))]}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=5)
gs_clf = gs_clf.fit(twenty_train.data[:100], twenty_train.target[:100])

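gs_clf.fit gives the following error: PicklingError: Can't pickle <function <lambda> at 0x1138c5598>: attribute lookup <lambda> on __main__ failed

So my questions are:
1) Does anybody know how to deal with this issue specifically with GridSearchCV?
2) Is there a better pythonic way of dealing with passing parameters to a function that will also be a parameter?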

Answer 1

1) Does anybody know how to deal with this issue specifically with GridSearchCV.

You can use partial instead of lambda:

from functools import partial
from joblib import dump  # sklearn.externals.joblib was removed in scikit-learn 0.23

def add(a, b):
    return a + b

plus_one = partial(add, b=1)
plus_one_lambda = lambda a: a + 1
dump(plus_one, 'add.pkl')          # No problem
dump(plus_one_lambda, 'add.pkl')   # Pickling error

In your case:

tokenizer=partial(casual_tokenize, preserve_case=False)
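Applied to the grid from the question, this could look like the following minimal sketch. The toy corpus, the labels, and the values n_jobs=2 and cv=2 are assumptions made only to keep the example small. They are not from the question, which used the 20newsgroups data.

```python
from functools import partial

from nltk import casual_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 4 documents per class, enough for 2-fold CV.
docs = [
    "The GPU renders graphics fast", "Graphics cards render images",
    "Rendering images on the screen", "Fast graphics on a new card",
    "The doctor prescribed medicine", "Medicine helps the patient",
    "A patient visits the doctor", "Doctors study medicine for years",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultinomialNB()),
])

# partial() stores a reference to casual_tokenize plus the keyword argument,
# so the vectorizer can be pickled and sent to worker processes.
parameters = {
    "vect": [
        CountVectorizer(),
        CountVectorizer(tokenizer=partial(casual_tokenize, preserve_case=False)),
    ]
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=2, cv=2)
gs_clf.fit(docs, labels)
print(gs_clf.best_params_)
```

Further tokenizers can be compared by adding more CountVectorizer entries to the list, each with its own partial. scikit-learn may warn that token_pattern is ignored when a tokenizer is supplied, which is harmless here.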

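2) Is there a better pythonic way of dealing with passing parameters to a function that will also be a parameter?

I think using either lambda or partial is a "pythonic way".

The problem here is that GridSearchCV uses multiprocessing. This means it may start several processes, so it has to serialize the parameters in one process and pass them to the others (the target processes then deserialize them to obtain the same parameters).

GridSearchCV uses joblib for multiprocessing and serialization, and the pickle-based serialization involved here cannot handle lambda functions: pickle stores a function as a reference to its importable name, and a lambda has no such name.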
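The difference can be reproduced with the standard pickle module alone. This is a small sketch and not part of the original answer. It uses the builtin int as a stand-in for the tokenizer so that it runs without any third-party package.

```python
import pickle
from functools import partial

# A partial around an importable callable pickles fine: pickle records
# a reference to `int` plus the stored keyword arguments.
parse_binary = partial(int, base=2)
restored = pickle.loads(pickle.dumps(parse_binary))
result = restored("101")  # behaves like int("101", base=2)

# A lambda has no importable name, so the standard pickler rejects it.
try:
    pickle.dumps(lambda s: int(s, base=2))
    lambda_pickled = True
except (pickle.PicklingError, AttributeError):
    lambda_pickled = False

print(result, lambda_pickled)
```

The same reasoning applies to partial(casual_tokenize, preserve_case=False), because casual_tokenize can be looked up by name in nltk.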

python

scikit-learn

grid-search