How did langid.py create the model binary as a string in code?

Langid.py is a popular language-detection library.

In the library's langid.py file, the model binary is embedded directly in the Python source as an encoded string:
model=b"""
QlpoOTFBWSZTWRcOrWEAUJJfgGAQAMN...
"""

The model is then loaded by the load_model() function, which calls LanguageIdentifier.from_modelstring():
def load_model(path = None):
  """
  Convenience method to set the global identifier using a model at a
  specified path.

  @param path to model
  """
  global identifier
  logger.info('initializing identifier')
  if path is None:
    identifier = LanguageIdentifier.from_modelstring(model)
  else:
    identifier = LanguageIdentifier.from_modelpath(path)

class LanguageIdentifier(object):
  """
  This class implements the actual language identifier.
  """

  @classmethod
  def from_modelstring(cls, string, *args, **kwargs):
    b = base64.b64decode(string)
    z = bz2.decompress(b)
    model = loads(z)
    nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output = model
    nb_numfeats = int(len(nb_ptc) / len(nb_pc))

    # reconstruct pc and ptc
    nb_pc = np.array(nb_pc)
    nb_ptc = np.array(nb_ptc).reshape(nb_numfeats, len(nb_pc))

How was the binary string for the model created?

Are there other examples of libraries that save and load models as binary files in a similar way?

You can reverse-engineer the serialization process simply by looking at how it is decoded.

Clearly, a b64decode -> decompress -> loads sequence of operations is happening. Moreover, the pickled object is evidently a mix of lists, numpy arrays, and other Python objects.

From this, if we arrange the operations in reverse, perhaps dumps -> compress -> b64encode was used?

import numpy as np
from pickle import loads, dumps
import bz2
import base64

# I don't actually know what model contains, but it most definitely holds
# at least two numpy arrays. Since they need to call np.array() around the
# recovered objects, the arrays were also most likely converted to lists
# beforehand.

model = [
    np.random.randn(50).tolist(),
    np.random.randn(10).tolist(),
    np.random.randn(5).tolist(),
    100,
    ["hello", "world"]
]

def serialize(model):
    serialized_str = dumps(model)
    serialized_str = bz2.compress(serialized_str)
    serialized_str = base64.b64encode(serialized_str)
    return serialized_str

serialized_model = serialize(model)
print(serialized_model)

This produces something like the following:

b'QlpoOTFBWSZTWZOjK1MAAKt////f/+/3/9v+/v2b5/7//fl+/9fv/f3xbv//7//f/Pf/sAGZYmIygxNAAADQZ...

Let's try to retrieve our objects from this string using their function:

def from_modelstring(string):
    b = base64.b64decode(string)
    z = bz2.decompress(b)
    model = loads(z)
    nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output = model
    nb_numfeats = int(len(nb_ptc) / len(nb_pc))

    # reconstruct pc and ptc
    nb_pc = np.array(nb_pc)
    nb_ptc = np.array(nb_ptc).reshape(nb_numfeats, len(nb_pc))
    return [nb_ptc, nb_pc, nb_classes, tk_nextmove, tk_output]

retrieved_model = from_modelstring(serialized_model)
for stuff in retrieved_model:
    print(stuff)

This prints back pretty much what we serialized earlier (except that the first array has been reshaped).

[[ 1.05455975  1.47333935  1.37442491 -0.02935783 -0.24073724 -1.49982221
  -0.20562748  1.00599094 -0.39817881  2.38135877]
 [-2.26547287  0.40649275 -0.42671883 -0.14154335  0.16647036 -0.4369942
   0.56737926  0.84126397 -1.80242939 -0.46906909]
 [ 0.24276755  1.00126493  0.42857048 -2.27383095 -0.39111637  1.72592306
  -0.41461467  2.70302884  0.21227391 -1.53374656]
 [-0.3529697  -0.58519877  0.01826065 -0.27764779 -0.36591068  0.01622645
   1.13080176  2.06702545  0.97302083 -0.32730124]
 [ 0.74106848  0.41801277  1.10355551 -0.46584239  1.08019501 -0.30003819
  -0.22321621 -0.66239601  0.87712623 -0.97101542]]
[ 0.45764773 -1.2864595   0.63190841 -0.70456336 -1.47569178  0.71870362
  0.1655068  -0.80424568 -0.64359963  1.38405498]
[2.1449875114268377, 0.4234196905494024, -0.27539676193149465, 0.37630564468835975, -0.1623772359974499]
100
['hello', 'world']
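
If you wanted to embed the resulting string in source code the way langid.py does, you could simply write it out as a Python literal. Below is a minimal sketch reusing the serialize() helper from above; the file name mymodel.py and the embed_model() function are my own hypothetical names, not anything from langid.py's actual packaging scripts:

def embed_model(model, path="mymodel.py"):
    # Serialize as above, then write the string out as a Python bytes
    # literal so it can be imported like langid.py's model = b"""...""" constant.
    serialized_str = serialize(model)  # b64-encoded, bz2-compressed pickle (bytes)
    with open(path, "w") as f:
        f.write('model = b"""')
        f.write(serialized_str.decode("ascii"))
        f.write('"""\n')

embed_model(model)
# "from mymodel import model" would now give a string that
# from_modelstring() above can decode again.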

I am not sure whether other libraries store their models in exactly this way, but quite a few of them (if not nearly all) definitely use pickle to binarize and dump their models. For example, PyTorch uses a combination of pickle and zipping in torch.save.
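
As a rough illustration of that (a sketch, not PyTorch's internals), a typical save/load round-trip with torch.save and torch.load looks like this; the tiny Linear module and the file name net.pt are just placeholders:

import torch

# A tiny placeholder model just for illustration.
net = torch.nn.Linear(10, 2)

# torch.save pickles the given object (here the state dict) and, in
# current PyTorch versions, stores it inside a zip-based archive.
torch.save(net.state_dict(), "net.pt")

# torch.load reads the archive and unpickles the object; the state dict
# can then be applied to a freshly constructed module.
state = torch.load("net.pt")
net2 = torch.nn.Linear(10, 2)
net2.load_state_dict(state)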