Word2Vec + LSTM on API Sequence

Question

I am trying to apply word2vec and an LSTM to a dataset of per-file API trace logs, which contain API function calls and their parameters, for binary classification.

The data looks like this:

File_ID,    Label,   API Trace log
 1,           M,      kernel32 LoadLibraryA kernel32.dll
                      kernel32 GetProcAddress MZ\x90 ExitProcess
                      ...

 2,           V,     kernel32 GetModuleHandleA RPCRT4.dll
                     kernel32 GetCurrentThreadId d\x8B\x0D0 POINTER POINTER
                     ...

Each API trace consists of a module name, an API function name, and parameters, separated by spaces.

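Taking the first API trace of file 1 as an example, kernel32 is the module name, LoadLibraryA is the function name, and kernel32.dll is the parameter. API traces are separated by \n, so each line holds the information for one API call, in order.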
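To make the format concrete, here is a minimal sketch that splits a trace into its module, function, and parameter list. The helper name `parse_trace_line` is hypothetical, and it assumes the fields are separated by whitespace as described above.

```python
def parse_trace_line(line):
    """Split one API trace line into (module, function, parameters).

    Assumes the layout described above: module name, function name,
    then zero or more whitespace-separated parameters.
    """
    tokens = line.strip().split()
    if len(tokens) < 2:
        raise ValueError("expected at least a module and a function name")
    module, function = tokens[0], tokens[1]
    parameters = tokens[2:]
    return module, function, parameters


# The two calls of file 1, separated by a newline as in the log.
trace = "kernel32 LoadLibraryA kernel32.dll\nkernel32 GetProcAddress MZ\\x90 ExitProcess"
calls = [parse_trace_line(line) for line in trace.split("\n")]
```

Each entry of `calls` keeps the call boundaries that are lost when the whole file is flattened into one token sequence.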

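First, I trained a word2vec model on the lines of all API trace logs, treating each line as a sentence. There are about 5k API function calls, e.g. LoadLibraryA and GetProcAddress. However, because parameter values vary, the model becomes quite large once the parameters are included (a vocabulary of 300,000).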

After that, I trained an LSTM using the embedding_weights from word2vec. The model structure is as follows:

model = Sequential() 
model.add(Embedding(output_dim=vocab_dim, input_dim=n_symbols, \
                mask_zero=False, weights=[embedding_weights], \
                trainable=False))
model.add(LSTM(dense_dim,kernel_initializer='he_normal', dropout=0.15, 
recurrent_dropout=0.15, implementation=2))
model.add(Dropout(0.3))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=epochs, batch_size=batch_size, callbacks=[early_stopping, parallel_check_cb])

I obtain embedding_weights by building a matrix with one row per vocabulary entry of the word2vec model, mapping each word's index in the model to its vector:

def create_embedding_weights(model, max_index=0):
    # dimensionality of your word vectors
    num_features = len(model[next(iter(model.vocab))])  # works on Python 2 and 3
    n_symbols = len(model.vocab) + 1  # adding 1 to account for 0th index (for masking)
    # Only word2vec feature set
    embedding_weights = np.zeros((max(n_symbols + 1, max_index + 1), num_features))
    for word, value in model.vocab.items():
        embedding_weights[value.index, :] = model[word]

    return embedding_weights

For the training data, I convert every word in the API calls to its index in the word2vec model so that it matches the indices in embedding_weights above, e.g. kernel32 -> 0, LoadLibraryA -> 1, kernel32.dll -> 2, GetProcAddress -> 4, MZ\x90 -> 5, ExitProcess -> 6.

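So the training data for file 1 looks like [0, 1, 2, 3, 4, 5, 6]. Note that I did not split the sequence by line for each API trace. As a result, the model may not know where an API trace starts and ends. And the training accuracy is poor: 50% :(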

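My question is: when preparing the training and validation datasets, should I also split by line when mapping the words to their indices? The training data above would then become the following, with one row per API trace and missing values padded with -1, which does not exist among the word2vec indices.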

[[0, 1, 2, -1]
 [3, 4, 5, 6]]

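I am also using a very simple structure for training while the word2vec model is quite large, so any suggestions about the structure would be appreciated as well.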

Answer 1

I would split each trace line into at least three parts:

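Module (make a dictionary and an embedding)
Function (make a dictionary and an embedding)
Parameters (make a dictionary and an embedding; see details below)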

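Since this is a very specific application, I believe it is best to keep the embeddings trainable (the whole point of embeddings is to create meaningful vectors, and the meaning depends heavily on the model that is going to use them. Question: how did you create the word2vec model? What data did it learn from?).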

This model will have more inputs. All of them are integers from zero to the maximum dictionary index. Consider using mask_zero=True and padding all files to maxFileLines.

moduleInput = Input((maxFileLines,))
functionInput = Input((maxFileLines,))

For the parameters, I would probably create a subsequence, as if the list of parameters were a sentence. (Again, mask_zero=True, and pad up to maxNumberOfParameters.)

parametersInput = Input((maxFileLines, maxNumberOfParameters))

Function and module embeddings:

moduleEmb = Embedding(.....mask_zero=True,)(moduleInput)    
functionEmb = Embedding(.....mask_zero=True)(functionInput)

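Now, for the parameters, I want to create a sequence of sequences (maybe this is too much). For that, I first move the lines dimension into the batch dimension and work only with length = maxNumberOfParameters: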

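#move the lines dimension into the batch dimension: (batch*lines, params)
paramEmb = Lambda(lambda x: K.reshape(x,(-1,maxNumberOfParameters)))(parametersInput)
paramEmb = Embedding(....,mask_zero=True)(paramEmb)
#restore the lines dimension; each line keeps all of its parameter vectors
paramEmb = Lambda(lambda x: K.reshape(x,(-1,maxFileLines,maxNumberOfParameters*embeddingSize)))(paramEmb)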
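To check the shape bookkeeping of this reshape, embed, reshape step outside Keras, here is a small NumPy sketch. The sizes and the random lookup table are made up for illustration, and a plain array lookup stands in for the Embedding layer. Note that for the batch dimension to come back out as the number of files, the last axis has to hold maxNumberOfParameters * embeddingSize values per line.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration.
batch, maxFileLines, maxNumberOfParameters = 2, 4, 3
vocabSize, embeddingSize = 10, 5

rng = np.random.default_rng(0)
# Integer parameter indices (0 would be the padding index).
params = rng.integers(0, vocabSize, size=(batch, maxFileLines, maxNumberOfParameters))
# Stand-in for the Embedding layer: one vector per index.
table = rng.normal(size=(vocabSize, embeddingSize))

# 1) Move the lines dimension into the batch dimension.
flat = params.reshape(-1, maxNumberOfParameters)   # (batch*lines, params)
# 2) Look up the embeddings.
embedded = table[flat]                             # (batch*lines, params, emb)
# 3) Restore the lines dimension, keeping every parameter vector of a line.
restored = embedded.reshape(-1, maxFileLines, maxNumberOfParameters * embeddingSize)

print(flat.shape, embedded.shape, restored.shape)
```

After step 3 the first axis is the number of files again, so the result can be concatenated with the module and function embeddings along the last axis.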

Now we concatenate all of them along the last dimension, and we are ready to go into the LSTM:

joinedEmbeddings = Concatenate()([moduleEmb,functionEmb,paramEmb])
out = LSTM(...)(joinedEmbeddings)
out = ......

model = Model([moduleInput,functionInput,parametersInput], out)

How to prepare the inputs

With this model you need three separate inputs: one for the modules, one for the functions, and one for the parameters.

These inputs contain only indices (no vectors), and they do not need a previous word2vec model. The embeddings are the word2vec transformers.

So, take the file lines and split them. First we split by commas, then we split the API calls by spaces:

import numpy as np

#read the file
with open(fileName,'r') as loadedFile:
    allLines = [l.strip() for l in loadedFile.readlines()]

#split by commas
splitLines = []
for l in allLines[1:]: #use 1 here only if you have headers in the file
    splitLines.append (l.split(','))
splitLines = np.array(splitLines)

#get the split values and separate ids, targets and calls
ids = splitLines[:,0]
targets = splitLines[:,1]
calls = splitLines[:,2]

#split the calls by space, adding dummy parameters (spaces) to the max length
splitCalls = []
for c in calls:
    splitC = c.strip().split(' ')

    #pad the parameters (space for dummy params)
    for i in range(len(splitC),maxParams+2):
        splitC.append(' ') 

    splitCalls.append(splitC)

splitCalls = np.array(splitCalls)

modules = splitCalls[:,0]
functions = splitCalls[:,1]
parameters = splitCalls[:,2:] #notice the parameters have an extra dimension

Now let's make the indices:

modIndices, modCounts = np.unique(modules,return_counts=True)
funcIndices, funcCounts = np.unique(functions,return_counts=True)

#for the parameters, let's flatten the array first (because we have 2 dimensions)
flatParams = parameters.reshape((parameters.shape[0]*parameters.shape[1],))
paramIndices, paramCounts = np.unique(flatParams,return_counts=True)

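These create a list of unique words and get their counts. Here you can customize which words you are going to group into an "another word" class (maybe based on the counts: if a count is too small, make it an "another word").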
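One possible way to do that count-based grouping is sketched below. The token name `<other>`, the helper `group_rare_words`, and the threshold are all assumptions chosen for illustration, not part of the original recipe.

```python
from collections import Counter

OTHER = "<other>"  # hypothetical placeholder token for rare words


def group_rare_words(words, min_count=2):
    """Replace every word seen fewer than min_count times with OTHER."""
    counts = Counter(words)
    return [w if counts[w] >= min_count else OTHER for w in words]


params = ["kernel32.dll", "MZ\\x90", "kernel32.dll", "POINTER", "POINTER", "RPCRT4.dll"]
grouped = group_rare_words(params, min_count=2)
```

Applying this to the parameters before building the dictionary is one way to keep a vocabulary of 300,000 mostly one-off values from dominating the embedding.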

Next, we make the dictionaries:

def createDic(uniqueWords):
    dic = {}
    for i,word in enumerate(uniqueWords):
         dic[word] = i + 1 # +1 because we want to reserve the zeros for padding     
    return dic

Be careful with the parameters, because we used a dummy space there:

moduleDic = createDic(modIndices)
funcDic = createDic(funcIndices)
paramDic = createDic(paramIndices[1:]) #make sure the space got the first position here    
paramDic[' '] = 0
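The slice `paramIndices[1:]` relies on the dummy space sorting first among the unique values. A sketch that does not depend on the sort order is shown here; the helper name `create_dic_with_padding` is hypothetical.

```python
PAD = " "  # the dummy parameter used for padding


def create_dic_with_padding(unique_words, pad_token=PAD):
    """Map pad_token to 0 and every other word to 1, 2, 3, ..."""
    dic = {pad_token: 0}
    next_index = 1
    for word in unique_words:
        if word == pad_token:
            continue  # padding already has index 0
        dic[word] = next_index
        next_index += 1
    return dic


paramDic = create_dic_with_padding(["POINTER", " ", "kernel32.dll"])
```

This keeps index 0 reserved for padding wherever the dummy value happens to appear in the list.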

Now we just replace the original values:

moduleData = [moduleDic[word] for word in modules]
funcData = [funcDic[word] for word in functions]
paramData = [[paramDic[word] for word in paramLine] for paramLine in parameters]

Pad them:

for i in range(len(moduleData),maxFileLines):
    moduleData.append(0)
    funcData.append(0)
    paramData.append([0] * maxParams)
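The loop above only pads, so a file with more than maxFileLines calls keeps its extra rows. Here is a sketch of a hypothetical helper, `pad_or_truncate`, that handles both cases. Truncating long files is an assumption, not something prescribed here.

```python
import copy


def pad_or_truncate(rows, length, pad_value):
    """Return a copy of rows with exactly `length` entries."""
    rows = list(rows[:length])  # drop anything beyond `length`
    while len(rows) < length:
        # copy so that padded rows do not share one list object
        rows.append(copy.copy(pad_value))
    return rows


maxFileLines, maxParams = 4, 2
moduleData = pad_or_truncate([3, 1], maxFileLines, 0)
paramData = pad_or_truncate([[5, 0], [2, 7]], maxFileLines, [0] * maxParams)
```

With a fixed length per file, the per-file lists stack cleanly into rectangular arrays.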

Do this for every file and store the results in lists of files:

moduleTrainData = []  
functionTrainData = []
paramTrainData = []
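for fileName in fileNames: #fileNames: your list of trace files (hypothetical name)
    #repeat the steps above to get moduleData, funcData and paramData for this file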
    moduleTrainData.append(moduleData)
    functionTrainData.append(funcData)
    paramTrainData.append(paramData)

moduleTrainData = np.asarray(moduleTrainData)
functionTrainData = np.asarray(functionTrainData)
paramTrainData = np.asarray(paramTrainData)

That's all for the inputs.

model.fit([moduleTrainData,functionTrainData,paramTrainData],outputLabels,...)
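`outputLabels` is not defined in the snippets above. One way to build it from the label column is sketched below, assuming `M` is treated as the positive class (an assumption; the question does not say which label is which). The helper name `encode_labels` is hypothetical.

```python
def encode_labels(targets, positive="M"):
    """Map label strings to 1 (positive) or 0 (anything else)."""
    return [1 if t.strip() == positive else 0 for t in targets]


# Labels as they might come out of a comma split, with stray spaces.
outputLabels = encode_labels([" M", " V", "M"])
```

A 0/1 target like this matches a single sigmoid output unit trained for binary classification.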
