Accurate text generation
I have a chat app that works with predefined messages. The database has about 80 predefined conversations, each with 5 possible responses. To clarify, here's an example:
Q: "How heavy is a polar bear?"
R1: "Very heavy?"
R2: "Heavy enough to break the ice."
R3: "I don't know. Silly question."
R4: ...
R5: ...
Let's say the user picks R3: "I don't know. Silly question."
That response then has 5 possible responses of its own, for example:
R1: "Why is that silly?"
R2: "You're silly!"
R3: "Ugh. I'm done talking to you now."
R4: ...
R5: ...
Each of those responses again has 5 possible responses; after that, the conversation ends and a new one has to start.
To recap: I have 80 hand-written conversations, each with 5 possible responses, 3 levels deep = 10,000 messages in total.
My question: what would be the most accurate way to automatically generate more conversations using machine learning?
I've looked into RNNs (Karpathy's RNN post). While an RNN can produce new content based on old content, the new content is fairly random and nonsensical.
To get a better idea of how these conversations are used, visit http://getvene.com/ and watch the preview video.
I would probably start with a generative text model. There is a nice article that uses Python and Keras (though you can also build an LSTM recurrent neural network with TensorFlow). With a good, rich set of training data, the algorithm can indeed produce pretty interesting text output. As mentioned in that article, there is Project Gutenberg, where you can find a huge number of free books; that should provide a sufficient amount of training data. However, since you've probably already played with RNNs, I'll move on.
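To illustrate the kind of preprocessing those character-level Keras/LSTM tutorials rely on, here is a minimal sketch of the data-preparation step: slide a fixed-length window over the text to build (input sequence, next character) training pairs. The toy corpus and the window length are arbitrary choices for this sketch, not values from the article.

```python
# Build (sequence, next-char) training pairs for a character-level
# text-generation model, as typically done before feeding an LSTM.
text = "How heavy is a polar bear? Heavy enough to break the ice."
seq_len = 10  # arbitrary window length for this sketch

chars = sorted(set(text))
char_to_ix = {c: i for i, c in enumerate(chars)}  # character vocabulary

X, y = [], []
for i in range(len(text) - seq_len):
    window = text[i:i + seq_len]   # input: seq_len consecutive characters
    target = text[i + seq_len]     # label: the character that follows
    X.append([char_to_ix[c] for c in window])
    y.append(char_to_ix[target])

print(len(X), "training pairs over a vocabulary of", len(chars), "characters")
```

Each pair in `X`/`y` would then be one-hot encoded (or embedded) and fed to an LSTM that learns to predict the next character; sampling from that model repeatedly generates new text.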
Next is the relationship between a question and its possible responses. That tells me there is some semantics involved in your conversations: they are not random, and a generated response should at least try to "fit" as a somewhat relevant response. Something like Latent Dirichlet Allocation could find proper categories and topics from the data, but used in reverse: given a topic (the question), you need to find at least somewhat relevant data (the responses). Perhaps you could split the generated text into many parts, vectorize those parts, and use something like a document-distance algorithm to find the closest match. Another idea that could come in handy is Latent Semantic Analysis, since from the matrix of words/vectors you want to reduce the matrix as much as possible while still preserving the similarities.
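A tiny LSA sketch of that last idea, using plain NumPy: build a term-document count matrix, rank-reduce it with a truncated SVD, and compare a "question" document against candidate "responses" by cosine similarity in the reduced space. The three toy documents and the choice of k=2 latent dimensions are invented for illustration.

```python
import numpy as np

# Toy documents: a question plus one related and one unrelated response.
docs = [
    "polar bear heavy ice",         # question-like document
    "bear heavy break ice",         # related response
    "silly question done talking",  # unrelated response
]
vocab = sorted({w for d in docs for w in d.split()})
# Term-document count matrix: rows = words, columns = documents.
A = np.array([[d.split().count(w) for d in docs] for w in vocab], float)

# Truncated SVD: keep only k latent dimensions (LSA's rank reduction).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # each row: one document in latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_related = cosine(doc_vecs[0], doc_vecs[1])
sim_unrelated = cosine(doc_vecs[0], doc_vecs[2])
print(sim_related > sim_unrelated)  # the related response scores higher
```

With real data you would use TF-IDF weights instead of raw counts and a larger k, but the retrieval step is the same: rank candidate responses by their cosine similarity to the question in the latent space.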
I would suggest using PPDB (http://www.cis.upenn.edu/~ccb/ppdb/) to rephrase your phrases and expand your training data. Check out this paper, for example: https://www.aclweb.org/anthology/P/P16/P16-2.pdf#page=177. You can use a similar approach to paraphrase each sentence.
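The augmentation idea can be sketched as a simple phrase-for-paraphrase substitution: look up each phrase in a paraphrase table and emit one sentence variant per alternative. The tiny table below is invented for illustration; in practice you would load the mappings from the PPDB release files and filter them by their quality scores.

```python
# PPDB-style data augmentation sketch: multiply training sentences by
# substituting known phrases with their paraphrases.
# This toy paraphrase table is hypothetical, not taken from PPDB itself.
paraphrases = {
    "heavy": ["weighty", "hefty"],
    "silly": ["foolish", "ridiculous"],
}

def expand(sentence):
    """Return the original sentence plus one variant per paraphrase found."""
    variants = [sentence]
    for phrase, alts in paraphrases.items():
        if phrase in sentence:
            for alt in alts:
                variants.append(sentence.replace(phrase, alt))
    return variants

out = expand("How heavy is a polar bear?")
print(out)
# Original plus two variants with "heavy" swapped out.
```

Applying this to every message at every level of the dialogue tree multiplies the hand-written data while keeping the question-response semantics intact.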