Trigrams comparison
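I'm fairly new to coding, so I'm probably missing an obvious answer. I'm sorry if this is a silly question, but I'm really stuck. I'm trying to compare two sets of trigrams taken from two different texts (A and B). If none of the trigrams in B occur in A, I'd say the two texts are different, at least for my current purposes. I'm using Nuve to extract the trigrams.
This is what I have so far: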
var paragraph = "This is not a phrase. This is not a sentence.";
var paragraph2 = "This is a phrase. This is a sentence. This have nothing to do with sentences.";
ITokenizer tokenizer = new ClassicTokenizer(true);
SentenceSegmenter segmenter = new TokenBasedSentenceSegmenter(tokenizer);
var sentences = segmenter.GetSentences(paragraph);
ITokenizer tokenizer2 = new ClassicTokenizer(true);
SentenceSegmenter segmenter2 = new TokenBasedSentenceSegmenter(tokenizer2);
var sentences2 = segmenter2.GetSentences(paragraph2);
var extractor = new NGramExtractor(3);
var grams1 = extractor.ExtractAsList(sentences);
var grams2 = extractor.ExtractAsList(sentences2);
var nonintersect = grams2.Except(grams1);
foreach (var nGram in nonintersect)
{
    var current = nGram;
    bool found = false;
    foreach (var n in grams2)
    {
        if (!found)
        {
            if (n == current)
            {
                found = true;
            }
        }
    }
    if (!found)
    {
        var result = current;
        string finalresult = Convert.ToString(result);
        textBox3.AppendText(finalresult + "\n");
    }
}
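This way I expected to get the sentences in B that do not occur in A (in this example, all of the sentences in B). But now I have to compare every trigram from B against every trigram from A to see whether the sentences really are different. I tried to do that with another nested foreach, but all I got was meaningless data: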
foreach (var sentence2 in sentences2)
{
    var actual = sentence2;
    bool found1 = false;
    foreach (var sentence in sentences)
    {
        if (!found1)
        {
            if (actual == sentence)
            {
                found1 = true;
            }
        }
    }
    if (!found1)
    {
        string finalresult = Convert.ToString(actual);
        textBox3.AppendText(finalresult + "\n");
    }
}
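With this I am trying to check whether the trigrams of each sentence in B equal the trigrams of each sentence in A; if they do, textBox3 should stay empty.
In short, I'm trying to write something similar to Ferret, but in C#, and only for comparing two given plain texts. As far as I know, nothing like that exists for C# yet.
Any help or hints would be much appreciated. Thanks!
Comparing bodies of text
Comparing two bodies of text and marking them as similar if they have at least one sentence-level trigram in common is fairly straightforward: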
public bool AreTextsSimilar(string a, string b)
{
    // We can reuse these objects - they could be stored in member fields:
    ITokenizer tokenizer = new ClassicTokenizer(true);
    SentenceSegmenter segmenter = new TokenBasedSentenceSegmenter(tokenizer);
    NGramExtractor trigramExtractor = new NGramExtractor(3);
    IEnumerable<string> sentencesA = segmenter.GetSentences(a);
    IEnumerable<string> sentencesB = segmenter.GetSentences(b);
    // The order of trigrams doesn't matter, so we'll fetch them as sets instead,
    // to make comparisons between their elements more efficient:
    ISet<NGram> trigramsA = trigramExtractor.ExtractAsSet(sentencesA);
    ISet<NGram> trigramsB = trigramExtractor.ExtractAsSet(sentencesB);
    // 'Intersect' returns all elements that are found in both collections:
    IEnumerable<NGram> sharedTrigrams = trigramsA.Intersect(trigramsB);
    // 'Any' only returns true if the collection isn't empty:
    return sharedTrigrams.Any();
}
Without the Linq methods (Intersect, Any), the last two lines can be implemented as a loop:
    foreach (NGram trigramA in trigramsA)
    {
        // As soon as we find a shared sentence trigram we can conclude that
        // the two bodies of text are indeed similar:
        if (trigramsB.Contains(trigramA))
            return true;
    }
    return false;
}
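To try the set-based idea without depending on Nuve, here is a minimal, library-free sketch. NGramSketch, NGrams and ShareAnyNGram are hypothetical names chosen for illustration, not part of Nuve, and the n-grams are plain joined strings rather than Nuve's NGram objects. The items can be words or whole sentences, as long as they have already been split:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NGramSketch
{
    // Joins every run of n consecutive items into one key and collects the keys in a set.
    public static HashSet<string> NGrams(IReadOnlyList<string> items, int n)
    {
        var grams = new HashSet<string>();
        for (int i = 0; i + n <= items.Count; i++)
        {
            // '\u001f' (unit separator) is an arbitrary delimiter,
            // assumed not to occur inside the items themselves.
            grams.Add(string.Join("\u001f", items.Skip(i).Take(n)));
        }
        return grams;
    }

    // True if the two sequences have at least one n-gram in common.
    public static bool ShareAnyNGram(IReadOnlyList<string> a, IReadOnlyList<string> b, int n = 3)
    {
        // HashSet<T>.Overlaps stops at the first shared element it finds.
        return NGrams(a, n).Overlaps(NGrams(b, n));
    }
}
```

For example, the word lists { "this", "is", "not", "a", "phrase" } and { "is", "not", "a", "sentence" } share the trigram "is not a", so ShareAnyNGram returns true for them, while { "this", "is", "a", "phrase" } shares no trigram with the first list.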
Sentences with no shared word-trigrams
Retrieving all sentences that share no word-level trigrams takes a bit more work:
public IEnumerable<string> GetUniqueBSentences(string a, string b)
{
    // We can reuse these objects - they could be stored in member fields:
    ITokenizer tokenizer = new ClassicTokenizer(true);
    SentenceSegmenter segmenter = new TokenBasedSentenceSegmenter(tokenizer);
    NGramExtractor trigramExtractor = new NGramExtractor(3);
    IEnumerable<string> sentencesA = segmenter.GetSentences(a);
    IEnumerable<string> sentencesB = segmenter.GetSentences(b);
    ITokenizer wordTokenizer = new ClassicTokenizer(false);
    foreach (string sentenceB in sentencesB)
    {
        IList<string> wordsB = wordTokenizer.Tokenize(sentenceB);
        ISet<NGram> wordTrigramsB = trigramExtractor.ExtractAsSet(wordsB);
        bool foundMatchingSentence = false;
        foreach (string sentenceA in sentencesA)
        {
            // This will be repeated for every sentence in B. It would be more efficient
            // to generate trigrams for all sentences in A once, before we enter these loops:
            IList<string> wordsA = wordTokenizer.Tokenize(sentenceA);
            ISet<NGram> wordTrigramsA = trigramExtractor.ExtractAsSet(wordsA);
            if (wordTrigramsA.Intersect(wordTrigramsB).Any())
            {
                // We found a sentence in A that shares word-trigrams, so stop comparing:
                foundMatchingSentence = true;
                break;
            }
        }
        // No matching sentence in A? Then this sentence is unique to B:
        if (!foundMatchingSentence)
            yield return sentenceB;
    }
}
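Apparently the segmenter also returns an extra, empty sentence, which you may want to filter out (or figure out how to prevent the segmenter from doing so).
If performance is a concern, I'm sure the code above can be optimized, but I'll leave that to you.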
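As a starting point for that optimization, here is a rough sketch that builds the word-trigrams of A only once and drops empty sentences. It deliberately avoids Nuve so it can be run on its own: SplitSentences and WordTrigrams are hypothetical, naive stand-ins for the segmenter and tokenizer (they split on punctuation and spaces, which is an assumption and not how Nuve behaves), and UniqueSentenceSketch is just an illustrative name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UniqueSentenceSketch
{
    // Naive stand-in for a sentence segmenter (assumption, not Nuve's behaviour):
    // split on '.', '!' and '?', trim, and drop empty sentences.
    static IEnumerable<string> SplitSentences(string text) =>
        text.Split(new[] { '.', '!', '?' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(s => s.Trim())
            .Where(s => s.Length > 0);

    // Naive stand-in for a word tokenizer plus trigram extractor:
    // lower-case the sentence, split on spaces and a few separators,
    // then yield every run of three consecutive words.
    static IEnumerable<string> WordTrigrams(string sentence)
    {
        string[] words = sentence.ToLowerInvariant()
            .Split(new[] { ' ', ',', ';', ':' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 3 <= words.Length; i++)
            yield return string.Join(" ", words, i, 3);
    }

    public static List<string> GetUniqueBSentences(string a, string b)
    {
        // Built once, before looking at B: every word-trigram of every sentence in A.
        var trigramsA = new HashSet<string>(
            SplitSentences(a).SelectMany(s => WordTrigrams(s)));

        // A sentence of B is kept only if none of its trigrams occurs anywhere in A.
        return SplitSentences(b)
            .Where(sentenceB => !WordTrigrams(sentenceB).Any(t => trigramsA.Contains(t)))
            .ToList();
    }
}
```

Merging all of A's trigrams into one set gives the same answer as comparing sentence by sentence, because a sentence in B matches some sentence in A exactly when one of its trigrams appears in that merged set. With the two paragraphs from the question, this sketch returns all three sentences of B. Note that a sentence with fewer than three words has no trigrams, so it is always reported as unique.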