是否有可能找出字符串列表中的公共部分
Is It possible to find out what are the common part in String List
我正在努力找出字符串列表中的公共字符串部分。如果我们取样本数据集
private readonly List<string> Xpath = new List<string>()
{
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(1)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(2)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(3)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(4)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(5)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(6)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(7)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(8)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(9)>H2:nth-of-type(1)"
};
由此,我想找出这些 children 与哪些相似。 data 是一个 Xpath 列表。
以编程方式我应该能够告诉
预期输出:
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV
为了得到这个我做了这样的事情。我用 >
分隔每个项目,然后最初为每个数据集创建一个项目列表。
然后用这个找出独特的物品是什么
private IEnumerable<T> GetCommonItems<T>(IEnumerable<T>[] lists)
{
HashSet<T> hs = new HashSet<T>(lists.First());
for (int i = 1; i < lists.Length; i++)
{
hs.IntersectWith(lists[i]);
}
return hs;
}
能够找出唯一值并再次创建数据集。但是如果这包含 Ex:- Div 在两个地方并且它也在每个原始数据集中,那么即使这样这个方法也只会选择一个 Div.
从那时起我会得到这样的东西:
BODY>MAIN:nth-of-type(1)>DIV>SECTION
但我需要这个
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-
type(3)>DIV>ARTICLE>DIV>DIV>DIV
免责声明:这不是最高效的解决方案,但它有效:)
- 让我们从第一个路径开始
>
个字符
- 对所有路径执行相同的操作
char separator = '>';
IEnumerable<string> firstPathChunks = Xpath[0].Split(separator);
var chunks = Xpath.Select(path => path.Split(separator).ToList()).ToArray();
- 遍历
firstPathChunks
- 遍历
chunks
- 如果匹配则删除第一个元素
- 如果删除所有第一个元素,则将匹配的前缀附加到
sb
void Process(StringBuilder sb)
{
foreach (var pathChunk in firstPathChunks)
{
foreach (var chunk in chunks)
{
if (chunk[0] != pathChunk)
{
return;
}
chunk.RemoveAt(0);
}
sb.Append(pathChunk);
sb.Append(separator);
}
}
示例用法
var sb = new StringBuilder();
Process(sb);
Console.WriteLine(sb.ToString());
输出
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>
用分隔符 >
解析字符串是个好主意。而不是然后创建一个唯一项目的列表,你应该创建一个包含在字符串中的所有项目的列表,这将导致
{
"BODY",
"MAIN:nth-of-type(1)",
"DIV",
"SECTTION",
"DIV",
...
}
对于您的 XPath 列表的第一个条目。
通过这种方式,您可以创建一个 List<List<string>>
,其中包含 XPath 列表中每个条目的每个元素。然后您可以比较内部列表的所有第一个元素。如果它们相等,则将该元素值保存到您的输出并继续处理所有第二个元素,依此类推,直到您在所有外部列表中找到一个不相等的元素。
编辑:
用 >
分隔符分隔您的列表后,它可能看起来像这样:
List<List<string>> XPathElementsLists;
List<string> resultElements = new List<string>();
string result;
XPathElementsLists = ParseElementsFormXPath(XPath);
for (int i = 0; i < XPathElementsLists[0].Count; i++)
{
bool isEqual = true;
string compareElemment = XPathElementsLists[0][i];
foreach (List<string> element in XPathElementsLists)
{
if (!String.Equals(compareElemment, element))
{
isEqual = false;
break;
}
}
if (!isEqual)
{
break;
}
resultElements.Add(compareElemment);
}
result = String.Join(">", resultElements.ToArray());
我正在努力找出字符串列表中的公共字符串部分。如果我们取样本数据集
private readonly List<string> Xpath = new List<string>()
{
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(1)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(2)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(3)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(4)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(5)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(6)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(7)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(8)>H2:nth-of-type(1)",
"BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>SECTION:nth-of-type(9)>H2:nth-of-type(1)"
};
由此,我想找出这些 children 与哪些相似。 data 是一个 Xpath 列表。 以编程方式我应该能够告诉
预期输出:
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV
为了得到这个我做了这样的事情。我用 >
分隔每个项目,然后最初为每个数据集创建一个项目列表。
然后用这个找出独特的物品是什么
private IEnumerable<T> GetCommonItems<T>(IEnumerable<T>[] lists)
{
HashSet<T> hs = new HashSet<T>(lists.First());
for (int i = 1; i < lists.Length; i++)
{
hs.IntersectWith(lists[i]);
}
return hs;
}
能够找出唯一值并再次创建数据集。但是如果这包含 Ex:- Div 在两个地方并且它也在每个原始数据集中,那么即使这样这个方法也只会选择一个 Div.
从那时起我会得到这样的东西:
BODY>MAIN:nth-of-type(1)>DIV>SECTION
但我需要这个
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of- type(3)>DIV>ARTICLE>DIV>DIV>DIV
免责声明:这不是最高效的解决方案,但它有效:)
- 让我们从第一个路径开始
>
个字符 - 对所有路径执行相同的操作
char separator = '>';
IEnumerable<string> firstPathChunks = Xpath[0].Split(separator);
var chunks = Xpath.Select(path => path.Split(separator).ToList()).ToArray();
- 遍历
firstPathChunks
- 遍历
chunks
- 如果匹配则删除第一个元素
- 如果删除所有第一个元素,则将匹配的前缀附加到
sb
- 遍历
void Process(StringBuilder sb)
{
foreach (var pathChunk in firstPathChunks)
{
foreach (var chunk in chunks)
{
if (chunk[0] != pathChunk)
{
return;
}
chunk.RemoveAt(0);
}
sb.Append(pathChunk);
sb.Append(separator);
}
}
示例用法
var sb = new StringBuilder();
Process(sb);
Console.WriteLine(sb.ToString());
输出
BODY>MAIN:nth-of-type(1)>DIV>SECTION>DIV>SECTION>DIV>DIV:nth-of-type(1)>DIV>DIV:nth-of-type(3)>DIV>ARTICLE>DIV>DIV>DIV>
用分隔符 >
解析字符串是个好主意。而不是然后创建一个唯一项目的列表,你应该创建一个包含在字符串中的所有项目的列表,这将导致
{
"BODY",
"MAIN:nth-of-type(1)",
"DIV",
"SECTTION",
"DIV",
...
}
对于您的 XPath 列表的第一个条目。
通过这种方式,您可以创建一个 List<List<string>>
,其中包含 XPath 列表中每个条目的每个元素。然后您可以比较内部列表的所有第一个元素。如果它们相等,则将该元素值保存到您的输出并继续处理所有第二个元素,依此类推,直到您在所有外部列表中找到一个不相等的元素。
编辑:
用 >
分隔符分隔您的列表后,它可能看起来像这样:
List<List<string>> XPathElementsLists;
List<string> resultElements = new List<string>();
string result;
XPathElementsLists = ParseElementsFormXPath(XPath);
for (int i = 0; i < XPathElementsLists[0].Count; i++)
{
bool isEqual = true;
string compareElemment = XPathElementsLists[0][i];
foreach (List<string> element in XPathElementsLists)
{
if (!String.Equals(compareElemment, element))
{
isEqual = false;
break;
}
}
if (!isEqual)
{
break;
}
resultElements.Add(compareElemment);
}
result = String.Join(">", resultElements.ToArray());