Complex string splitting
I have a string like the following:
[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)
You can think of it as this tree:
- [Testing.User]
- Info
    - [Testing.Info]
    - Name
        - [System.String]
        - Matt
    - Age
        - [System.Int32]
        - 21
- Description
    - [System.String]
    - This is some description
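As you can see, it is a string serialization/representation of the class Testing.User.
I want to be able to split it and get the following elements in the resulting array: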
[0] = [Testing.User]
[1] = Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
[2] = Description:([System.String]|This is some description)
I can't split on |, because that would result in:
[0] = [Testing.User]
[1] = Info:([Testing.Info]
[2] = Name:([System.String]
[3] = Matt)
[4] = Age:([System.Int32]
[5] = 21))
[6] = Description:([System.String]
[7] = This is some description)
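How can I get the expected result?
I'm not very good with regular expressions, but I know they are a very possible solution here.
This isn't a great/robust solution, but if you know that your three top-level items are fixed, you can hard-code them into your regular expression.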
(\[Testing\.User\])\|(Info:.*)\|(Description:.*)
As you would expect, this regex produces one match with three groups. You can test it here:
http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
Edit: here is a complete C# example
using System;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            const string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
            const string pattern = @"(\[Testing\.User\])\|(Info:.*)\|(Description:.*)";
            var match = Regex.Match(input, pattern);
            if (match.Success)
            {
                // Group 0 is the whole match, so start at group 1.
                for (int i = 1; i < match.Groups.Count; i++)
                {
                    Console.WriteLine("[" + i + "] = " + match.Groups[i]);
                }
            }
            Console.ReadLine();
        }
    }
}
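Assuming your groups can be described as
- [Anything.Anything]
- Anything:ReallyAnything (letters and digits only, then a colon, then any number of characters) after the first pipe
- Anything:ReallyAnything (letters and digits only, then a colon, then any number of characters) after the last pipe
then you have a pattern like this: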
"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";
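(\[\w+\.\w+\])
This capture group gets "[Testing.User]", but it is not limited to "[Testing.User]".
\|(\w+:.+)
This capture group takes the data after the first pipe and, because .+ is greedy, stops before the last pipe that is followed by a name and a colon. In this case that is "Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))", but it is not limited to values starting with "Info:".
\|(\w+:.+)
The same kind of capture group as before, but it captures everything after that last pipe, in this case "Description:([System.String]|This is some description)", but it is not limited to values starting with "Description:".
Now, if you were to add another pipe followed by more data (|Anything:SomeData), then Description: would become part of group 2, and group 3 would now be "Anything:SomeData".
Here is the code: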
using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        String text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        // Verbatim string (@"...") so the backslashes reach the regex engine unchanged;
        // without the @ the C# compiler rejects escapes such as \[ and \w.
        String pattern = @"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";
        Match match = Regex.Match(text, pattern);
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1]);
            Console.WriteLine(match.Groups[2]);
            Console.WriteLine(match.Groups[3]);
        }
    }
}
Result:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
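See a working example here: https://dotnetfiddle.net/DYcZuY
For what happens if I add another field that follows the pattern format, see this working example: https://dotnetfiddle.net/Mtc1CD
For this you need balancing groups, a regex feature specific to the .NET regex engine. It is a counter system: the counter is incremented when an opening parenthesis is found and decremented when a closing parenthesis is found, and then you only have to test whether the counter is empty to know whether the parentheses are balanced.
With a .NET regex, this is the only way to be sure whether you are inside or outside the parentheses: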
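To see the group shift described above without opening the fiddle, here is a minimal sketch that runs the same pattern against the original string and against the string with one extra field appended. The class name GreedyGroupDemo, the helper Capture, and the field Extra:([System.Int32]|5) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyGroupDemo
{
    // The input from the question.
    public const string Original =
        "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";

    // Same pattern as in the answer, written as a verbatim string.
    private const string Pattern = @"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";

    // Returns groups 1 to 3 of the first match, or an empty array if nothing matches.
    public static string[] Capture(string text)
    {
        Match match = Regex.Match(text, Pattern);
        if (!match.Success)
            return new string[0];
        return new[]
        {
            match.Groups[1].Value,
            match.Groups[2].Value,
            match.Groups[3].Value
        };
    }

    public static void Main()
    {
        // Hypothetical fourth field, added only to show how the greedy groups move.
        string extended = Original + "|Extra:([System.Int32]|5)";

        foreach (string text in new[] { Original, extended })
        {
            string[] groups = Capture(text);
            for (int i = 0; i < groups.Length; i++)
                Console.WriteLine("[" + (i + 1) + "] = " + groups[i]);
            Console.WriteLine();
        }
    }
}
```

With the extra field appended, group 3 should hold Extra:([System.Int32]|5) while group 2 runs from Info: through the end of the Description value, which is why this pattern only suits input with exactly three top-level items.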
using System;
using System.Text.RegularExpressions;

public class Example
{
    public static void Main()
    {
        string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        string pattern = @"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+";
        // Each match is one top-level item.
        foreach (Match m in Regex.Matches(input, pattern))
            Console.WriteLine(m.Value);
    }
}
Pattern details:
(?:
    [^|()]+               # everything that is not a parenthesis or a pipe
  |                       # OR
    # content between parentheses (possibly nested)
    \(                    # opening parenthesis
        # here is the way to obtain balanced parens
        (?>               # content between parens
            [^()]+        # everything that is not a parenthesis
          |               # OR
            (?<Open>[(])  # an opening parenthesis (increment the counter)
          |
            (?<-Open>[)]) # a closing parenthesis (decrement the counter)
        )*                # repeat as needed
        (?(Open)(?!))     # make the pattern fail if the counter is not zero
    \)
)+
(?(Open)(?!)) is a conditional statement.
(?!) is a subpattern that always fails (an empty negative lookahead), which means: not followed by nothing.
This pattern matches every run of characters that are not pipes, together with any balanced parenthesized blocks.
Using a regex lookahead
You can use a regex like this:
(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)
The idea behind this regex is to capture what you want in groups 1, 2 and 3.
Match information
MATCH 1
1. [0-14] `[Testing.User]`
MATCH 2
2. [15-88] `Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))`
MATCH 3
3. [89-143] `Description:([System.String]|This is some description)`
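In C#, the lookahead pattern above produces three separate matches, and each match fills exactly one of the three groups. The following is a rough sketch of how the pieces could be collected into a list under that assumption; the names LookaheadSplit and Split are hypothetical, and the pattern still relies on the last field being called Description.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadSplit
{
    // The lookahead pattern from the answer, as a verbatim string.
    private const string Pattern =
        @"(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)";

    // Collects, for every match, the value of whichever group (1, 2 or 3) took part in it.
    public static List<string> Split(string input)
    {
        var parts = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            for (int g = 1; g <= 3; g++)
            {
                if (m.Groups[g].Success)
                {
                    parts.Add(m.Groups[g].Value);
                    break; // only one alternative can match at a time
                }
            }
        }
        return parts;
    }

    public static void Main()
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        foreach (string part in Split(input))
            Console.WriteLine(part);
    }
}
```

Reading the group value rather than the whole match matters for the second alternative, because that match also consumes the pipe in front of Description:.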
Plain regex
On the other hand, if you don't like the regex above, you can use another one like this:
(\[.*?])\|(.*)\|(Description:.*)
Or, to force at least one character:
(\[.+?])\|(.+)\|(Description:.+)
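Regular expressions are not the best approach for this kind of problem; you will probably want to write some code to parse your data. I made a simple example that handles this simple case of yours. The basic idea is that you only need to split on | when it is not inside parentheses, so I keep track of the parenthesis count. You would still need to handle some special cases, for example where parentheses are part of the description text, but as I said, this is only a starting point: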
using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    // Yields the top-level segments of input, splitting on '|' only outside parentheses.
    static IEnumerable<String> splitSpecial(string input)
    {
        StringBuilder builder = new StringBuilder();
        int openParenthesisCount = 0;
        foreach (char c in input)
        {
            if (openParenthesisCount == 0 && c == '|')
            {
                // Top-level separator: emit the segment collected so far.
                yield return builder.ToString();
                builder.Clear();
            }
            else
            {
                if (c == '(')
                    openParenthesisCount++;
                if (c == ')')
                    openParenthesisCount--;
                builder.Append(c);
            }
        }
        // Emit the last segment.
        yield return builder.ToString();
    }

    static void Main(string[] args)
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        foreach (String split in splitSpecial(input))
        {
            Console.WriteLine(split);
        }
        Console.ReadLine();
    }
}
Output:
[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
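Since the counting approach is only a starting point, one possible next step is to reject input whose parentheses do not balance instead of silently returning odd segments. The sketch below is one way that could look. It is not part of the original answer, the names TopLevelSplitter and SplitChecked are invented here, and it still assumes that parentheses never appear as literal text inside a value.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TopLevelSplitter
{
    // Splits on '|' at nesting depth 0 and throws FormatException on unbalanced parentheses.
    public static List<string> SplitChecked(string input)
    {
        var parts = new List<string>();
        var current = new StringBuilder();
        int depth = 0;

        foreach (char c in input)
        {
            if (c == '|' && depth == 0)
            {
                // Top-level separator: close the current segment.
                parts.Add(current.ToString());
                current.Clear();
                continue;
            }

            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                depth--;
                if (depth < 0)
                    throw new FormatException("Unexpected ')' without a matching '('.");
            }
            current.Append(c);
        }

        if (depth != 0)
            throw new FormatException("Missing ')' at the end of the input.");

        parts.Add(current.ToString());
        return parts;
    }

    public static void Main()
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        foreach (string part in SplitChecked(input))
            Console.WriteLine(part);
    }
}
```

Returning a list instead of using yield return means the whole input is validated before any segment reaches the caller, at the cost of lazy evaluation.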
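There are already enough splitting answers, so here is another approach. If your input represents a tree structure, why not parse it into a tree?
The following code was automatically translated from VB.NET, but as far as I have tested it, it should work.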
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Treeparse
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
            var t = StringTree.Parse(input);
            Console.WriteLine(t.ToString());
            Console.ReadKey();
        }
    }

    public class StringTree
    {
        //Branching constants
        const string BranchOff = "(";
        const string BranchBack = ")";
        const string NextTwig = "|";

        //Content of this twig
        public string Text;
        //List of Sub-Twigs
        public List<StringTree> Twigs;

        [System.Diagnostics.DebuggerStepThrough()]
        public StringTree()
        {
            Text = "";
            Twigs = new List<StringTree>();
        }

        private static void ParseRecursive(StringTree Tree, string InputStr, ref int Position)
        {
            do {
                StringTree NewTwig = new StringTree();
                // Read characters until the next delimiter or the end of the input
                do {
                    NewTwig.Text = NewTwig.Text + InputStr[Position];
                    Position += 1;
                } while (!(Position == InputStr.Length || (new String[] { BranchBack, BranchOff, NextTwig }.ToList().Contains(InputStr[Position].ToString()))));
                Tree.Twigs.Add(NewTwig);
                // "(" opens a sub-branch: parse it recursively, then step past its ")"
                if (Position < InputStr.Length && InputStr[Position].ToString() == BranchOff) { Position += 1; ParseRecursive(NewTwig, InputStr, ref Position); Position += 1; }
                if (Position < InputStr.Length && InputStr[Position].ToString() == BranchBack)
                    break; // the current branch is closed (was "Exit Do" in the VB.NET source)
                Position += 1;
            } while (!(Position >= InputStr.Length || InputStr[Position].ToString() == BranchBack));
        }

        /// <summary>
        /// Call this to parse the input into a StringTree object using recursion
        /// </summary>
        public static StringTree Parse(string Input)
        {
            StringTree t = new StringTree();
            t.Text = "Root";
            int Start = 0;
            ParseRecursive(t, Input, ref Start);
            return t;
        }

        private void ToStringRecursive(ref StringBuilder sb, StringTree tree, int Level)
        {
            // Indent each line according to its depth in the tree
            for (int i = 1; i <= Level; i++)
            {
                sb.Append(" ");
            }
            sb.AppendLine(tree.Text);
            int NextLevel = Level + 1;
            foreach (StringTree NextTree in tree.Twigs)
            {
                ToStringRecursive(ref sb, NextTree, NextLevel);
            }
        }

        public override string ToString()
        {
            var sb = new System.Text.StringBuilder();
            ToStringRecursive(ref sb, this, 0);
            return sb.ToString();
        }
    }
}
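You get the value of each node and its associated child values in a tree structure, and you can then use it however you like, for example to display the structure in a TreeView control.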