Complex string splitting

I have a string like the following:

[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)

You can think of it as this tree:

- [Testing.User]
- Info
        - [Testing.Info]
        - Name
                - [System.String]
                - Matt
        - Age
                - [System.Int32]
                - 21
- Description
        - [System.String]
        - This is some description

As you can see, it is a string serialization/representation of the class Testing.User.

I want to be able to split it and get the following elements in the resulting array:

 [0] = [Testing.User]
 [1] = Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
 [2] = Description:([System.String]|This is some description)

I can't split on | because that would result in:

 [0] = [Testing.User]
 [1] = Info:([Testing.Info]
 [2] = Name:([System.String]
 [3] = Matt)
 [4] = Age:([System.Int32]
 [5] = 21))
 [6] = Description:([System.String]
 [7] = This is some description)

How can I get my expected result?

I'm not very good with regular expressions, but I'm aware they are a very likely solution here.

This isn't a great or robust solution, but if you know that your three top-level items are fixed, you can hard-code them into your regex.

(\[Testing\.User\])\|(Info:.*)\|(Description:.*)

As you would expect, this regex produces a single match with three groups. You can test it here: http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx

Edit: here is a complete C# example

using System;
using System.Text.RegularExpressions;

namespace ConsoleApplication3
{
    internal class Program
    {
        private static void Main(string[] args)
        {
            const string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
            const string pattern = @"(\[Testing\.User\])\|(Info:.*)\|(Description:.*)";

            var match = Regex.Match(input, pattern);
            if (match.Success)
            {
                for (int i = 1; i < match.Groups.Count; i++)
                {
                    Console.WriteLine("[" + i + "] = " + match.Groups[i]);
                }
            }

            Console.ReadLine();
        }
    }
}

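Assuming your groups can be described as

  1. [Anything.Anything]
  2. Anything:ReallyAnything (letters and numbers only, then ":", then any number of characters) after the first pipe
  3. Anything:ReallyAnything (letters and numbers only, then ":", then any number of characters) after the last pipe

then you have a pattern like this: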

"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";
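  • (\[\w+\.\w+\]) This capture group gets "[Testing.User]", but it is not limited to "[Testing.User]".
  • \|(\w+:.+) This capture group gets the data after the first pipe and stops before the last pipe (more precisely, the last pipe that is followed by a word and a colon). In this case that is "Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))", but it is not limited to text starting with "Info:".
  • \|(\w+:.+) The same capture group as before, but it captures everything after that last pipe. In this case that is "Description:([System.String]|This is some description)", but it is not limited to text starting with "Description:".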

Now, if you were to add another pipe followed by more data (|Anything:SomeData), then Description: would become part of group 2, and group 3 would now be "Anything:SomeData".
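The group shift described above can be reproduced with a short sketch. The input below is hypothetical: the Extra field is made up for illustration, and the class and method names are not from the original answer. The pattern itself is the one shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyGroupsDemo
{
    // Same pattern as above. The greedy ".+" in group 2 runs as far as it can,
    // so group 3 always starts after the last pipe followed by "word:".
    public const string Pattern = @"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";

    // Returns the three captured groups, or an empty array when nothing matches.
    public static string[] TopLevelGroups(string text)
    {
        Match match = Regex.Match(text, Pattern);
        if (!match.Success)
            return new string[0];
        return new[] { match.Groups[1].Value, match.Groups[2].Value, match.Groups[3].Value };
    }

    public static void Main()
    {
        // Hypothetical input: the trailing "Extra" field is an assumption for this demo.
        string text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt))"
                    + "|Description:([System.String]|Some text)|Extra:([System.Int32]|7)";

        foreach (string group in TopLevelGroups(text))
            Console.WriteLine(group);
    }
}
```

With this input, group 2 holds both the Info and the Description fields joined by a pipe, and group 3 holds only the Extra field. The same effect would appear if the description text itself contained a pipe followed by a word and a colon, so the pattern is best treated as suitable for exactly three top-level items.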

The code looks like this:

using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        String text = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        // Verbatim string (@"...") so the regex backslashes are not treated as C# escape sequences.
        String pattern = @"(\[\w+\.\w+\])\|(\w+:.+)\|(\w+:.+)";

        Match match = Regex.Match(text, pattern);
        if (match.Success)
        {
            Console.WriteLine(match.Groups[1]);
            Console.WriteLine(match.Groups[2]);
            Console.WriteLine(match.Groups[3]); 
        }
    }
}

Result:

[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)

See a working example here: https://dotnetfiddle.net/DYcZuY

For what happens when another field following the same format is added, see this working example: https://dotnetfiddle.net/Mtc1CD

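To do this you need balancing groups, a regex feature specific to the .NET regex engine. It works like a counter: each opening parenthesis increments it, each closing parenthesis decrements it, and you then only need to test whether the counter is back to zero (the capture stack is empty) to know whether the parentheses are balanced. Within a regex, this is the only way to be sure whether you are inside or outside parentheses: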

using System;
using System.Text.RegularExpressions;

public class Example
{
   public static void Main()
   {
       string input = @"[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";

       string pattern = @"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+";

       foreach (Match m in Regex.Matches(input, pattern)) 
           Console.WriteLine(m.Value);
   }
}


Pattern details:

(?:
    [^|()]+    # all that is not a parenthesis or a pipe
  |            # OR
               # content between parenthesis (eventually nested)
    \(              # opening parenthesis
     # here is the way to obtain balanced parens
    (?> # content between parens
        [^()]+        # all that is not parenthesis 
      |               # OR
        (?<Open>[(])  # an opening parenthesis (increment the counter)
      |
        (?<-Open>[)]) # a closing parenthesis (decrement the counter)
    )*  # repeat as needed
    (?(Open)(?!)) # make the pattern fail if the counter is not zero

    \)
)+

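(?(Open)(?!)) is a conditional statement: if the Open group still holds a capture, the (?!) branch is tried.

(?!) is a subpattern that always fails (an empty negative lookahead), meaning: not followed by nothing.

The pattern as a whole matches every run of characters that are not pipes, together with any balanced parenthesized strings.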
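As a minimal sketch of the counter idea described above, the same balancing-group construct can also be anchored to the whole string to check that the parentheses are balanced before splitting. The class and method names (BalancedSplit, IsBalanced, SplitTopLevel) and the decision to throw on unbalanced input are assumptions made for this illustration, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class BalancedSplit
{
    // Whole-string check: push on "(", pop on ")", and fail if a capture is left over.
    // A ")" with nothing to pop cannot be consumed, so "$" is never reached.
    private static readonly Regex Balanced = new Regex(
        @"^(?>[^()]+|(?<Open>\()|(?<-Open>\)))*(?(Open)(?!))$");

    // The splitting pattern from the answer above, unchanged.
    private static readonly Regex TopLevel = new Regex(
        @"(?:[^|()]+|\((?>[^()]+|(?<Open>[(])|(?<-Open>[)]))*(?(Open)(?!))\))+");

    public static bool IsBalanced(string input)
    {
        return Balanced.IsMatch(input);
    }

    // Returns the top-level pieces; rejects input whose parentheses do not balance.
    public static List<string> SplitTopLevel(string input)
    {
        if (!IsBalanced(input))
            throw new FormatException("Unbalanced parentheses in input.");

        var parts = new List<string>();
        foreach (Match m in TopLevel.Matches(input))
            parts.Add(m.Value);
        return parts;
    }

    public static void Main()
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";

        foreach (string part in SplitTopLevel(input))
            Console.WriteLine(part);

        Console.WriteLine(IsBalanced("Info:([Testing.Info]|Name:(Matt)")); // False: one "(" is never closed
    }
}
```

Checking balance first keeps the splitting pattern from silently returning fragments when a parenthesis is missing.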

Using a regex lookahead

You can use a regex like this:

(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)


The idea behind this regex is to capture the content you want in groups 1, 2 and 3.

You can see this easily in the match information:

MATCH 1
1.  [0-14]   `[Testing.User]`
MATCH 2
2.  [15-88]  `Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))`
MATCH 3
3.  [89-143] `Description:([System.String]|This is some description)`
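Because this pattern is an alternation, it produces three separate matches, each filling a different group, rather than one match with three groups. A minimal sketch of reading the result from C# follows; the class and method names are assumptions made for this illustration, while the pattern is the one shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LookaheadCapture
{
    // The lookahead pattern from above. Each match fills exactly one of the three groups.
    public const string Pattern = @"(\[.*?])|(\w+:.*?)\|(?=Description:)|(Description:.*)";

    // Collects whichever group succeeded in each match, in order of appearance.
    public static List<string> Capture(string input)
    {
        var result = new List<string>();
        foreach (Match m in Regex.Matches(input, Pattern))
        {
            for (int i = 1; i < m.Groups.Count; i++)
            {
                if (m.Groups[i].Success)
                    result.Add(m.Groups[i].Value);
            }
        }
        return result;
    }

    public static void Main()
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";

        foreach (string item in Capture(input))
            Console.WriteLine(item);
    }
}
```

Reading the group value rather than the match value matters for the second match, since the match itself also consumes the pipe in front of Description:.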

Plain regex

On the other hand, if you don't like the regex above, you can use another one like this:

(\[.*?])\|(.*)\|(Description:.*)


Or, to require at least one character in each part:

(\[.+?])\|(.+)\|(Description:.+)

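A regex is not the best approach for this kind of problem; you will probably need to write some code to parse your data. I wrote a simple example that handles this simple case. The basic idea is that you only split on | when it is not inside parentheses, so I keep track of the parenthesis count. You would still need to handle some special cases, for example parentheses that are part of the description text, but as I said, this is only a starting point: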

using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    // Yields the pieces of input separated by '|' characters that are not inside parentheses.
    static IEnumerable<String> splitSpecial(string input)
    {
        StringBuilder builder = new StringBuilder();
        int openParenthesisCount = 0;

        foreach (char c in input)
        {
            if (openParenthesisCount == 0 && c == '|')
            {
                // Top-level pipe: emit the piece collected so far.
                yield return builder.ToString();
                builder.Clear();
            }
            else
            {
                if (c == '(')
                    openParenthesisCount++;
                if (c == ')')
                    openParenthesisCount--;
                builder.Append(c);
            }
        }
        // Emit the last piece, which has no trailing pipe.
        yield return builder.ToString();
    }

    static void Main(string[] args)
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        foreach (String split in splitSpecial(input))
        {
            Console.WriteLine(split);
        }
        Console.ReadLine();
    }
}

Output:

[Testing.User]
Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))
Description:([System.String]|This is some description)
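The same parenthesis-counting idea can be applied again to the text inside each Name:(...) wrapper to reach the nested levels. The sketch below is one possible way to do that and is not part of the original answer: the NestedSplit class and its Flatten method are hypothetical, and it assumes that every nested element has the form Name:(content) and that leaf values never contain ":(" while also ending in ")".

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class NestedSplit
{
    // Same idea as splitSpecial above: split on '|' only at parenthesis depth 0.
    public static List<string> SplitTopLevel(string input)
    {
        var parts = new List<string>();
        var builder = new StringBuilder();
        int depth = 0;

        foreach (char c in input)
        {
            if (c == '|' && depth == 0)
            {
                parts.Add(builder.ToString());
                builder.Clear();
                continue;
            }
            if (c == '(') depth++;
            if (c == ')') depth--;
            builder.Append(c);
        }
        parts.Add(builder.ToString());
        return parts;
    }

    // Adds one line per node to 'lines', indented two spaces per nesting level.
    public static void Flatten(string input, int level, List<string> lines)
    {
        string indent = new string(' ', level * 2);
        foreach (string part in SplitTopLevel(input))
        {
            int open = part.IndexOf(":(", StringComparison.Ordinal);
            bool wrapped = open > 0 && part.EndsWith(")", StringComparison.Ordinal);
            if (!wrapped)
            {
                lines.Add(indent + part); // leaf value such as "[System.String]" or "Matt"
                continue;
            }
            lines.Add(indent + part.Substring(0, open)); // the name before ":("
            // Strip the "Name:(" prefix and the final ")" and recurse into the content.
            string inner = part.Substring(open + 2, part.Length - open - 3);
            Flatten(inner, level + 1, lines);
        }
    }

    public static void Main()
    {
        string input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
        var lines = new List<string>();
        Flatten(input, 0, lines);
        foreach (string line in lines)
            Console.WriteLine(line);
    }
}
```

For the sample input this prints an indented outline with the same shape as the tree in the question. It shares the limitation noted above: parentheses inside the description text would throw off the depth count.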

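There are already more than enough splitting answers, so here is another approach. If your input represents a tree structure, why not parse it into a tree? The following code was automatically translated from VB.NET, but as far as I have tested it, it should work.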

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace Treeparse
{
    class Program
    {
        static void Main(string[] args)
        {
            var input = "[Testing.User]|Info:([Testing.Info]|Name:([System.String]|Matt)|Age:([System.Int32]|21))|Description:([System.String]|This is some description)";
            var t = StringTree.Parse(input);
            Console.WriteLine(t.ToString());
            Console.ReadKey();
        }
    }

    public class StringTree
    {
        //Branching constants
        const string BranchOff = "(";
        const string BranchBack = ")";
        const string NextTwig = "|";

        //Content of this twig
        public string Text;
        //List of Sub-Twigs
        public List<StringTree> Twigs;
        [System.Diagnostics.DebuggerStepThrough()]
        public StringTree()
        {
            Text = "";
            Twigs = new List<StringTree>();
        }

        private static void ParseRecursive(StringTree Tree, string InputStr, ref int Position)
        {
            do {
                StringTree NewTwig = new StringTree();
                do {
                    NewTwig.Text = NewTwig.Text + InputStr[Position];
                    Position += 1;
                } while (!(Position == InputStr.Length || (new String[] { BranchBack, BranchOff, NextTwig }.ToList().Contains(InputStr[Position].ToString()))));
                Tree.Twigs.Add(NewTwig);
                if (Position < InputStr.Length && InputStr[Position].ToString() == BranchOff) { Position += 1; ParseRecursive(NewTwig, InputStr, ref Position); Position += 1; }
                if (Position < InputStr.Length && InputStr[Position].ToString() == BranchBack)
                    break; // TODO: might not be correct. Was : Exit Do
                Position += 1;
            } while (!(Position >= InputStr.Length || InputStr[Position].ToString() == BranchBack));
        }

        /// <summary>
        /// Call this to parse the input into a StringTree objects using recursion
        /// </summary>
        public static StringTree Parse(string Input)
        {
            StringTree t = new StringTree();
            t.Text = "Root";
            int Start = 0;
            ParseRecursive(t, Input, ref Start);
            return t;
        }

        private void ToStringRecursive(ref StringBuilder sb, StringTree tree, int Level)
        {
            for (int i = 1; i <= Level; i++)
            {
                sb.Append("   ");
            }
            sb.AppendLine(tree.Text);
            int NextLevel = Level + 1;
            foreach (StringTree NextTree in tree.Twigs)
            {
                ToStringRecursive(ref sb, NextTree, NextLevel);
            }
        }

        public override string ToString()
        {
            var sb = new System.Text.StringBuilder();
            ToStringRecursive(ref sb, this, 0);
            return sb.ToString();
        }

    }
}

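You get the value of every node and its associated child values in a tree structure, and you can then use it however you like, for example to display the structure in a TreeView control.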