How to split lines from a txt file into a DataTable

Question

I have a text file of client data that looks like this:

    :client objects (
    : (ThomasSmith
                :AdminInfo (
                    :client_uid ("{C6DD9C9C-964A-4BE5-30F1-3D64A87F73A6}")
                    :nickName (Tom)
                    )

                :addr ("1234 Pear Street")
                :city (Charlotte)
                :state (NC)
                :zip (12345)
                :phone ("555-555-5555")
                :email ("tom@someemailaddress.com")
                :gender (male)

            )       

    : (Jonathan Thomson
                :AdminInfo (
                    :client_uid ("{C6DD9C9C-964A-4BE5-30F1-3D64A87F73A7}")
                    :nickName (John)
                    )

                :addr ("5678 Apple Street")
                :city ("New York")
                :state (
                    :AdminInfo (
                    :chkpf_uid(:""{ B056A094-3164-42C9-888F-11071C1FCD9B}"")
                    :global_level(1)
                )
)
                :zip (56789)
                :phone ("555-444-6666")
                :email ("John@someemailaddress.com")
            )
    )

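I need to be able to parse each client's section into a list or a DataTable. What I'm stuck on is starting to read the file at nameofclient and stopping when that client ends, without grabbing data from nameofclient2. Is there a way to stop reading my file when a specific word or pattern comes up? One issue I don't know how to address is that each client may have a different number of attributes, so I can't hard-code a number of lines; I'll have to look for ":([a-z]" or something like it. Ideally, I would like the result in a DataTable formatted something like this: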

    Name of customer | Attribute    
    ------------------------------
    Customer1        | Address(XXXXXX)
    Customer1        | ZipCode(XXXXXX)
    Customer1        | Etc...
    Customer2        | .....
    Customer2        | .....

Anyway, I'm new to coding and don't have enough experience yet to get this working. Here is what I've tried:

 public partial class WebForm1 : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        Main();
    }

    static void Main()
    {
        ruleset rs = new ruleset();
        System.IO.StreamReader br = new System.IO.StreamReader("f");

        string line = string.Empty;         
        bool GroupTrue = false;
        int numObjects1 = 0;          
        string cGroupName = "";

        while ((line = br.ReadLine()) != null)
        {
            if (line.Contains(":(client_objects"))
            {
                GroupTrue = true;

                string[] tempArray = line.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);
                cGroupName = tempArray[tempArray.Length - 1];
            }
            else if (GroupTrue && !Regex.IsMatch(line, "") && (numObjects1 < 50))
            {
                numObjects1 = numObjects1 + 1;

                cGroup cGroup = new cGroup(cGroupName, line);


                rs.addGroups(cGroup);
            }
            else if (GroupTrue && Regex.IsMatch(line, ".*\b.*"))
            {
                GroupTrue = false;
            }
        }
    }

}

public class cGroup
{
    public string attribute;
    public string groups;

    public cGroup(String cGroupName, String line)
    {
        this.groups = cGroupName;
        this.attribute = line;
    }

}
public class ruleset
{
    //cGroup cResult = new cGroup();
    public List<cGroup> cGroups = new List<cGroup>();
    public void addGroups(cGroup cGroups)
    {
        this.cGroups.Add(cGroups);
    }
}

Answer 1

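I'd suggest using Regex to process your file. Whenever you are trying to pull string data out based on a pattern, it is the obvious winner.

Unfortunately, getting it right can be fairly complicated. Head over to Regexr to experiment and pick up some reference information.

For example, \((.*?)\) will grab all the values inside parentheses.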
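To make that concrete, here is a minimal sketch of running that pattern over a single line. The class and method names (ParenValues, Extract) are made up for illustration. Note that the lazy .*? stops at the first closing parenthesis, so this only suits flat lines such as ":city (Charlotte)", not nested blocks such as ":AdminInfo ( ... )", and quoted values keep their quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ParenValues
{
    // Returns the text between each "(" and the nearest following ")".
    // Hypothetical helper, not part of any library.
    public static List<string> Extract(string line)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(line, @"\((.*?)\)"))
        {
            // Groups[1] is the content of the single capture group.
            values.Add(m.Groups[1].Value);
        }
        return values;
    }

    public static void Main()
    {
        foreach (string v in Extract(":city (Charlotte)  :state (NC)"))
        {
            Console.WriteLine(v);
        }
    }
}
```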

Answer 2

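I assume you don't mean stop reading entirely, but rather pause reading and then do some work on the preceding batch of lines. To do that, you could do something like the following: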

public bool MatchesMyCondition(string line) {...}
public void DoSomething(List<string> lines) {...}

List<string> lines = new List<string>();
string line;

System.IO.StreamReader file = new System.IO.StreamReader("myFile.txt");
while((line = file.ReadLine()) != null)
{
    if (MatchesMyCondition(line))
    {
       DoSomething(lines);
       lines.Clear();
    }
    else
    {
        lines.Add(line);
    }
}
//handle the last items
DoSomething(lines);

As Shenku said, someRegex.IsMatch(line) is the most general way to look for something in a line, but line.Contains(someString) may be enough.
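As a rough, runnable variation on the loop above: the sketch below assumes a client block starts on a line shaped like ": (Name" (a colon, whitespace, then an opening parenthesis), which holds for the sample in the question but is a guess about the real file. Unlike the loop above, it keeps the header line as the first entry of each batch so the client name is not lost. The names ClientBatches and Split are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

public static class ClientBatches
{
    // Assumption: a new client starts at ": (Name"; attribute lines such as
    // ":zip (12345)" have no whitespace right after the colon, so they do not match.
    static readonly Regex ClientStart = new Regex(@"^\s*:\s+\(");

    // Groups the non-blank, trimmed lines into one list per client.
    // Lines before the first client header are skipped.
    public static List<List<string>> Split(TextReader reader)
    {
        var batches = new List<List<string>>();
        List<string> current = null;
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (ClientStart.IsMatch(line))
            {
                current = new List<string>();
                batches.Add(current);
            }
            if (current != null && line.Trim().Length > 0)
            {
                current.Add(line.Trim());
            }
        }
        return batches;
    }

    public static void Main()
    {
        string sample = ":client objects (\n: (ThomasSmith\n  :zip (12345)\n  )\n"
                      + ": (Jonathan Thomson\n  :zip (56789)\n  )\n)";
        foreach (List<string> batch in Split(new StringReader(sample)))
        {
            Console.WriteLine(batch[0] + " -> " + (batch.Count - 1) + " more line(s)");
        }
    }
}
```

For a real file, pass a StreamReader (inside a using block) instead of the StringReader.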

Answer 3

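I understand the bias against regex, since people are loath to learn the basics. It helps to stick to these basic tenets (and to avoid .* in a regex, which consumes everything):

Use + (one or more) instead of * (zero or more); use * only sparingly.
( ) is a basic match capture; what we are interested in goes inside the parentheses.
(?<{Name is here}> ) is a named match capture, which makes extracting the matched data easier.
[^ ]+ is the "not" set: consume until you hit one of the characters listed after the ^.

With these rules, we build up the pattern one rule at a time and find what I call anchors in the data. Anchors are where we can guide the regex parser, and named match captures are how we consume the data.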
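Before looking at a full pattern, here is a small sketch of those tenets applied to a single attribute line such as :addr ("1234 Pear Street"). The ":" and "(" act as anchors, negated sets stop at whitespace or at a quote or ")", and named captures hold the key and value. AttributeLine and TryParse are illustrative names only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttributeLine
{
    static readonly Regex Pattern = new Regex(
        @":(?<Key>[^\s]+)        # anchor ':' then the key, up to the first whitespace
          \s+\(\x22?             # anchor '(' and an optional double quote (\x22)
          (?<Value>[^\x22\)]+)   # the value: everything up to a quote or ')'
          \x22?\)                # optional closing quote, then ')'",
        RegexOptions.IgnorePatternWhitespace);

    // Returns true and fills key/value when the line looks like ":key (value)".
    public static bool TryParse(string line, out string key, out string value)
    {
        Match m = Pattern.Match(line);
        key = m.Success ? m.Groups["Key"].Value : null;
        value = m.Success ? m.Groups["Value"].Value : null;
        return m.Success;
    }

    public static void Main()
    {
        string key, value;
        if (TryParse(":addr (\"1234 Pear Street\")", out key, out value))
        {
            Console.WriteLine(key + " = " + value);
        }
    }
}
```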

Pattern

Here is the pattern in a C# variable.

string pattern = @"
:\s+\(                 # Anchor text of Operation Start
(?<Name>[^\r\n]+)  # Named capture into `Name` match capture.
[^:]+:AdminInfo[^:]+   # More whitespace to admin and into first admin node.
    (                    # 1 to many admin nodes start.
      :                  # Anchor for admin node
      (?<ADKey>[^\s]+)       # Node key name into `ADKey` match capture
        \s+\(\x22?            # Anchor of `(` and possible quote (\x22) Start
       (?<ADValue>[^\x22\)]+) # Value of admin node
      \x22?\)\s+              # Anchor optional quote and `)` End
     )+                  # 1 to many admin nodes end
 \)                    # Close of Admin Info
 (                   # 1 to many nodes start.
    [^:]+:           # Consume whitespace and `:` anchor
    (?<Key>[^\s]+)      # Node name into match capture group `Key`
        \s+\(\x22?       # Anchor of `(` and possible quote (\x22) start
    (?<Value>[^\x22\)]+) # Value of admin node
      \x22?\)\s+         # Anchor End
  )+            # 1 to many nodes end
\s*\)           # Close of whole operation END";

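Note the named match captures Name, ADKey, ADValue, Key, and Value. On a match-by-match basis (each match is one person), we extract that person's name. ADKey, ADValue, Key, and Value will then hold four separate lists of named match values. These represent the key-value pairs of the data, which we will Zip (you are using .NET 4, right?) into dictionaries of key-value pairs.

C# Linq logic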

// Ignore pattern whitespace only allows us to comment the pattern
// it does not affect regex parsing.
// Explicit capture says only keep the items which fall within `(` and `)` for the final result.
// It is used to streamline the process somewhat for we don't need all the extraneous text/space.
Regex.Matches(text, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
     .OfType<Match>()
     .Select (mt => new
     {
        Name      = mt.Groups["Name"].Value,
        AdminInfo = mt.Groups["ADKey"].Captures
                                      .OfType<Capture>()
                                      .Select (cp => cp.Value)
                                      .Zip(mt.Groups["ADValue"].Captures.OfType<Capture>().Select (cp => cp.Value),
                                           (k,v) => new {key = k, value = v})
                                      .ToDictionary (cp => cp.key, cp => cp.value ),
        Nodes     = mt.Groups["Key"].Captures
                                      .OfType<Capture>()
                                      .Select (cp => cp.Value)
                                      .Zip(mt.Groups["Value"].Captures.OfType<Capture>().Select (cp => cp.Value),
                                           (k,v) => new {key = k, value = v})
                                      .ToDictionary (cp => cp.key, cp => cp.value ),

    })

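This creates individual data entities, where each match is projected (that is what Select does: it projects from one form into another) into an entity with the properties Name, AdminInfo, and Nodes. AdminInfo and Nodes are dictionaries holding one to many key-value pairs. Run against the data (below) in LINQPad, it produces one such entity per client.

Data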

string text = @":client objects (
: (ThomasSmith
            :AdminInfo (
                :client_uid (""{C6DD9C9C-964A-4BE5-30F1-3D64A87F73A6}"")
                :nickName (Tom)
                )

            :addr (""1234 Pear Street"")
            :city (Charlotte)
            :state (NC)
            :zip (12345)
            :phone (""555-555-5555"")
            :email (""tom@someemailaddress.com"")
            :gender (male)

        )

: (Jonathan Thomson
            :AdminInfo (
                :client_uid (""{C6DD9C9C-964A-4BE5-30F1-3D64A87F73A7}"")
                :nickName (John)
                )

            :addr (""5678 Apple Street"")
            :city (""New York"")
            :state (NY)
            :zip (56789)
            :phone (""555-444-6666"")
            :email (""John@someemailaddress.com"")
        )
";

I leave it to you to process the final entity results of the Regex.Matches call above.
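For that last step, one possible shape is sketched below, assuming each parsed client has been reduced to a name plus a dictionary of its nodes (the input type here is my own choice, not something the code above produces directly). It fills the two-column DataTable layout the question asks for. ClientTable and Build are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Data;

public static class ClientTable
{
    // Builds one row per (client, attribute) pair.
    // Assumption: each entry is client name -> dictionary of node key/value pairs.
    public static DataTable Build(
        IEnumerable<KeyValuePair<string, Dictionary<string, string>>> clients)
    {
        var table = new DataTable("Clients");
        table.Columns.Add("Name of customer", typeof(string));
        table.Columns.Add("Attribute", typeof(string));

        foreach (var client in clients)
        {
            foreach (var node in client.Value)
            {
                // Format mirrors the question's example, e.g. "zip(12345)".
                table.Rows.Add(client.Key, node.Key + "(" + node.Value + ")");
            }
        }
        return table;
    }

    public static void Main()
    {
        var clients = new List<KeyValuePair<string, Dictionary<string, string>>>
        {
            new KeyValuePair<string, Dictionary<string, string>>(
                "ThomasSmith",
                new Dictionary<string, string> { { "city", "Charlotte" }, { "zip", "12345" } })
        };

        foreach (DataRow row in Build(clients).Rows)
        {
            Console.WriteLine(row[0] + " | " + row[1]);
        }
    }
}
```

With the LINQ projection above, each entity's Name and Nodes could be wrapped in a KeyValuePair before calling Build; AdminInfo could be added the same way if those values belong in the table too.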


Tags: .net, c#, parsing, string-split
