Parsing token strings in C

Question

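I am trying to parse a CSV file in C. I scan each line of the file into an array called lines, and that part works. Then I check each character in the line to see whether it is a comma (44).

I'm having trouble with the final else statement, which is supposed to start a new token when there is a comma.

The first token of each line is always read correctly, but the rest are not (strange symbols/characters show up in the output). I tried removing the '\0' statement because I wasn't sure I needed it, but I got the same problem. I'm guessing this is some kind of undefined behavior, but I'm not sure.

Thanks!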

//[rows = num strings] [max num chars per string]
int max_len = 21;
int num_strings = 12;
char lines[num_strings][max_len];

//Open file
data = fopen("data.txt", "r");

//Check if file opened correctly
if (data == NULL) { 
    printf ("File did not open correctly.\n");
}

//Parse each token
char tokens[60][21];
int counter = 0;
//Read each line
for(int i=0; i<num_strings; i++)
{
    //Scan line into lines[i]
    fscanf(data, "%s", lines[i]);

    printf("\nThis line = %s\n",lines[i]);

    //Read each char in line
    for(int j=0; j<strlen(lines[i]); j++)
    {
        char *c = &lines[i][j];
        //printf("Current char of line: %c\n", c[0]);

        //If it's not a comma (or null character), add to current token
        if(c[0] != 44) {
           tokens[counter][j] = c[0];
        } else {//If it is, terminate string and go to next token
            tokens[counter][j] = '\0';
            printf("This token = %s\n",tokens[counter]);
            counter++;
        }
    }
}

Answer 1

Your code has a few problems. I'll start by giving you a working version of your program's inner loop:

    int tok_i = 0;
    int jmax = strlen(lines[i]) + 1;
    for(int j = 0; j < jmax; j++)
    {
        char *c = &lines[i][j];
        //printf("Current char of line: %c\n", c[0]);

        //If it's not a comma (or null character), add to current token
        if(c[0] != 44 && c[0] != '\0') {
            tokens[counter][tok_i] = c[0];
            tok_i++;
        } else {//If it is, terminate string and go to next token
            tokens[counter][tok_i] = '\0';
            printf("This token = %s\n",tokens[counter]);
            counter++;
            tok_i = 0;
        }
    }

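The main reason your code doesn't work is that you are writing to tokens[counter][j], where j is your current position in the line. That is fine for the first token of a line, whose first character is also the first character of the line, but for every later token the first character sits somewhere inside the line, where j is not 0!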

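To fix this, I've just added another counter, tok_i, that tracks our position within the current token. It is incremented every time we see a character that is not a comma or a null, and it is reset to 0 whenever we do find a comma or a null, because the next iteration starts a new token.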

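With this approach we have to check explicitly for the \0 character at the end of the string, and this is where a second problem becomes apparent. strlen gives the length of the string excluding the \0 character. Since we want to iterate over the line including the \0 character, we need to set the end condition of the for loop to j < strlen(lines[i]) + 1.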

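You'll also notice that calling strlen in the loop condition makes little sense: strlen(lines[i]) doesn't change during the loop, yet we are asking for it to be evaluated on every iteration, which wastes a bit of time. The compiler may well fix this for us, but just in case, we fix it ourselves by evaluating the loop bound once, outside the loop, and storing it in the variable jmax.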

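Other problems: fscanf(data, "%s", lines[i]); only works if the line you are scanning contains no whitespace, because %s stops at the first whitespace character. fgets is normally used in this situation; it reads the whole line, whitespace included.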
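To make the fgets suggestion concrete, here is a minimal sketch. The helper name read_line is my own invention, not something from the question. It reads one line, spaces included, and strips the line ending that fgets leaves in the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the question): read one line from fp into
   buf, keeping any spaces, and strip the trailing line ending.
   Returns 1 on success, 0 on end of file or read error. */
int read_line(FILE *fp, char *buf, size_t size)
{
    if (fgets(buf, (int)size, fp) == NULL) {
        return 0;
    }
    /* fgets keeps the '\n' when it fits; cut at the first CR or LF, if any. */
    buf[strcspn(buf, "\r\n")] = '\0';
    return 1;
}
```

In the question's loop this would be called as read_line(data, lines[i], sizeof lines[i]). Note that a line longer than the buffer is not rejected; the remainder is simply returned by the next call.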

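Also, hard-coding the number of lines in the input file is unnecessary, though it is acceptable if you are very sure of the length of the input.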

Answer 2

My advice is to draw a picture of your strings. Suppose you have this line and you reach the first comma:

      .          1         2
      .01234567890123456789012
 i -> |aaaa,bbb,cccccc,dddd,e\0
      .    ^ 
           j

This is the tokens array:

          01234      
 counter |aaaa\0

Now you increment counter, but j keeps going, so the next time you will have:

      .          1         2
      .01234567890123456789012
 i -> |aaaa,bbb,cccccc,dddd,e\0
      .        ^ 
               j

The next row in the tokens array will be:

            01234 567     
           |aaaa\0 
   counter |????? bbb\0

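Not exactly what you wanted, right?

You should find another way to copy the characters into the tokens array.

May I suggest that, if all you need is to fill the tokens array, you could drop the lines array entirely and read the file one character at a time?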
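A rough sketch of that character-at-a-time idea, assuming the same 60 x 21 tokens array as in the question. The function name read_tokens and the two macros are made up for illustration. It treats both commas and newlines as token separators and silently truncates tokens that do not fit.

```c
#include <stdio.h>

#define MAX_TOKENS 60      /* assumed: same dimensions as the question's array */
#define MAX_TOKEN_LEN 21

/* Hypothetical helper: fill tokens straight from the stream, one character
   at a time, with no intermediate lines array.
   Returns the number of tokens stored. */
int read_tokens(FILE *fp, char tokens[][MAX_TOKEN_LEN], int max_tokens)
{
    int counter = 0;   /* index of the current token */
    int tok_i = 0;     /* write position inside the current token */
    int c;             /* int, not char, so EOF can be detected */

    while (counter < max_tokens && (c = fgetc(fp)) != EOF) {
        if (c == ',' || c == '\n') {
            tokens[counter][tok_i] = '\0';   /* finish the current token */
            counter++;
            tok_i = 0;
        } else if (c != '\r' && tok_i < MAX_TOKEN_LEN - 1) {
            tokens[counter][tok_i++] = (char)c;
        }
        /* characters that do not fit are dropped (token is truncated) */
    }

    /* The last token may not be followed by a newline. */
    if (tok_i > 0 && counter < max_tokens) {
        tokens[counter][tok_i] = '\0';
        counter++;
    }
    return counter;
}
```

Because a newline ends a token just like a comma does, a blank line in the input produces an empty token; whether that is acceptable depends on your data.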

Also, I assume this is just for practice, since you don't mention the fact that a CSV file can contain commas inside quoted strings:

  aaaa,"bb,bb",ccc

has three fields.
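If you do want to handle that case, one possible sketch is to keep an in_quotes flag while scanning the line. The function name split_csv_line and the FIELD_LEN macro are assumptions of mine. This does not handle escaped quotes ("" inside a quoted field) or fields that span several lines, so treat it as a starting point rather than a full CSV parser.

```c
#include <stddef.h>

#define FIELD_LEN 21   /* assumed: same token width as in the question */

/* Hypothetical helper: split one line into fields, treating commas inside
   double quotes as ordinary data. Quote characters are not copied.
   Returns the number of fields stored. */
int split_csv_line(const char *line, char fields[][FIELD_LEN], int max_fields)
{
    int count = 0;       /* index of the current field */
    int pos = 0;         /* write position inside the current field */
    int in_quotes = 0;   /* nonzero while between a pair of double quotes */

    if (max_fields <= 0) {
        return 0;
    }

    for (size_t i = 0; ; i++) {
        char c = line[i];
        if (c == '"') {
            in_quotes = !in_quotes;
        } else if (c == '\0' || (c == ',' && !in_quotes)) {
            fields[count][pos] = '\0';   /* finish the current field */
            count++;
            pos = 0;
            if (c == '\0' || count == max_fields) {
                break;
            }
        } else if (pos < FIELD_LEN - 1) {
            fields[count][pos++] = c;    /* overlong fields are truncated */
        }
    }
    return count;
}
```

For the example line above this gives the three fields aaaa, bb,bb and ccc.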
