Strings with space between them

Question

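I am trying to write a lexical analyzer for C tokens by building a DFA for each token and simulating them in C. Currently I am trying to recognize string literals. By definition, a string literal is a sequence of characters enclosed between " characters. Consider the following program: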

#include<stdio.h>
int main()
{
    char *a = "Hello "


    "World";
    printf("%s",a);
}

Output:

Hello World

So now I am confused about whether I should treat "Hello " and "World" as separate tokens, or treat Hello World as a single combined token. Thanks! :)

Answer 1

In comments I wrote:

"Hello" and "World" are two separate tokens. That's a lexical analysis consideration. When they appear as consecutive tokens, they represent two parts of a single string literal. That's a semantic consideration -- i.e. what that combination of tokens means in C source code.

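That describes the matter from the point of view of conventional, general compiler construction. The distinction is, for example, between what would likely be expressed in a lex scanner definition and what would likely be handled in a yacc parser description (in terms of traditional tools).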
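To make the scanner side of that distinction concrete, here is a minimal sketch of a hand-simulated DFA that recognizes one string-literal token at a time. The function name scan_string_literal and the state names are my own, not from any tool or from the standard, and the sketch assumes no encoding prefixes (such as L"...") and that line splicing has already been done.

```c
#include <stddef.h>

/* States of a small DFA for a C string literal. Prefixes such as L"..."
   and trigraphs are deliberately ignored in this sketch. */
enum state { S_START, S_BODY, S_ESCAPE, S_DONE, S_ERROR };

/* Hypothetical helper: returns the length of the string-literal token that
   starts at src (including both quotes), or 0 if no complete literal
   starts there. */
size_t scan_string_literal(const char *src)
{
    enum state st = S_START;
    size_t i = 0;

    while (st != S_DONE && st != S_ERROR) {
        char c = src[i];
        switch (st) {
        case S_START:
            st = (c == '"') ? S_BODY : S_ERROR;
            break;
        case S_BODY:
            if (c == '"')
                st = S_DONE;            /* closing quote ends the token */
            else if (c == '\\')
                st = S_ESCAPE;          /* next character is taken literally */
            else if (c == '\0' || c == '\n')
                st = S_ERROR;           /* unterminated literal */
            break;
        case S_ESCAPE:
            st = (c == '\0' || c == '\n') ? S_ERROR : S_BODY;
            break;
        default:
            break;
        }
        if (st != S_ERROR)
            i++;                        /* consume the character */
    }
    return (st == S_DONE) ? i : 0;
}
```

Called at the first quote of the declaration in the question, this stops at the closing quote of "Hello " and reports a token of length 8. The white space and "World" that follow are left for the next calls, so the scanner naturally produces two string-literal tokens.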

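In actuality, C defines a larger and more detailed set of "translation phases" for building an executable program from C sources (C99 5.1.1.2). In C's particular model of the process, "Hello" and "World" are separate preprocessing tokens, recognized in translation phase 3. They are concatenated into a single token in translation phase 6. All (remaining) preprocessing tokens are converted directly to tokens in translation phase 7. The resulting tokens are then the input to syntactic and semantic analysis (also part of phase 7).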
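As a rough illustration of the phase 6 step, the sketch below merges a run of adjacent string literals that are separated only by white space. It is an assumption-laden toy, not how any particular compiler does it: the names take_literal and concat_adjacent_literals are hypothetical, it works on raw text rather than on a token stream, and it copies escape sequences unchanged instead of converting them as phase 5 would.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: appends the body of the string literal at *p to out
   (raw, escape sequences are not interpreted) and advances *p past the
   closing quote. Returns 1 on success, 0 if no complete literal starts
   there or out is too small. */
static int take_literal(const char **p, char *out, size_t *n, size_t cap)
{
    const char *s = *p;

    if (*s != '"')
        return 0;
    s++;
    while (*s != '"') {
        if (*s == '\0' || *s == '\n')
            return 0;                   /* unterminated literal */
        if (*s == '\\' && s[1] != '\0') {
            if (*n + 1 >= cap)
                return 0;
            out[(*n)++] = *s++;         /* copy the backslash of an escape */
        }
        if (*n + 1 >= cap)
            return 0;                   /* keep room for the final NUL */
        out[(*n)++] = *s++;
    }
    *p = s + 1;
    return 1;
}

/* Mimics phase 6 on raw text: merges adjacent string literals separated
   only by white space. Returns how many literals were merged; out receives
   the combined body as a NUL-terminated string. */
size_t concat_adjacent_literals(const char *src, char *out, size_t cap)
{
    size_t n = 0, count = 0;

    if (cap == 0)
        return 0;
    for (;;) {
        size_t saved = n;

        while (isspace((unsigned char)*src))
            src++;                      /* white space only separates tokens */
        if (!take_literal(&src, out, &n, cap)) {
            n = saved;                  /* discard any partial copy */
            break;
        }
        count++;
    }
    out[n] = '\0';
    return count;
}
```

Given the text that follows the = sign in the question's declaration, this reports two literals merged and leaves Hello World in the output buffer, which matches the program's output.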

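C does not require implementations to actually perform translation (compilation) according to the given model with all its separate phases, and many do not. C requires only that the end result be as if the implementation behaved according to the model. In that sense, your question can be answered only with "it depends". As for the inferred, non-C-specific conceptualization of "what is a token", I stand by my original brief description as providing a useful mental model.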

c

compiler-construction

string