Why is commenting out multiline comments in C++ inconsistent?

Question

So we know that

// This doesn't affect anything

/*
This doesn't affect anything either
*/

/*
/* /* /*
This doesn't affect anything
*/
This does because comments aren't recursive

/* /*
This doesn't affect anything
*/ */
This throws an error because the second * / is unmatched since comments aren't recursive

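I've heard that the reason comments aren't recursive is that nesting would slow down the compiler, which I suppose makes sense. However, nowadays when I parse C++ code in a higher-level language (say Python), I can simply use the regular expression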

"\/[\/]+((?![\n])[\s\S])*\r*\n"

to match // single line comments, and use

"\/\*((?!\*\/)[\s\S])*\*\/"

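to match /* multiline comments */, then loop through all the single-line comments and remove them, then loop through all the multiline comments and remove them. Or vice versa. But this is where I'm stuck. It seems that doing one or the other is not enough, because: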

// /*
An error is thrown because the /* is ignored
*/

/*
This doesn't affect things because of mysterious reasons
// */

and

/*
This throws an error because the second * / is unmatched
// */ */

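What is the reason for this behavior? Is it also an artifact of the way the compiler parses things? To be clear, I don't want to change the behavior of C++; I just want to know the reasoning behind the second set of examples.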

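Edit:

So yes, to be more explicit, my question is why the following three (seemingly reasonable) ways of interpreting this behavior don't work:

Simply ignore all characters on a line after //, regardless of whether they are /* or */, even if you are inside a multiline comment.
Allow a /* or */ that is followed by // to still take effect.
Both of the above.

I understand why nested comments aren't allowed, since they would require a stack and an arbitrarily large amount of memory. But these three cases wouldn't.

Edit again:

If anyone is interested, here is Python code that extracts the comments from a C/C++ file, following the correct comment rules discussed here: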

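import re

# Scanner rules are tried in order at each position; comments are kept as
# (kind, text) tokens and every other character is discarded.
commentScanner = re.Scanner([
  (r"\/[\/]+((?![\n])[\s\S])*\r*(\n{1})?", lambda scanner, token: ("//", token)),
  (r"\/\*((?!\*\/)[\s\S])*\*\/", lambda scanner, token: ("/* ... */", token)),
  (r"[\s\S]", lambda scanner, token: None)
])
# scan() returns (list of tokens, unmatched remainder of the input).
# Backslashes that are not part of an escape sequence are doubled to avoid invalid-escape warnings.
comments, remainder = commentScanner.scan("fds a45fsa//kjl fds4325lkjfa/*jfds/\nk\\lj\\/*4532jlfds5342a  l/*a/*b/*c\n//fdsafa\n\r\n/*jfd//a*/fd// fs54fdsa3\r\r//\r/*\r\n2a\n\n\nois")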

Answer 1

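It isn't inconsistent. The existing behavior is both easy to specify and easy to implement, and your compiler is implementing it correctly. See [lex.comment] in the standard.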

The characters /* start a comment, which terminates with the characters */. These comments do not nest. The characters // start a comment, which terminates with the next new-line character. If there is a form-feed or a vertical-tab character in such a comment, only white-space characters shall appear between it and the new-line that terminates the comment; no diagnostic is required. [ Note: The comment characters //, /*, and */ have no special meaning within a // comment and are treated just like other characters. Similarly, the comment characters // and /* have no special meaning within a /* comment. — end note ]

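As you can see, // can be used to comment out both /* and */. It's just that comments don't nest, so if the // is already inside a /* comment, the // has no effect at all.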
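To make the quoted rules concrete, here is a minimal Python sketch of a left-to-right scanner that applies them. The function name `strip_comments` is made up for illustration, and the sketch assumes the input contains no string or character literals (a real lexer has to track those as well).

```python
def strip_comments(source: str) -> str:
    """Remove // and /* */ comments, scanning once from left to right.

    Assumption: string and character literals are not handled.
    """
    out = []
    i = 0
    n = len(source)
    while i < n:
        if source.startswith("//", i):
            # Line comment: runs up to (not including) the next new-line.
            # Any /* or */ inside it is just ordinary text.
            end = source.find("\n", i)
            i = n if end == -1 else end
        elif source.startswith("/*", i):
            # Block comment: ends at the first */ after the opener.
            # Any // or /* inside it is just ordinary text (no nesting).
            end = source.find("*/", i + 2)
            if end == -1:
                raise ValueError("unterminated /* comment")
            out.append(" ")  # keep a space so neighbouring tokens stay separate
            i = end + 2
        else:
            out.append(source[i])
            i += 1
    return "".join(out)
```

Run on the examples from the question, the `// /*` line disappears as one line comment (so the later `*/` is left over as code), while in `/* ... // */` the `//` is inert and the `*/` closes the comment.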

Answer 2

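When a comment starts, everything up to the end of that comment is treated as comment.

So zero // one */ two, taken on its own, is ambiguous: zero // one */ could be the end of a /* */ comment that began on an earlier line, leaving two outside the comment; or // one */ two could be a new single-line comment, leaving zero outside it.

In principle, // was not a valid token or sequence of tokens in C, so no C program had // outside of a comment or a string.

Inside a comment, however, // was legal. So a header file containing: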

/* this is a C style comment
// with some cool
// slashes */

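would break if the // commented out the trailing */. So inside a /* */ comment, // is ignored; compatibility with C is not to be broken without good reason.

And inside a // comment, everything is ignored until the end of the line. No sneaking in a /* or */.

The parsing rule is really simple: start a comment, slurp and discard until you see the end marker (a newline or */, as appropriate), then carry on parsing.

Since C++ was not designed to be parsed with regular expressions, your difficulty in parsing it that way was either not considered or not considered important.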
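As a rough illustration of why stripping one comment style and then the other goes wrong, the sketch below compares both two-pass orders with a single left-to-right pass. The patterns are simplified stand-ins of my own rather than the exact expressions from the question, and string literals are ignored.

```python
import re

LINE = re.compile(r"//[^\n]*")                      # a // comment up to the new-line
BLOCK = re.compile(r"/\*[\s\S]*?\*/")               # a /* comment up to the first */
EITHER = re.compile(r"//[^\n]*|/\*[\s\S]*?\*/")     # whichever opener comes first

def lines_then_blocks(src: str) -> str:
    return BLOCK.sub("", LINE.sub("", src))

def blocks_then_lines(src: str) -> str:
    return LINE.sub("", BLOCK.sub("", src))

def single_pass(src: str) -> str:
    # One scan: the comment that starts first swallows the other delimiters.
    return EITHER.sub("", src)

# The // sits inside a block comment, so the */ after it is the real terminator.
inside_block = "/*\nhidden\n// */\ncode\n"
# Both delimiters sit inside line comments, so "code" is live code.
inside_lines = "// /*\ncode\n// */\n"
```

With these inputs, `lines_then_blocks(inside_block)` deletes the `*/` and leaves `hidden` behind an unterminated `/*`, while `blocks_then_lines(inside_lines)` deletes `code`. Only `single_pass` gets both right, because it follows the "start a comment, discard until its end marker" rule.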

Answer 3

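Yes, it's as though everything inside a comment is just text, but once you strip away the comment delimiters, the exposed text gets parsed again. So if part of that text contains literal comment delimiters, they are parsed as new comment delimiters.

It's always first come, first served, i.e. left-to-right order.

Treating this as just parsing comments is probably too simplistic. The fact is that quotes (single/double) have to be parsed at the same time, and whichever comes first, comment or quote, is served first.

Finally, everything inside a comment being skipped means that if you remove the outer comment layer, whatever is left that isn't a valid comment will be parsed as part of the language. So the form of any exposed comment text is indeterminate, and a parse error is very likely, if not inevitable.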
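The point about quotes can be shown with a small sketch. It assumes simplified patterns of my own for comments and literals (no raw strings, no line continuations) and compares a comment-only substitution with one that also matches literals, so that whichever starts first wins.

```python
import re

COMMENT = r"//[^\n]*|/\*[\s\S]*?\*/"
# Double- or single-quoted literal with backslash escapes (simplified).
STRING = r'"(?:\\.|[^"\\\n])*"' + "|" + r"'(?:\\.|[^'\\\n])*'"
TOKEN = re.compile(f"(?P<comment>{COMMENT})|(?P<string>{STRING})")

def naive_strip(src: str) -> str:
    # Looks only for comments, so a // inside a string is misread.
    return re.sub(COMMENT, " ", src)

def strip_comments_keep_strings(src: str) -> str:
    # Literals are matched too and copied through unchanged.
    def replace(match: re.Match) -> str:
        return " " if match.group("comment") is not None else match.group(0)
    return TOKEN.sub(replace, src)

sample = 'const char* url = "http://example.com"; // home page\n'
```

Here `naive_strip(sample)` cuts the string literal off at `http:`, whereas `strip_comments_keep_strings(sample)` keeps the URL and drops only the trailing comment.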

I also believe C++ has a line-continuation form for // style comments (a backslash right before the newline).
For example:

// single line continuation\
continuation               \  
end here 
code
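One way to handle such continuations, sketched here in Python, is to splice backslash-newline pairs before looking for comments, which mirrors the order of the C++ translation phases. The helper names are hypothetical, and the sketch assumes the backslash is immediately followed by the new-line (no trailing spaces) and that no string literals are involved.

```python
import re

LINE_COMMENT = re.compile(r"//[^\n]*")

def splice_lines(src: str) -> str:
    # Delete every backslash that is immediately followed by a new-line,
    # joining the physical lines into one logical line.
    return src.replace("\\\r\n", "").replace("\\\n", "")

def strip_line_comments(src: str) -> str:
    # Splice first, so a continued // comment is seen as a single line.
    return LINE_COMMENT.sub("", splice_lines(src))

sample = "// single line continuation\\\ncontinuation\\\nend here\ncode\n"
```

Applying `LINE_COMMENT` directly to `sample` removes only the first physical line and leaves `continuation` behind as if it were code; `strip_line_comments(sample)` removes all three comment lines and keeps `code`.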

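So the formula for parsing C++ comments with a regex is that you have to parse (match) every character in the file.
If you go straight for the comments, the match will land in the wrong place.

Below is a good regex for parsing comments. I originally got it from a Perl group
and modified it slightly for single-line comments and continuations.
With it you can either remove the comments or just find them.

Raw regex: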

   # (/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)


   (                                # (1 start), Comments 
        /\*                              # Start /* .. */ comment
        [^*]* \*+
        (?: [^/*] [^*]* \*+ )*
        /                                # End /* .. */ comment
     |  
        //                               # Start // comment
        (?: [^\\] | \\ \n? )*?           # Possible line-continuation
        \n                               # End // comment
   )                                # (1 end)
|  
   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  [\S\s]                           # Any other char
        [^/"'\\]*                        # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)
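As a hedged usage sketch in Python: the one-line form of the expression is used, with the doubled backslashes written out. Group 1 holds comments and group 2 holds everything else, so replacing each match with group 2 deletes the comments. The function names are made up for illustration.

```python
import re

# Group 1: comments. Group 2: string/char literals or any other run of text.
COMMENT_OR_TEXT = re.compile(
    r'''(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/|//(?:[^\\]|\\\n?)*?\n)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|[\S\s][^/"'\\]*)'''
)

def remove_comments(src: str) -> str:
    # Keep group 2; a match of group 1 (a comment) is replaced by nothing.
    return COMMENT_OR_TEXT.sub(lambda m: m.group(2) or "", src)

def find_comments(src: str) -> list:
    # Collect only the matches where group 1 (a comment) took part.
    return [m.group(1) for m in COMMENT_OR_TEXT.finditer(src) if m.group(1)]

sample = 'int a = 1; // one\nchar* s = "/* not a comment */"; /* two */\n'
```

Note that in this raw form a // comment match includes its terminating newline, so removing it joins the following line onto the current one.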

Enhanced (preserves formatting), mostly used for deleting comments.
Use multi-line mode:

   # ((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)

   (                                # (1 start), Comments 
        (?:
             (?: ^ [ \t]* )?                  # <- To preserve formatting
             (?:
                  /\*                              # Start /* .. */ comment
                  [^*]* \*+
                  (?: [^/*] [^*]* \*+ )*
                  /                                # End /* .. */ comment
                  (?:                              # <- To preserve formatting 
                       [ \t]* \r? \n                                      
                       (?=
                            [ \t]*                  
                            (?: \r? \n | /\* | // )
                       )
                  )?
               |  
                  //                               # Start // comment
                  (?:                              # Possible line-continuation
                       [^\\] 
                    |  \\ 
                       (?: \r? \n )?
                  )*?
                  (?:                              # End // comment
                       \r? \n                               
                       (?=                              # <- To preserve formatting
                            [ \t]*                          
                            (?: \r? \n | /\* | // )
                       )
                    |  (?= \r? \n )
                  )
             )
        )+                               # Grab multiple comment blocks if need be
   )                                # (1 end)

|                                 ## OR

   (                                # (2 start), Non - comments 
        "
        (?: \\ [\S\s] | [^"\\] )*        # Double quoted text
        "
     |  '
        (?: \\ [\S\s] | [^'\\] )*        # Single quoted text
        ' 
     |  (?: \r? \n | [\S\s] )            # Linebreak or Any other char
        [^/"'\\\s]*                      # Chars which don't start a comment, string, escape,
                                         # or line continuation (escape + newline)
   )                                # (2 end)
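A minimal sketch of how the enhanced expression might be applied from Python, assuming the one-line form with its doubled backslashes restored and compiled with `re.MULTILINE` so that `^` matches at each line start. As before, group 2 is kept and group 1 is dropped; the function name is hypothetical.

```python
import re

# Group 1: one or more adjacent comments. Group 2: literals or other text.
ENHANCED = re.compile(
    r'''((?:(?:^[ \t]*)?(?:/\*[^*]*\*+(?:[^/*][^*]*\*+)*/(?:[ \t]*\r?\n(?=[ \t]*(?:\r?\n|/\*|//)))?|//(?:[^\\]|\\(?:\r?\n)?)*?(?:\r?\n(?=[ \t]*(?:\r?\n|/\*|//))|(?=\r?\n))))+)|("(?:\\[\S\s]|[^"\\])*"|'(?:\\[\S\s]|[^'\\])*'|(?:\r?\n|[\S\s])[^/"'\\\s]*)''',
    re.MULTILINE,
)

def remove_comments_keep_layout(src: str) -> str:
    # Replace every match with group 2, which is empty for comment matches.
    return ENHANCED.sub(lambda m: m.group(2) or "", src)

sample = "int a; // one\nint b; /* two */\n"
```

Unlike the raw expression, a trailing // comment here stops before its newline, so each statement stays on its own line after the comments are removed.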

c++

regex

compiler-construction

comments