Why do 'grep "*.h"' and 'grep -E "*.h"' produce different output on the same file?

Question

Suppose the file contains the following:

abc.h
hello world

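grep "*.h" file and grep -E "*.h" file produce different output. As I understand it, they should be the same, since * is a regular expression metacharacter. Both should output abc.h.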

Output

grep "*.h" file     # ==> No output
grep -E "*.h" file  # ==> abc.h

Please help clarify this!

Answer 1

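*.h should not match either line. If it does, it is because your particular grep's extended regular expression engine handles boundaries or quantifiers differently. You may see this odd behavior with GNU grep, but BSD grep will correctly report grep: repetition-operator operand invalid.

What you probably mean is .*h, which matches both lines whether you use the BRE or the ERE engine. If you only want to match abc.h from the corpus provided, then you need: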

grep '\.h' /tmp/foo

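This matches any line containing a literal period followed by the letter h. You may even want to anchor it to the end of the line, so that you do not mistakenly capture text such as foo abc.h bar. For example: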

grep '\.h$' /tmp/foo
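The three patterns in this answer can be compared side by side. The following is a minimal sketch, assuming a POSIX shell, a temporary file created with mktemp in place of /tmp/foo, and a sample file holding exactly the two lines from the question with no trailing whitespace:

```shell
#!/bin/sh
# Recreate the two-line sample file from the question (the temporary path is an assumption).
sample=$(mktemp)
printf 'abc.h\nhello world\n' > "$sample"

# '.*h' means "any characters, then h": both lines contain an h.
both=$(grep -c '.*h' "$sample")

# '\.h' requires a literal period before the h: only abc.h qualifies.
literal=$(grep '\.h' "$sample")

# Anchoring with $ rejects lines where .h is not at the end of the line.
anchored=$(printf 'abc.h\nfoo abc.h bar\n' | grep '\.h$')

printf '%s matched %s line(s)\n' '.*h' "$both"
printf '%s matched: %s\n' '\.h' "$literal"
printf '%s matched: %s\n' '\.h$' "$anchored"

rm -f "$sample"
```

None of these patterns begins with *, so they should behave the same with or without -E.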

Answer 2

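-E - extended regular expressions, where * means the preceding item is matched zero or more times

-G (the default) - basic regular expressions, where a * at the very start of the pattern stands for just the * character (elsewhere it is a repetition operator, as with -E)

-P - Perl regular expressions, where * means the same as with -E, but *.h fails to compile because there is nothing to repeat (no character before the *). It uses libpcre: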

ldd /bin/grep 
    linux-vdso.so.1 (0x00007ffefddd4000)
    libpcre.so.1 => /lib64/libpcre.so.1 (0x0000003bd8a00000)
    libc.so.6 => /lib64/libc.so.6 (0x0000003bd6a00000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x0000003bd7200000)
    /lib64/ld-linux-x86-64.so.2 (0x0000003bd6600000)

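So, with GNU grep, grep -E "*.h" will match any string that has the sequence .h (any character followed by h); grep -G "*.h" will match any string that has the sequence *.h (a literal *, then any character, then h); and grep -P "*.h" will fail to compile.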
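The position-dependent meaning of * in a basic regular expression can be checked directly. This is a small sketch, assuming a grep that accepts -G (GNU and BSD grep do; it is not a POSIX option) and using made-up input lines. It leaves out -E "*.h" and -P "*.h" on purpose, since their behaviour depends on the implementation and build:

```shell
#!/bin/sh
# Away from the start of the pattern, * repeats the preceding item in both modes:
# 'ab*c' matches ac, abc and abbc.
bre_mid=$(printf 'ac\nabc\nabbc\n' | grep -G -c 'ab*c')
ere_mid=$(printf 'ac\nabc\nabbc\n' | grep -E -c 'ab*c')

# At the very start of a BRE, * is an ordinary character:
# '*.h' needs a literal *, then any one character, then h.
bre_lead=$(printf 'abc.h\n*.h\nx*yh\n' | grep -G '*.h')

printf 'BRE ab*c: %s line(s), ERE ab*c: %s line(s)\n' "$bre_mid" "$ere_mid"
printf 'BRE leading *:\n%s\n' "$bre_lead"
```

The line abc.h is not selected by the last command because it contains no * character.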

Answer 3

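POSIX defines the behaviour of (POSIX) regular expressions and defines both basic regular expressions (BRE) and extended regular expressions (ERE). Using grep -E requests ERE; without -E, you get BRE (and with -F, you get no regular expressions at all).

The POSIX definition of * in a BRE says: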

* The <asterisk> shall be special except when used:

In a bracket expression

As the first character of an entire BRE (after an initial '^', if any)

The POSIX definition of * in an ERE says:

*+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

If these characters appear first in an ERE, or immediately following a <vertical-line>, <circumflex>, or <left-parenthesis>

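In the question:

Using grep '*.h' uses a BRE, and the * appears first, so it is not a special character; the pattern matches a * followed by any character followed by h.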

This would be matched *Zh because the * and the h are separated by one character
This would not be matched because the * and the h are not separated by one character

Using grep -E '*.h' invokes undefined behaviour. Any result is valid.
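The BRE case can be reproduced with the two sample lines above. The sketch below assumes a POSIX shell. It also shows one way to sidestep the undefined ERE case: escaping the * as \*, which is an addition here rather than part of the original answer, makes the literal meaning explicit in both modes.

```shell
#!/bin/sh
# The two sample lines from the answer, one per line.
sample='This would be matched *Zh because the * and the h are separated by one character
This would not be matched because the * and the h are not separated by one character'

# BRE with a leading *: literal *, any one character, then h. Only the first line has "*Zh".
bre_count=$(printf '%s\n' "$sample" | grep -c '*.h')

# An escaped \* is a literal asterisk in both BRE and ERE, so nothing is left undefined.
bre_escaped=$(printf '%s\n' "$sample" | grep -c '\*.h')
ere_escaped=$(printf '%s\n' "$sample" | grep -E -c '\*.h')

printf 'leading *: %s, escaped (BRE): %s, escaped (ERE): %s\n' \
    "$bre_count" "$bre_escaped" "$ere_escaped"
```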

To reliably match abc.h and other alphanumeric file names ending in .h, you could use something like:

grep '[[:alnum:]]\.h'

There is no particular need to use * in this context; if you did, you might write one of these:

grep '^[[:alnum:]][[:alnum:]]*\.h$'
grep '^[[:alnum:]]\{1,\}\.h$'

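These look for lines consisting of one or more alphanumeric characters followed by . and h and the end of the line. If you dislike the character class notation (the [:alnum:] part), you could write: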

grep '^[a-zA-Z0-9][a-zA-Z0-9]*\.h$'
grep '^[a-zA-Z0-9]\{1,\}\.h$'

You could add the underscore if you like:

grep '^[[:alnum:]_][[:alnum:]_]*\.h$'
grep '^[a-zA-Z0-9_][a-zA-Z0-9_]*\.h$'

You could also use extended regular expressions (which require -E), for example:

grep -E '^[[:alnum:]_]+\.h$'
grep -E '^[a-zA-Z0-9_]+\.h$'

And so on. There are many options!
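As a rough check that these alternatives agree with one another, the sketch below runs several of them over a handful of made-up lines (the input names are assumptions, not from the question). The + form is run with -E, because + is only a repetition operator in extended regular expressions.

```shell
#!/bin/sh
# Hypothetical test input: one candidate name per line.
names=$(printf '%s\n' 'abc.h' 'hello world' 'foo abc.h bar' 'my_file.h' '.h')

# BRE: "one or more alphanumerics" written with * and with an interval expression.
bre_star=$(printf '%s\n' "$names" | grep '^[[:alnum:]][[:alnum:]]*\.h$')
bre_interval=$(printf '%s\n' "$names" | grep '^[[:alnum:]]\{1,\}\.h$')

# BRE with the underscore allowed, and the equivalent ERE using +.
bre_underscore=$(printf '%s\n' "$names" | grep '^[[:alnum:]_][[:alnum:]_]*\.h$')
ere_plus=$(printf '%s\n' "$names" | grep -E '^[[:alnum:]_]+\.h$')

printf 'alnum only:\n%s\n' "$bre_star"
printf 'with underscore:\n%s\n' "$ere_plus"
```

The bare .h line is rejected by every variant because each one requires at least one character before the period.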
