gawk or grep: single line and ungreedy

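I want to recursively print the class headers of all *.java files in all subdirectories that have more than two type parameters (i.e. the parameters inside <R ... H> in the sample below). One of the files looks like this (names shortened for brevity):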

multiple-lines.java

class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) { 
    // ... code ...
  }
}

Expected output:

ClazzA.java:10: class ClazzA<R extends A,
ClazzA.java:11:     S extends B<T>, T extends C<T>,
ClazzA.java:12:     U extends D, W extends E,
ClazzA.java:13:     X extends F, Y extends G, Z extends H>
ClazzA.java:14:     extends OtherClazz<S> implements I<T> {

But another one can also look like this:

single-line.java

class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) { 
    // ... code ...
  }
}

Expected output:

ClazzB.java:42: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

Files that should not be considered/printed:

X-no-parameter.java

class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) { 
    // ... code ...
  }
}

X-one-parameter.java

class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

X-two-parameters.java

class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

X-two-line-parameters.java

class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) { 
    // ... code ...
  }
}

Any whitespace in the files can be \s+. The extends [...] implements [...] immediately before the { is optional. extends [...] is also optional within each type parameter. For details, see The Java® Language Specification, 8.1. Class Declarations.

I'm using gawk in Git Bash:

$ gawk --version
GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)

with:

find . -type f -name '*.java' | xargs gawk -f ws-class-type-parameter.awk > ws-class-type-parameter.log

ws-class-type-parameter.awk:

# /start/ , /end/ ... pattern

#/class ClazzA<.*,.*/      , /{/  {    # 5 lines, OK for ClazzA, but in real it prints classes with 2 or less type parameters, too
#/class ClazzA<.*,.*,/     , /{/  {    # no line with ClazzA, since there's no second ',' on its first line
#/class ClazzA<.*,.*,/s    , /{/  {    # 500.000+(!) lines
#/class ClazzA<.*,.*,/s    , /{/U {    # 500.000+(!) lines
#/class ClazzA<.*,.*,/sU   , /{/U {    # 500.000+(!) lines
 /(?s)class ClazzA<.*,.*,/ , /{/  {    # no line

    match( FILENAME, "/.*/.." )
    print substr( FILENAME, RLENGTH ) ":" FNR ": " $0
}

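This finds all *.java files ... fine, and runs gawk on each of them ... fine, but the results of my attempts are shown as comments next to each pattern. Note: the literal ClazzA is only for testing and for the MCVE here. In reality it would be \w+, but with 500.000+ lines across thousands of files while testing...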

If I try it on regex101.com, it works. Well, sort of. I couldn't find how to define /start-regex/,/end-regex/ there, so I added another .* between the two.

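I took the flags from there, but couldn't find any description of whether gawk supports the flag syntax /.../sU , /.../U, so I just tried it. A now-deleted comment told me that no flavor of awk supports this.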
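One plausible explanation for the 500.000+ lines, offered as a sketch rather than a definitive diagnosis: since awk has no regex flags, the trailing s in /re/s is parsed as an unset variable that is concatenated to the 0/1 result of the match. That concatenation is a non-empty string ("0" or "1"), which awk treats as true, so the pattern selects every line. The two input lines and the /class/ pattern below are made up for the demonstration.

```shell
# Two made-up input lines; neither contains "class".
input='int a;
int b;'

# Plain regex pattern: selects nothing for this input.
plain=$(printf '%s\n' "$input" | awk '/class/')

# "Flagged" pattern: awk reads this as ($0 ~ /class/) concatenated with the
# unset variable s, i.e. the string "0" or "1", which is always true.
flagged=$(printf '%s\n' "$input" | awk '/class/s')

printf 'plain:   [%s]\n' "$plain"
printf 'flagged: [%s]\n' "$flagged"
```

If that reading is right, a range pattern whose start is always true restarts on every line, which would match the observed output of essentially every line in every file.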

I also tried grep:

$ grep --version
grep (GNU grep) 3.1
...
$ grep -nrPf types.grep *.java

types.grep:

(?s).*class\s+\w+\s*<.*,.*,.*>.*{

This only produces output for single-line.java.

(?s) is --perl-regexp, -P syntax, which grep --help claims is supported.

UPDATE

The solution in Ed Morton's answer works well, but it turned out that there are auto-generated files with methods like:

    /** more code before here */    
    public void setId(String value) {
        this.id = value;
    }

    /**
     * Gets a map that contains attributes that aren't bound to any typed property on this class.
     * 
     * <p>
     * the map is keyed by the name of the attribute and 
     * the value is the string value of the attribute.
     * 
     * the map returned by this method is live, and you can add new attribute
     * by updating the map directly. Because of this design, there's no setter.
     * 
     * 
     * @return
     *     always non-null
     */
    public Map<QName, String> getOtherAttributes() {
        return otherAttributes;
    }

for which the output is, for example:

AbstractAddressType.java:81:      * Gets a map that contains attributes that aren't bound to any typed property on this class.
AbstractAddressType.java:82:      * 
AbstractAddressType.java:83:      * <p>
AbstractAddressType.java:84:      * the map is keyed by the name of the attribute and 
AbstractAddressType.java:85:      * the value is the string value of the attribute.
AbstractAddressType.java:86:      * 
AbstractAddressType.java:87:      * the map returned by this method is live, and you can add new attribute
AbstractAddressType.java:88:      * by updating the map directly. Because of this design, there's no setter.
AbstractAddressType.java:89:      * 
AbstractAddressType.java:90:      * 
AbstractAddressType.java:91:      * @return
AbstractAddressType.java:92:      *     always non-null
AbstractAddressType.java:93:      */
AbstractAddressType.java:94:     public Map<QName, String> getOtherAttributes() {

and others with class comments and annotations, like:

/**
 * This class was generated by Apache CXF 3.3.4
 * 2020-11-30T12:03:21.251+01:00
 * Generated source version: 3.3.4
 *
 */
@WebService(targetNamespace = "urn:SZRServices", name = "SZR")
@XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})
public interface SZR {
// more code after here

with output such as:

SZR.java:13:  * This class was generated by Apache CXF 3.3.4
SZR.java:14:  * 2020-10-12T11:51:35.175+02:00
SZR.java:15:  * Generated source version: 3.3.4
SZR.java:16:  *
SZR.java:17:  */
SZR.java:18: @WebService(targetNamespace = "urn:SZRServices", name = "SZR")
SZR.java:19: @XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})

Using any POSIX awk in any shell on every UNIX box:

$ cat tst.awk
/[[:space:]]*class[[:space:]]*/ {
    inDef = 1
    fname = FILENAME
    sub(".*/","",fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS

    # Remove comments (not perfect but should work for 99.9% of cases)
    sub("//.*","")
    gsub("/[*]|[*]/","\n")
    gsub(/\n[^\n]*\n/,"")

    def = def $0 ORS
    if ( /{/ ) {
        if ( gsub(/,/,"&",def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}

$ find tmp -type f -name '*.java' -exec awk -f tst.awk {} +
multiple-lines.java:1: class ClazzA<R extends A,
multiple-lines.java:2:     S extends B<T>, T extends C<T>,
multiple-lines.java:3:     U extends D, W extends E,
multiple-lines.java:4:     X extends F, Y extends G, Z extends H>
multiple-lines.java:5:     extends OtherClazz<S> implements I<T> {
single-line.java:1: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

The above was run using this input:

$ head tmp/*
==> tmp/X-no-parameter.java <==
class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {

  public void method(Type<A, B> x) {
    // ... code ...
  }
}

==> tmp/X-one-parameter.java <==
class ClazzD<R extends A>  // only one type parameter
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-line-parameters.java <==
class ClazzF<R extends A,  // only two type parameters
    S extends B<T>>        // on two lines
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/X-two-parameters.java <==
class ClazzE<R extends A, S extends B<T>>  // only two type parameters
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
    // ... code ...
  }
}

==> tmp/multiple-lines.java <==
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

==> tmp/single-line.java <==
class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
    // ... code ...
  }
}

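The above is just a best effort without writing a parser for the language, and with only the OP's posted sample input/output to go on for what needs to be handled.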
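For the generated files described in the question's update, one option is to tighten the start pattern so that it only fires on a line that begins with optional modifiers followed by the keyword class and an identifier. This is a sketch under that assumption, not a parser: declarations that put an annotation on the same line, use other modifiers, or are interfaces are not covered. The script name tst2.awk, the scratch directory and the sample file names are hypothetical; the body of the script is the same as in tst.awk above.

```shell
# Hypothetical scratch directory and file names, for illustration only.
dir=$(mktemp -d)

cat > "$dir/tst2.awk" <<'EOF'
# Start only on a line that begins with optional modifiers followed by the
# keyword "class" and an identifier (assumption: declarations start a line).
/^[[:space:]]*((public|protected|private|abstract|final|static)[[:space:]]+)*class[[:space:]]+[[:alnum:]_]+/ {
    inDef = 1
    fname = FILENAME
    sub(".*/","",fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS
    sub("//.*","")
    gsub("/[*]|[*]/","\n")
    gsub(/\n[^\n]*\n/,"")
    def = def $0 ORS
    if ( /{/ ) {
        if ( gsub(/,/,"&",def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}
EOF

# Javadoc text and a method taken from the question's update.
cat > "$dir/Generated.java" <<'EOF'
    /**
     * Gets a map that contains attributes that aren't bound to any typed property on this class.
     * the map returned by this method is live, and you can add new attribute
     * by updating the map directly. Because of this design, there's no setter.
     */
    public Map<QName, String> getOtherAttributes() {
        return otherAttributes;
    }
EOF

# A shortened multi-line declaration with a modifier in front.
cat > "$dir/Multi.java" <<'EOF'
public class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E>
    extends OtherClazz<S> implements I<T> {
}
EOF

result=$(awk -f "$dir/tst2.awk" "$dir/Generated.java" "$dir/Multi.java")
printf '%s\n' "$result"
rm -rf "$dir"
```

With this start pattern the Javadoc line that merely contains the word class no longer opens a definition, so only the four header lines of Multi.java should be printed.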

Note: the presence of comments will cause these solutions to fail.

ripgrep (https://github.com/BurntSushi/ripgrep)

rg -nU --no-heading '(?s)class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java
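  • -n enables line numbering (this is the default if the output goes to a terminal)
  • -U enables multiline matching
  • --no-heading: by default, ripgrep displays matching lines grouped under the filename as a header; this option makes ripgrep behave like GNU grep, with a filename prefix on every output line
  • [^{]* is used instead of .* to prevent matching , and > elsewhere in the file; otherwise lines like public void method(Type<Q, R> x) { would be matched
  • the -m option can be used to limit the number of matches per input file, which has the additional benefit of not having to search the entire input file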

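If you use the above regex with GNU grep, note that:

  • grep matches only one line at a time. If you use the -z option, grep treats ASCII NUL as the record separator, which effectively lets you match across multiple lines, assuming the input has no NUL characters that would prevent such a match. Another effect of the -z option is that a NUL character is appended to each output result (this can be fixed by piping the results to tr '\0' '\n')
  • the -o option is needed to print only the matching portion, which means you won't get the line number prefix
  • -P isn't needed for the given task. grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java | tr '\0' '\n' will give you results similar to the ripgrep command. However, you won't get line number prefixes, the filename prefix will appear only once per matching portion instead of on every matching line, and you won't get the rest of the line before class and after {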
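The shape of that grep output can be seen on two of the sample files from the question. This is only a sketch assuming GNU grep; the scratch directory is hypothetical and the method bodies are trimmed.

```shell
# Hypothetical scratch directory holding two sample files from the question.
dir=$(mktemp -d)

cat > "$dir/multiple-lines.java" <<'EOF'
class ClazzA<R extends A,
    S extends B<T>, T extends C<T>,
    U extends D, W extends E,
    X extends F, Y extends G, Z extends H>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<Q, R> x) {
  }
}
EOF

cat > "$dir/X-two-parameters.java" <<'EOF'
class ClazzE<R extends A, S extends B<T>>
    extends OtherClazz<S> implements I<T> {

  public void method(Type<X, Y> x) {
  }
}
EOF

# -z: the whole file is one record, -o: print only the matched portion,
# tr: turn the NUL that terminates each result into a newline.
result=$(cd "$dir" &&
  grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java | tr '\0' '\n')

printf '%s\n' "$result"
rm -rf "$dir"
```

Only the ClazzA header should be reported, with the filename prefix on its first line and no line numbers; ClazzE is skipped because [^{]* cannot reach the comma inside the method signature.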