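gawk or grep: single line and ungreedy
I want to recursively print the class headers of all *.java files in sub-directories that have more than two type parameters (i.e. the parameters inside <R ... H> in the sample below). One of the files looks like this (names reduced for brevity):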
multiple-lines.java
class ClazzA<R extends A,
S extends B<T>, T extends C<T>,
U extends D, W extends E,
X extends F, Y extends G, Z extends H>
extends OtherClazz<S> implements I<T> {
public void method(Type<Q, R> x) {
// ... code ...
}
}
Expected output:
ClazzA.java:10: class ClazzA<R extends A,
ClazzA.java:11: S extends B<T>, T extends C<T>,
ClazzA.java:12: U extends D, W extends E,
ClazzA.java:13: X extends F, Y extends G, Z extends H>
ClazzA.java:14: extends OtherClazz<S> implements I<T> {
But another file might also look like this:
single-line.java
class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {
public void method(Type<Q, R> x) {
// ... code ...
}
}
Expected output:
ClazzB.java:42: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {
Files that should not be considered/printed:
X-no-parameter.java
class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {
public void method(Type<A, B> x) {
// ... code ...
}
}
X-one-parameter.java
class ClazzD<R extends A> // only one type parameter
extends OtherClazz<S> implements I<T> {
public void method(Type<X, Y> x) {
// ... code ...
}
}
X-two-parameters.java
class ClazzE<R extends A, S extends B<T>> // only two type parameters
extends OtherClazz<S> implements I<T> {
public void method(Type<X, Y> x) {
// ... code ...
}
}
X-two-line-parameters.java
class ClazzF<R extends A, // only two type parameters
S extends B<T>> // on two lines
extends OtherClazz<S> implements I<T> {
public void method(Type<X, Y> x) {
// ... code ...
}
}
All whitespace in the files can be \s+. The extends [...] and implements [...] immediately before the { are optional. The extends [...] in each type parameter is optional, too. For details, see The Java® Language Specification, 8.1. Class Declarations.
I'm using gawk in Git Bash:
$ gawk --version
GNU Awk 5.0.0, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
with:
find . -type f -name '*.java' | xargs gawk -f ws-class-type-parameter.awk > ws-class-type-parameter.log
and ws-class-type-parameter.awk:
# /start/ , /end/ ... pattern
#/class ClazzA<.*,.*/ , /{/ { # 5 lines, OK for ClazzA, but in real it prints classes with 2 or less type parameters, too
#/class ClazzA<.*,.*,/ , /{/ { # no line with ClazzA, since there's no second ',' on its first line
#/class ClazzA<.*,.*,/s , /{/ { # 500.000+(!) lines
#/class ClazzA<.*,.*,/s , /{/U { # 500.000+(!) lines
#/class ClazzA<.*,.*,/sU , /{/U { # 500.000+(!) lines
/(?s)class ClazzA<.*,.*,/ , /{/ { # no line
    match( FILENAME, "/.*/.." )
    print substr( FILENAME, RLENGTH ) ":" FNR ": " $0
}
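This finds all *.java files ... good, and executes gawk for each of them ... good, but you can see the results as comments after each of my attempts. Please note: the ClazzA literal is just for testing and for the MCVE here. In real use it would be \w+, but that gave 500.000+ lines from thousands of files while testing ...
If I try it at regex101.com, it works. Well, sort of. I couldn't find how to define /start-regex/,/end-regex/ there, so I added another .* in between.
I took the flags from there, but couldn't find a description of whether gawk supports the flag syntax /.../sU , /.../U, so I just tried it. A now-deleted comment told me that no flavour of awk supports this.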
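A likely explanation for the 500.000+ lines, sketched under the assumption that gawk parses /.../s like any other awk expression: the letter after the closing slash is read as an unset variable that is concatenated to the 0-or-1 match result, and the resulting non-empty string is always true. The input and the search text below are made up for the demonstration.

```shell
# "s" is not a regex flag here: /no-such-text/s is the match result (0)
# concatenated with the empty variable s, i.e. the string "0", which is true.
all=$(printf 'foo\nbar\nbaz\n' | awk '/no-such-text/s')

# Without the trailing letter the pattern is an ordinary regex test.
none=$(printf 'foo\nbar\nbaz\n' | awk '/no-such-text/')

printf 'with "s": [%s]\n' "$all"
printf 'without:  [%s]\n' "$none"
```

In a range pattern such as /.../s , /{/ this would make the start condition true on every line, which is consistent with nearly the whole input being printed.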
I also tried grep:
$ grep --version
grep (GNU grep) 3.1
...
$ grep -nrPf types.grep *.java
with types.grep:
(?s).*class\s+\w+\s*<.*,.*,.*>.*{
This results in output for single-line.java only.
(?s) is --perl-regexp (-P) syntax, which grep --help claims to support.
Update
The solution in Ed Morton's answer works great, but it turned out that there are auto-generated files with methods like:
/** more code before here */
public void setId(String value) {
this.id = value;
}
/**
* Gets a map that contains attributes that aren't bound to any typed property on this class.
*
* <p>
* the map is keyed by the name of the attribute and
* the value is the string value of the attribute.
*
* the map returned by this method is live, and you can add new attribute
* by updating the map directly. Because of this design, there's no setter.
*
*
* @return
* always non-null
*/
public Map<QName, String> getOtherAttributes() {
return otherAttributes;
}
for which the output is, e.g.:
AbstractAddressType.java:81: * Gets a map that contains attributes that aren't bound to any typed property on this class.
AbstractAddressType.java:82: *
AbstractAddressType.java:83: * <p>
AbstractAddressType.java:84: * the map is keyed by the name of the attribute and
AbstractAddressType.java:85: * the value is the string value of the attribute.
AbstractAddressType.java:86: *
AbstractAddressType.java:87: * the map returned by this method is live, and you can add new attribute
AbstractAddressType.java:88: * by updating the map directly. Because of this design, there's no setter.
AbstractAddressType.java:89: *
AbstractAddressType.java:90: *
AbstractAddressType.java:91: * @return
AbstractAddressType.java:92: * always non-null
AbstractAddressType.java:93: */
AbstractAddressType.java:94: public Map<QName, String> getOtherAttributes() {
and others with class comments and annotations, like:
/**
* This class was generated by Apache CXF 3.3.4
* 2020-11-30T12:03:21.251+01:00
* Generated source version: 3.3.4
*
*/
@WebService(targetNamespace = "urn:SZRServices", name = "SZR")
@XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})
public interface SZR {
// more code after here
for which the output is, e.g.:
SZR.java:13: * This class was generated by Apache CXF 3.3.4
SZR.java:14: * 2020-10-12T11:51:35.175+02:00
SZR.java:15: * Generated source version: 3.3.4
SZR.java:16: *
SZR.java:17: */
SZR.java:18: @WebService(targetNamespace = "urn:SZRServices", name = "SZR")
SZR.java:19: @XmlSeeAlso({at.gv.egov.pvp1.ObjectFactory.class, org.w3._2001._04.xmldsig_more_.ObjectFactory.class, ObjectFactory.class, org.xmlsoap.schemas.ws._2002._04.secext.ObjectFactory.class, org.w3._2000._09.xmldsig_.ObjectFactory.class, at.gv.e_government.reference.namespace.persondata._20020228_.ObjectFactory.class})
Using any POSIX awk in any shell on every UNIX box:
$ cat tst.awk
/[[:space:]]*class[[:space:]]*/ {
    inDef = 1
    fname = FILENAME
    sub(".*/","",fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS
    # Remove comments (not perfect but should work for 99.9% of cases)
    sub("//.*","")
    gsub("/[*]|[*]/","\n")
    gsub(/\n[^\n]*\n/,"")
    def = def $0 ORS
    if ( /{/ ) {
        if ( gsub(/,/,"&",def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}
$ find tmp -type f -name '*.java' -exec awk -f tst.awk {} +
multiple-lines.java:1: class ClazzA<R extends A,
multiple-lines.java:2: S extends B<T>, T extends C<T>,
multiple-lines.java:3: U extends D, W extends E,
multiple-lines.java:4: X extends F, Y extends G, Z extends H>
multiple-lines.java:5: extends OtherClazz<S> implements I<T> {
single-line.java:1: class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {
The above was run using this input:
$ head tmp/*
==> tmp/X-no-parameter.java <==
class ClazzC /* no type parameter */ extends OtherClazz<S> implements I<T> {
public void method(Type<A, B> x) {
// ... code ...
}
}
==> tmp/X-one-parameter.java <==
class ClazzD<R extends A> // only one type parameter
extends OtherClazz<S> implements I<T> {
public void method(Type<X, Y> x) {
// ... code ...
}
}
==> tmp/X-two-line-parameters.java <==
class ClazzF<R extends A, // only two type parameters
S extends B<T>> // on two lines
extends OtherClazz<S> implements I<T> {
public void method(Type<X, Y> x) {
// ... code ...
}
}
==> tmp/X-two-parameters.java <==
class ClazzE<R extends A, S extends B<T>> // only two type parameters
extends OtherClazz<S> implements I<T> {
public void method(Type<X, Y> x) {
// ... code ...
}
}
==> tmp/multiple-lines.java <==
class ClazzA<R extends A,
S extends B<T>, T extends C<T>,
U extends D, W extends E,
X extends F, Y extends G, Z extends H>
extends OtherClazz<S> implements I<T> {
public void method(Type<Q, R> x) {
// ... code ...
}
}
==> tmp/single-line.java <==
class ClazzB<R extends A, S extends B<T>, T extends C<T>, U extends D, W extends E, X extends F, Y extends G, Z extends H> extends OtherClazz<S> implements I<T> {
public void method(Type<Q, R> x) {
// ... code ...
}
}
The above is just a best effort without writing a parser for the language, going only by the sample input/output the OP posted for what needs to be handled.
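The false positives reported in the question's update (a Javadoc line ending in "class.", annotations containing ".class") come from the start pattern matching the word class anywhere on a line. The following is only a sketch of one possible tightening, not part of the original answer: it assumes declarations begin a line with an optional modifier list (the list shown is an assumption), and it replaces the comma count with a hypothetical helper, nparams(), that counts commas at nesting depth 1 of the type parameter list. The file names and the temporary directory are made up.

```shell
demo=$(mktemp -d)
cat > "$demo/Generated.java" <<'EOF'
/**
 * Gets a map that isn't bound to any typed property on this class.
 */
@XmlSeeAlso({ObjectFactory.class, Other.class, Third.class, Fourth.class})
public interface SZR {
}
EOF
cat > "$demo/Generic.java" <<'EOF'
public final class ClazzG<R extends A,
        S extends B<T>, T extends C<T>>
    extends OtherClazz<S> implements I<T> {
}
EOF
cat > "$demo/Two.java" <<'EOF'
class ClazzE<R extends A, S extends B<T>> // only two type parameters
    extends OtherClazz<S> implements I<T> {
}
EOF
cat > "$demo/tst2.awk" <<'EOF'
# Start only where "class" is the declaration keyword followed by a name.
/^[[:space:]]*((public|protected|private|abstract|static|final|strictfp)[[:space:]]+)*class[[:space:]]+[[:alnum:]_$]+/ {
    inDef = 1
    fname = FILENAME
    sub(".*/", "", fname)
    def = out = ""
}
inDef {
    out = out fname ":" FNR ": " $0 ORS
    # Same best-effort comment removal as in tst.awk
    sub("//.*", "")
    gsub("/[*]|[*]/", "\n")
    gsub(/\n[^\n]*\n/, "")
    def = def $0 ORS
    if ( /[{]/ ) {
        if ( nparams(def) > 2 ) {
            printf "%s", out
        }
        inDef = 0
    }
}
# Hypothetical helper: number of type parameters in the <...> list that
# directly follows the class name, counting commas at depth 1 only.
function nparams(s,    i, c, depth, n) {
    if ( !match(s, /class[[:space:]]+[[:alnum:]_$]+[[:space:]]*/) ) return 0
    s = substr(s, RSTART + RLENGTH)
    if ( substr(s, 1, 1) != "<" ) return 0
    for ( i = 1; i <= length(s); i++ ) {
        c = substr(s, i, 1)
        if ( c == "<" )      { if ( ++depth == 1 ) n = 1 }
        else if ( c == ">" ) { if ( --depth == 0 ) return n }
        else if ( c == "," && depth == 1 ) n++
    }
    return n + 0
}
EOF
result=$(awk -f "$demo/tst2.awk" "$demo/Generated.java" "$demo/Generic.java" "$demo/Two.java")
printf '%s\n' "$result"
rm -rf "$demo"
```

With these inputs only the three header lines of Generic.java should be printed. Like tst.awk, this stays a heuristic: a declaration whose name is on a different line than the class keyword, or a type parameter list preceded by an annotation on the same line, would still be missed.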
Note: the presence of comments will cause these solutions to fail.
With ripgrep (https://github.com/BurntSushi/ripgrep):
rg -nU --no-heading '(?s)class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java
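-n enables line numbering (this is the default if the output is to the terminal)
-U enables multiline matching
--no-heading: by default, ripgrep displays matching lines grouped under the filename as a header; this option makes ripgrep behave like GNU grep, with the filename prefix on each output line
[^{]* is used instead of .* to prevent matching , and > elsewhere in the file; otherwise lines like public void method(Type<Q, R> x) { would be matched
The -m option can be used to limit the number of matches per input file, which has the additional benefit of not having to search the entire input file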
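If you use the above regexp with GNU grep, note that:
grep matches only one line at a time. If you use the -z option, grep treats ASCII NUL as the record separator, which effectively lets you match across multiple lines, assuming the input has no NUL characters that would prevent such matching. Another effect of the -z option is that a NUL character is appended to each output result (this can be fixed by piping the results to tr '\0' '\n')
The -o option prints only the matching portion, which means you won't get the line number prefix
For the given task, -P isn't needed; grep -zoE 'class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{' *.java | tr '\0' '\n' will give you a result similar to the ripgrep command. However, you won't get the line number prefix, the filename prefix will appear only once per matching portion instead of on each matching line, and you won't get the rest of the line before class and after {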
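The effect of -z described above can be checked on a throwaway file. A minimal sketch, assuming GNU grep (for -z and the \s and \w extensions); the temporary directory and Multi.java are made-up names:

```shell
dir=$(mktemp -d)
printf 'class ClazzA<R extends A,\n  S extends B<T>, T extends C<T>>\n  extends OtherClazz<S> {\n}\n' > "$dir/Multi.java"

re='class\s+\w+\s*<[^{]*,[^{]*,[^{]*>[^{]*\{'

# Line by line: no single line holds "class", two commas and "{" together.
# grep -c exits non-zero when the count is 0, hence the "|| true".
per_line=$(grep -cE "$re" "$dir/Multi.java" || true)

# With -z the whole file is one NUL-terminated record, so the header matches.
whole_file=$(grep -zcE "$re" "$dir/Multi.java" || true)

printf 'per line: %s, whole file: %s\n' "$per_line" "$whole_file"

# Print only the matched header, turning the trailing NUL into a newline.
grep -zoE "$re" "$dir/Multi.java" | tr '\0' '\n'
rm -rf "$dir"
```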