Use Perl to check if a string has only English characters

Question

I have a file containing entries like this

%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi

I am using this regex to strip everything except the song title.

$line =~ s/.*>|([([\/\_\-:"``+=*].*)|(feat.*)|[?¿!¡\.;&$@%#\|]//g;

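I want to make sure that the only strings printed are those containing only English characters, so in this case it would be the first song title, Ai Wo Qing shut up, and not the next one, because of the è.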

I have tried this

if ( $line =~ m/[^a-zA-z0-9_]*$/ ) {
    print $line;
}
else {
    print "Non-english\n";
}

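I thought this would match only English characters, but it always prints Non-english. I suspect I am just rusty with regular expressions, but I can't find the answer.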

Answer 1

It isn't clear exactly what you need, so here are some observations relating to what you have written.

It is best to use split to divide each line of data on <SEP>, which I presume is a separator. Your question asks for the fourth such field, like this

use strict;
use warnings;
use 5.010;

while ( <DATA> ) {
    chomp;
    my @fields = split /<SEP>/;
    say $fields[3];
}

__DATA__
%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
%TRYYVHU128F933CCB3<SEP>SOCCHZY12AB0185CE6<SEP>Tepr<SEP>Achète-moi

output

Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))
Achète-moi
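The question's substitution also drops everything from the first opening bracket onward. A minimal sketch of doing that after the split, assuming only round and square brackets need handling; the helper name song_title is hypothetical and not part of the original answer:

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: take the fourth <SEP>-separated field and cut it
# at the first opening bracket, roughly as the question's regex does.
sub song_title {
    my ($line) = @_;
    chomp $line;
    my @fields = split /<SEP>/, $line;
    my $title  = $fields[3] // '';    # empty string if the field is missing
    $title =~ s/\s*[(\[].*//s;        # drop " (OT: ...)" and similar tails
    return $title;
}

say song_title('%TRYYVJT128F93506D3<SEP>SOYKCDV12AB0185D99<SEP>Rainie Yang<SEP>Ai Wo Qing shut up (OT: Shotgun(Aka Shot Gun))');
```

This keeps field extraction and clean-up as two separate steps, which is usually easier to debug than one long alternation.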

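Also, the word character class \w matches exactly [a-zA-Z0-9_] (and \W matches the complement) as long as the data is plain ASCII or the /a modifier is in effect; on decoded Unicode strings \w also matches accented letters such as è. With that in mind, you can rewrite your if statement like this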

if ( $line =~ /\W/ ) {
    print "Non-English\n";
}
else {
    print $line;
}
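Note that a space is also a \W character, so a title containing spaces would be reported as Non-English by the test above. A minimal sketch of a stricter check that allows spaces and forces ASCII-only matching with the /a modifier (Perl 5.14 or later); the helper name is_ascii_title and the decision to allow spaces are assumptions, not part of the original answer:

```perl
use strict;
use warnings;
use 5.014;

# Hypothetical helper: true if $title holds only ASCII letters, digits,
# underscores and spaces. Allowing the space is an assumption made here
# because song titles normally contain spaces.
sub is_ascii_title {
    my ($title) = @_;
    return $title =~ /\A[\w ]+\z/a ? 1 : 0;
}

for my $title ( 'Ai Wo Qing shut up', "Ach\x{e8}te moi" ) {
    say is_ascii_title($title) ? 'English' : 'Non-English';
}
```

The \A and \z anchors make the class apply to the whole string rather than to some part of it.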

Answer 2

Following the comments, your problem is probably:

$line =~ m/[^a-zA-z0-9_]*$/

Specifically, the ^ is inside the square brackets, which means it is not acting as an 'anchor'. It is actually a negation operator.

See: http://perldoc.perl.org/perlrecharclass.html#Negation

It is also possible to instead list the characters you do not want to match. You can do so by using a caret (^) as the first character in the character class. For instance, [^a-z] matches any character that is not a lowercase ASCII letter, which therefore includes more than a million Unicode code points. The class is said to be "negated" or "inverted".

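But the important part is that without a 'start of line' anchor, your regex asks for zero or more instances (of whatever) before the end of the line, so it will match pretty much anything, because it is free to ignore the rest of the line's content.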
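A small sketch of the difference, assuming that spaces should count as acceptable characters; the variable names are illustrative only. The unanchored, negated pattern succeeds on every input because it can match the empty string at the end, while the anchored, positive pattern has to account for every character:

```perl
use strict;
use warnings;

# Negated class, zero or more repetitions, no start anchor:
# it can always match the empty string just before the end.
my $unanchored = qr/[^a-zA-Z0-9_]*$/;

# Positive class, anchored at both ends: every character must be allowed.
my $anchored = qr/\A[a-zA-Z0-9_ ]*\z/;

for my $title ( 'Ai Wo Qing shut up', "Ach\x{e8}te-moi" ) {
    printf "unanchored=%d anchored=%d\n",
        ( $title =~ $unanchored ? 1 : 0 ),
        ( $title =~ $anchored   ? 1 : 0 );
}
```

Both titles report unanchored=1; only the first reports anchored=1.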

(Borodin's answer covers some of the other options for this sort of pattern match, so I won't reproduce them.)
