Ubuntu: Perl is misreading filenames with Cyrillic characters
I have a lot of files whose names contain Cyrillic letters, for example Deceasedя0я0.25я3.xgboost.json
I read these files in with a function:
use Devel::Confess 'color';
use utf8;
use autodie ':all';
use open ':std', ':encoding(UTF-8)';
use JSON 'decode_json';
sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read it unmangled
    local $/;                             # Read whole file
    my $json = <$fh>;                     # This is UTF-8
    my $ref = decode_json($json);         # This produces decoded text
    return $ref;                          # Return the ref rather than the keys and values.
}
I got this function from elsewhere.
The problem is that Perl reads the file name as DeceasedÑ0Ñ0.2Ñ3.xgboost.json, i.e. it turns я into Ñ, which means the files don't show up when I do a regex search.
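The Ñ is consistent with the name arriving as raw UTF-8 bytes that are then treated as one character per byte. A minimal sketch of that effect, assuming the file names are stored as UTF-8 on disk (hex_bytes is a hypothetical helper, not part of the code above):

```perl
use strict;
use warnings;
use utf8;                                   # the literal я below is source text
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: render each byte of a byte string as uppercase hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02X', ord $_) } split //, $bytes;
}

my $char  = 'я';                  # one character, U+044F
my $bytes = encode_utf8($char);   # the two bytes a UTF-8 file name holds for it

# Reading a byte as if it were a whole character is what produces Ñ:
# 0xD1 is U+00D1 (Ñ); the second byte, 0x8F, is an invisible control character.
my $first = substr $bytes, 0, 1;

binmode STDOUT, ':encoding(UTF-8)';
printf "bytes              : %s\n", hex_bytes($bytes);   # D1 8F
printf "chars vs bytes     : %d vs %d\n", length $char, length $bytes;
printf "first byte as char : U+%04X\n", ord $first;
printf "decodes back to я  : %s\n", decode_utf8($bytes) eq $char ? 'yes' : 'no';
```

A regex compiled under use utf8 contains the single character я, so it cannot match a name that still holds the two separate bytes.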
The file names are read like this:
sub list_regex_files {
    my $regex = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift
    }
    my @files;
    opendir (my $dh, $directory);
    $regex = qr/$regex/;
    while (my $file = readdir $dh) {
        if ($file !~ $regex) {
            next
        }
        if ($file =~ m/^\.{1,2}$/) {
            next
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file
            } else {
                push @files, $f
            }
        }
    }
    @files
}
However, if I comment out the following two lines, the files do show up in the regex search:
use utf8;
use open ':std', ':encoding(UTF-8)';
But then I get errors when I try to read the files (the error below is for a different file):
Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at /home/con/perl5/perlbrew/perls/perl-5.32.1/lib/5.32.1/Carp.pm line 605, <$_[...]> chunk 1.
Use of uninitialized value $/ in string eq at 4.best.params.pl line 32, <$_[...]> chunk 1.
main::json_file_to_ref("data/Deceased\x{d1}\x{8f}0\x{d1}\x{8f}0.15\x{d1}\x{8f}3.xgboost.json") called at 4.best.params.pl line 140
I've looked at posts like How do I write a file whose *filename* contains utf8 characters in Perl? and Perl newbie first experience with Unicode (in filename, -e operator, open operator, and cmd window), but I'm not using Windows.
I also tried use feature 'unicode_strings', which didn't help.
I also tried
use Encode 'decode_utf8';
sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', decode_utf8($json_filename); # Read it unmangled
    local $/;                             # Read whole file
    my $json = <$fh>;                     # This is UTF-8
    my $ref = decode_json($json);         # This produces decoded text
    return $ref;                          # Return the ref rather than the keys and values.
}
but this produces the same error message.
I also tried
use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';
as suggested in Reading Cyrillic characters from file in perl, but this also fails.
How can I get Perl on Linux to read the file names as they are written, through that subroutine?
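As @Ed Sabol pointed out, the problem was the characters in the file names and how the file names are read.
The key change is replacing readdir $dh with decode_utf8(readdir $dh), which lets Perl handle the non-Latin (Cyrillic) file names. The Encode library must also be loaded: use Encode 'decode_utf8';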
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use autodie ':all';
use Devel::Confess 'color';
use feature 'say';
use JSON 'decode_json';
use utf8;
use DDP;
use Encode 'decode_utf8'; # necessary for Cyrillic characters
use open ':std', ':encoding(UTF-8)'; # For say to STDOUT. Also default for open()
sub json_file_to_ref {
    my $json_filename = shift;
    open my $fh, '<:raw', $json_filename; # Read it unmangled
    local $/;                             # Read whole file
    my $json = <$fh>;                     # This is UTF-8
    my $ref = decode_json($json);         # This produces decoded text
    return $ref;                          # Return the ref rather than the keys and values.
}
sub list_regex_files {
    my $regex = shift;
    my $directory = '.';
    if (defined $_[0]) {
        $directory = shift
    }
    my @files;
    opendir (my $dh, $directory);
    $regex = qr/$regex/;
    while (my $file = decode_utf8(readdir $dh)) { # decode the raw bytes into characters
        if ($file !~ $regex) {
            next
        }
        if ($file =~ m/^\.{1,2}$/) {
            next
        }
        my $f = "$directory/$file";
        if (-f $f) {
            if ($directory eq '.') {
                push @files, $file
            } else {
                push @files, $f
            }
        }
    }
    @files
}
my @files = list_regex_files('я.json$');
p @files;
my $data = json_file_to_ref('я.json');
p $data;
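For anyone who wants to be more explicit about where bytes become characters, here is a sketch of a slightly more defensive variant. It is not the code I actually run. It assumes every file name is valid UTF-8 and that the directory argument is a character string; list_matching_files and slurp_bytes are hypothetical names. It decodes names as they come in from readdir and encodes them again before every filesystem call, and it wraps readdir in defined() so that a file literally named 0 does not end the loop early.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical helper: plain files in $directory whose decoded names match $regex.
sub list_matching_files {
    my ($regex, $directory) = @_;
    $directory //= '.';
    $regex = qr/$regex/;

    opendir(my $dh, encode_utf8($directory))
        or die "Cannot open directory $directory: $!";
    my @files;
    while (defined(my $raw = readdir $dh)) {    # $raw holds bytes
        next if $raw eq '.' || $raw eq '..';
        my $name = decode_utf8($raw);           # bytes -> characters, for matching
        next unless $name =~ $regex;
        my $path       = $directory eq '.' ? $name : "$directory/$name";
        my $path_bytes = encode_utf8($path);    # characters -> bytes, for the filesystem
        next unless -f $path_bytes;
        push @files, $path;                     # callers get character strings
    }
    closedir $dh;
    return @files;
}

# Hypothetical helper: slurp a file as raw bytes, given a character-string path.
sub slurp_bytes {
    my ($path) = @_;
    open(my $fh, '<:raw', encode_utf8($path)) or die "Cannot open $path: $!";
    local $/;                                   # slurp mode
    my $content = <$fh>;
    close $fh;
    return $content;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for list_matching_files(qr/я.*\.json\z/);
```

The bytes returned by slurp_bytes can be passed straight to decode_json, which expects UTF-8 encoded input.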
As an aside, with Perl 7 on the way, handling of non-Latin characters seems like a default that should be changed to something sensible.