Find a word in a large file and copy the lines that start with that word
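I have two files, File_A and File_B. File_A contains one word per line, and File_B contains sentences. I have to read each word from File_A, find the lines in File_B that start with that word, and copy each whole line to File_C. Both File_A and File_B are sorted.
For example: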
File_A :
he
I
there
File_B :
he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.
File_C:
he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
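I tried a shell script, but it is a naive approach that rescans File_B for every word in File_A, so it takes a very long time. Both File_A and File_B are large files.
Here is the code I tried: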
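#!/bin/bash
# For every word in File_A, rescan all of File_B (slow: words x lines comparisons).
while read -r first
do
    while read -r line
    do
        # Extract the first whitespace-separated field of the line.
        first_col=$(echo "$line" | awk '{print $1}')
        if [[ "$first" == "$first_col" ]]
        then
            echo "$line" >> File_C
        fi
    done < File_B
done < File_A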
In a shell that understands <() process substitution (such as bash or zsh, but not POSIX sh), using GNU grep:
grep -wf <(sed 's/^/^/' file_a) file_b > file_c
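-f filename reads the list of patterns/words from the given file, in this case the output of sed 's/^/^/' file_a, which puts a ^ start-of-line anchor at the beginning of each line (this will not work properly if your file_a contains characters that are special in regular expressions). -w matches only whole words, to avoid the case where one of your words is merely a prefix of the first word in a line.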
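The caveat about special characters can be worked around by escaping them before adding the anchor. The following is only a minimal sketch, assuming GNU grep and sed; the sample words and sentences are invented for illustration. It pipes the patterns to grep through -f - (patterns read from standard input) instead of <(), so it should also run in a plain sh:

```shell
#!/bin/sh
# Sketch: escape basic-regex metacharacters in each word, then anchor it with ^.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Hypothetical sample data: the first word contains a regex metacharacter.
printf '%s\n' 'a.b' 'he' > file_a
printf '%s\n' 'a.b is literal' 'axb should not match' \
              'he was here' 'hers is different' > file_b

# 1st expression: put a backslash before each of \ $ * . ^ [
# 2nd expression: prepend the start-of-line anchor.
sed -e 's/[\$*.^[]/\\&/g' -e 's/^/^/' file_a |
    grep -wf - file_b > file_c

cat file_c
```

With the unescaped pattern ^a.b, the line starting with "axb" would also be copied, because . matches any character; after escaping, only the literal "a.b" line and the "he" line are selected.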
Please have a look at the following code, written based on your shell script.
use strict;
use warnings;
use feature 'say';
my $file_a = 'File_A';
my $file_b = 'File_B';
my $file_c = 'File_C';
# read File_A into array @data_a
open my $fh_a, '<', $file_a
    or die "Couldn't open $file_a: $!";
my @data_a = <$fh_a>;
close $fh_a;
# read File_B into array @data_b
open my $fh_b, '<', $file_b
    or die "Couldn't open $file_b: $!";
my @data_b = <$fh_b>;
close $fh_b;
chomp @data_a; # snip eol
chomp @data_b; # snip eol
# store found result into File_C
open my $fh_c, '>', $file_c
    or die "Couldn't open $file_c: $!";
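# compare every word with every line (words x lines comparisons)
for my $word_a (@data_a) {
    for my $line_b (@data_b) {
        # \Q...\E quotes regex metacharacters that may occur in the word
        say $fh_c $line_b if $line_b =~ /^\Q$word_a\E\b/;
    }
}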
close $fh_c;
Input File_A
he
I
there
Input File_B
he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.
Result File_C
he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
Something similar in Perl:
#!/usr/bin/perl
use strict;
use warnings;
# Open File_A
open my $fh_a, '<', 'File_A' or die $!;
# Read words from File_A and remove newlines
chomp(my @words = <$fh_a>);
# Create a regex matching the words from File_A
# at the start of a line
# (quotemeta keeps regex metacharacters in a word from being interpreted)
my $word_re = '^(' . join('|', map { quotemeta } @words) . ')\b';
$word_re = qr($word_re);
# Open files B and C
open my $fh_b, '<', 'File_B' or die $!;
open my $fh_c, '>', 'File_C' or die $!;
# Read File_B a line at a time and write to
# File_C any lines that match our regex.
while (<$fh_b>) {
    print $fh_c $_ if /$word_re/;
}
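The same single-pass idea can also be sketched without a regex, using an awk lookup table. This is an illustrative alternative, not part of the answers above, and it assumes that "starts with the word" means the first whitespace-separated field equals the word exactly. Under that assumption a line beginning with "he's" would not be selected for the word "he", whereas the \b and -w variants would select it.

```shell
#!/bin/sh
# Sketch: load the words into an awk array, then stream File_B once.
set -eu
tmp=$(mktemp -d)
cd "$tmp"

# Sample data taken from the question.
printf '%s\n' he I there > File_A
cat > File_B <<'EOF'
he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.
EOF

# NR == FNR holds only while awk reads the first file (File_A):
# store each word as an array key and skip to the next input line.
# For File_B, print the line when its first field is a stored word.
awk 'NR == FNR { words[$1]; next } $1 in words' File_A File_B > File_C

cat File_C
```

Each line of File_B costs one array lookup, so the work grows with the size of File_A plus File_B instead of their product, and no characters in File_A need escaping.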