Find a word in a large file and copy the lines that contain that word

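I have two files, File_A and File_B. File_A contains one word per line, and File_B contains sentences. I need to read each word from File_A, find the lines in File_B that start with that word, and copy those whole lines to File_C. Both File_A and File_B are sorted.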

For example:

File_A :

he
I
there

File_B :

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

File_C:

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.

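I tried a shell script, but it takes a naive approach (it rescans File_B for every word in File_A), so it takes a very long time. Both File_A and File_B are large files.

Here is the code I tried: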

#!/bin/bash

for first in $(cat File_A)
do
    while read -r line
    do
        # first whitespace-separated word of the current line
        first_col=$(echo "$line" | awk '{print $1}')
        if [[ "$first" == "$first_col" ]]
        then
            echo "$line" >> File_C
        fi
    done < File_B
done

In a shell that understands <() process substitution (such as bash or zsh, but not POSIX sh), using GNU grep:

grep -wf <(sed 's/^/^/' file_a) file_b > file_c

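-f filename reads the list of patterns from the given file, in this case the output of sed 's/^/^/' file_a, which puts a ^ start-of-line anchor at the beginning of each line (this will not work correctly if file_a contains characters that are special in regular expressions). -w matches whole words only, to avoid the case where one of your words is merely a prefix of the first word of a line.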
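
If file_a might contain regex metacharacters, one alternative is to compare the first field as a plain string instead of building patterns. Below is a minimal sketch with awk. The function name copy_matching_lines and the temporary demo directory are made up for illustration and are not part of the answer above. It assumes the word list fits in memory and that "starts with the word" means "the first whitespace-separated field equals the word", as in the script from the question.

```shell
#!/bin/bash

# Hypothetical helper (name chosen for illustration only):
#   $1 = word file, one word per line
#   $2 = sentence file
# Prints every line of $2 whose first whitespace-separated field
# is exactly one of the words in $1.
copy_matching_lines() {
    awk '
        # First file: remember each non-empty word as an array key.
        NR == FNR { if (NF) wanted[$1]; next }
        # Second file: print the line when its first field is a stored word.
        $1 in wanted
    ' "$1" "$2"
}

# Small demo with a subset of the sample data from the question.
demo_dir=$(mktemp -d)
printf '%s\n' he I there > "$demo_dir/File_A"
printf '%s\n' \
    "he was at least equally intrigued by hers." \
    "I guess he's going to use it in his business." \
    "there were five dogs." \
    "we don't know what he is doing." > "$demo_dir/File_B"

copy_matching_lines "$demo_dir/File_A" "$demo_dir/File_B" > "$demo_dir/File_C"
cat "$demo_dir/File_C"
```

This reads each file once, so the cost grows with the combined size of the two files rather than with their product. Two differences from grep -w are worth keeping in mind: a line starting with "he's" is not matched by the word "he" here (grep -w would match it, because the apostrophe ends the word), and an empty word file makes the NR == FNR test true for the sentence file as well, so nothing is printed.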

Please see the following code, which is based on your shell script.

use strict;
use warnings;
use feature 'say';

my $file_a = 'File_A';
my $file_b = 'File_B';
my $file_c = 'File_C';

# read File_A into array @data_a
open my $fh_a, '<', $file_a
    or die "Couldn't open $file_a $!";

my @data_a = <$fh_a>;

close $fh_a;

# read File_B into array @data_b
open my $fh_b, '<', $file_b
    or die "Couldn't open $file_b $!";

my @data_b = <$fh_b>;

close $fh_b;

chomp @data_a;      # snip eol
chomp @data_b;      # snip eol

# store found result into File_C
open my $fh_c, '>', $file_c
    or die "Couldn't open $file_c $!";

for my $word_a (@data_a) {
    for my $line_b (@data_b) {
        say $fh_c $line_b if $line_b =~ /^$word_a\b/;
    }
}

close $fh_c;

Input File_A

he
I
there

Input File_B

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.
we don't know what he is doing.

Result File_C

he was at least equally intrigued by hers.
I guess he's going to use it in his business.
I don't know if he's angry or not.
there were five dogs.
there is fly in my soup.

Something similar in Perl:

#!/usr/bin/perl

use strict;
use warnings;

# Open File_A
open my $fh_a, '<', 'File_A' or die $!;

# Read words from File_A and remove newlines
chomp(my @words = <$fh_a>);

# Create a regex matching the words from File_A
# at the start of a line
my $word_re = '^(' . join('|', @words) . ')\b';
$word_re = qr($word_re);

# Open files B and C
open my $fh_b, '<', 'File_B' or die $!;
open my $fh_c, '>', 'File_C' or die $!;

# Read File_B a line at a time and write to
# File_C any lines that match our regex.
while (<$fh_b>) {
  print $fh_c $_ if /$word_re/;
}