Efficient way to add/append huge files
Below is a shell script that processes a huge file. It reads a fixed-length file line by line, takes substrings, and appends them to another file as a delimited file. It works fine, it's just too slow.
array=() # Create array
while IFS='' read -r line || [[ -n "$line" ]] # Read a line
do
coOrdinates="$(echo -e "${line}" | grep POSITION | cut -d'(' -f2 | cut -d')' -f1 | cut -d':' -f1,2)"
if [[ -z "${coOrdinates// }" ]];
then
echo "Not adding"
else
array+=("$coOrdinates")
fi
done < "_CTRL.txt"
while read -r line;
do
result='"'
for e in "${array[@]}"
do
SUBSTRING1=`echo "$e" | sed 's/.*://'`
SUBSTRING=`echo "$e" | sed 's/:.*//'`
result1=`perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)"`
result1="$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
result=$result$result1'"'',''"'
done
echo $result >> _1.txt
done < ".txt"
I previously used the cut command and changed the script to the above, but there was no improvement in run time. Please suggest what changes could be made to reduce the processing time.
Thanks in advance.
Update:
Sample contents of the input file:
XLS01G702012 000034444132412342134
Control file:
OPTIONS (DIRECT=TRUE, ERRORS=1000, rows=500000) UNRECOVERABLE
load data
CHARACTERSET 'UTF8'
TRUNCATE
into table icm_rls_clientrel2_hg
trailing nullcols
(
APP_ID POSITION(1:3) "TRIM(:APP_ID)",
RELATIONSHIP_NO POSITION(4:21) "TRIM(:RELATIONSHIP_NO)"
)
Output file:
"LS0","1G702012 0000"
Updated answer
Here is my version using awk to parse the control file, saving the character positions and then using them while parsing the input file:
awk '
/APP_ID/ {
sub(/\).*/,"") # Strip closing parenthesis and all that follows
sub(/^.*\(/,"") # Strip everything up to opening parenthesis
split($0,a,":") # Extract the two character positions separated by colon into array "a"
next
}
/RELATIONSHIP/ {
sub(/\).*/,"") # Strip closing parenthesis and all that follows
sub(/^.*\(/,"") # Strip everything up to opening parenthesis
split($0,b,"[():]") # Extract character positions into array "b"
next
}
FNR==NR{next}
{ f1=substr($0,a[1]+1,a[2]); f2=substr($0,b[1]+1,b[2]); printf("\"%s\",\"%s\"\n",f1,f2)}
' ControlFile InputFile
Original answer
This is not a complete, rigorous answer, but it should give you an idea of how to do the extraction with awk once you have the POSITION parameters from the control file:
awk -v a=2 -v b=3 -v c=5 -v d=21 '{f1=substr($0,a,b); f2=substr($0,c,d); printf("\"%s\",\"%s\"\n",f1,f2)}' InputFile
Sample output
"LS0","1G702012 00003"
Try running it on your large input file to get an idea of the performance, then adjust the output. Reading the control file is not time-critical at all, so don't bother optimizing that part.
I suggest using pure bash and avoiding subshells:
if [[ $line =~ POSITION ]] ; then # grep POSITION
coOrdinates="${line#*(}" # cut -d'(' -f2
coOrdinates="${coOrdinates%)*}" # cut -d')' -f1
coOrdinates="${coOrdinates/:/ }" # cut -d':' -f1,2
if [[ -z "${coOrdinates// }" ]]; then
echo "Not adding"
else
array+=("$coOrdinates")
fi
fi
More efficient, per gniourf_gniourf:
if [[ $line =~ POSITION\(([[:digit:]]+):([[:digit:]]+)\) ]]; then
array+=( "${BASH_REMATCH[*]:1:2}" )
fi
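As a quick sanity check, here is a small sketch (the sample line is taken from the control file above) showing what the capture groups contain:
line='APP_ID POSITION(1:3) "TRIM(:APP_ID)",'
if [[ $line =~ POSITION\(([[:digit:]]+):([[:digit:]]+)\) ]]; then
    # BASH_REMATCH[1] and BASH_REMATCH[2] hold the two numbers
    echo "${BASH_REMATCH[*]:1:2}"   # prints: 1 3
fi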
Similarly:
SUBSTRING1=${e#*:} # $( echo "$e" | sed 's/.*://' )
SUBSTRING=${e%:*} # $( echo "$e" | sed 's/:.*//' )
# to confirm, I don't know perl substr
result1=${line:$SUBSTRING:$SUBSTRING1} # $( perl -e "print substr('$line', $SUBSTRING,$SUBSTRING1)" )
#result1= # "$(echo -e "${result1}" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')"
# trim, if necessary?
result1="${result1%${result1##*[^[:space:]]}}" # right
result1="${result1#${result1%%[^[:space:]]*}}" # left
gniourf_gniourf suggests keeping grep out of the loop:
while read ...; do
...
done < <(grep POSITION ...)
For efficiency: while/read loops are very slow in Bash, so pre-filtering as much as possible speeds the processing up considerably.
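Applied to the control-file loop from the question (the file name _CTRL.txt is taken from the question; this is only a sketch of the pre-filtered form):
array=()
while IFS= read -r line                 # only lines containing POSITION reach the loop
do
    coOrdinates="${line#*(}"            # strip up to the opening parenthesis
    coOrdinates="${coOrdinates%%)*}"    # strip the closing parenthesis and the rest
    [[ -n "${coOrdinates// }" ]] && array+=("$coOrdinates")
done < <(grep POSITION "_CTRL.txt")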
To avoid the (slow) while loop altogether, you can use cut and paste:
#!/bin/bash
inFile=${1:-checkHugeFile}.in
ctrlFile=${1:-checkHugeFile}_CTRL.txt
outFile=${1:-checkHugeFile}.txt
cat /dev/null > $outFile
typeset -a array # Create array
while read -r line # Read a line
do
coOrdinates="${line#*(}"
coOrdinates="${coOrdinates%%)*}"
[[ -z "${coOrdinates// }" ]] && { echo "Not adding"; continue; }
array+=("$coOrdinates")
done < <(grep POSITION "$ctrlFile" )
echo coOrdinates: "${array[@]}"
for e in "${array[@]}"
do
nr=$((nr+1))
start=${e%:*}
len=${e#*:}
from=$(( start + 1 ))
to=$(( start + len + 1 ))
cut -c$from-$to $inFile > ${outFile}.$nr
done
paste $outFile.* | sed -e 's/^/"/' -e 's/\t/","/' -e 's/$/"/' >${outFile}
rm $outFile.[0-9]
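Assuming the script above is saved as, say, checkHugeFile.sh (the script name is illustrative; the data files must follow the .in / _CTRL.txt / .txt naming pattern it builds from the base name), it could be run like this:
chmod +x checkHugeFile.sh
./checkHugeFile.sh           # uses the default base name "checkHugeFile"
./checkHugeFile.sh mydata    # reads mydata.in and mydata_CTRL.txt, writes mydata.txt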
perl:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use feature qw(say);   # "say" is not enabled by default under strict/warnings
# read the control file
my $ctrl;
{
local $/ = "";
open my $fh, "<", shift @ARGV;
$ctrl = <$fh>;
close $fh;
}
my @positions = ( $ctrl =~ /\((\d+):(\d+)\)/g );
# read the data file
open my $fh, "<", shift @ARGV;
while (<$fh>) {
my @words;
for (my $i = 0; $i < scalar(@positions); $i += 2) {
push @words, substr($_, $positions[$i], $positions[$i+1]);
}
say join ",", map {qq("$_")} @words;
}
close $fh;
perl parse.pl x_CTRL.txt x.txt
"LS0","1G702012 00003"
This differs from the result you asked for:
- In the control file's POSITION(m:n) syntax, is n a length or an index?
- In the data file, are those characters spaces or tabs?