Count quickly until the first match in a megastring, then stop

Question

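I want to count the number of characters that come before the pattern 030 in a megarow, without reading any data past that point, so that the whole megarow is never loaded into memory. For the data below, the result should be 28.

Megastring data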

48000000fe5a1eda480000000d00030001000000cd010000020000000000000000000000000000000000000000000000000000000200000001000000ffffffff57ea5e55ff640c00585e0000fe5a1eda480000000d00030007000000cd010000010000000000000002000000000000800000000000000000000000

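My first idea was to split at the first instance of 030, but I could not get that to work. I also don't know whether the split command can stop reading once it reaches the end of the pattern.

How can you count quickly up to the first match?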

Answer 1

If your megarow is in a file named megarow_file, you can do something like this:

#!/bin/bash

INPUT=megarow_file
SEARCH_STRING="030"
pattern_length=${#SEARCH_STRING}

char_count=0    # characters read so far
window=""       # the most recent pattern_length characters
found=0

# Read one character at a time so nothing past the match is consumed.
while IFS= read -r -n1 char
do
    char_count=$((char_count + 1))
    window="${window}${char}"

    # Drop the oldest character once the window is longer than the pattern.
    if [ "${#window}" -gt "$pattern_length" ]; then
        window="${window:1}"
    fi

    if [ "$window" = "$SEARCH_STRING" ]; then
        found=1
        break
    fi
done < "$INPUT"

if [ "$found" -eq 1 ]; then
    echo "Length up to ${SEARCH_STRING} was $((char_count - pattern_length)) characters"
else
    echo "${SEARCH_STRING} was not found in ${INPUT}" >&2
    exit 1
fi
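
The loop above calls read once per character, which can be slow on a very large megarow. As a rough sketch of a faster variant (not part of the original answer), the file can be read in fixed-size chunks while keeping a short tail of each chunk so that a match spanning two chunks is still found. The function name count_before_match and the default chunk size of 4096 are assumptions made for illustration, and read -N requires bash 4.1 or later.

```bash
#!/bin/bash
# Hypothetical helper (not from the original answer); needs bash >= 4.1 for `read -N`.
# Usage: count_before_match FILE PATTERN [CHUNK_SIZE]
count_before_match() {
    local file=$1 pattern=$2 chunk_size=${3:-4096}
    local overlap=$(( ${#pattern} - 1 ))  # tail kept so a match can span two chunks
    local consumed=0                      # characters already dropped from the front of buf
    local buf="" chunk prefix

    # `read -N` returns non-zero on a short final chunk, so also accept leftover data.
    while IFS= read -r -N "$chunk_size" chunk || [ -n "$chunk" ]; do
        buf+=$chunk
        if [[ $buf == *"$pattern"* ]]; then
            prefix=${buf%%"$pattern"*}    # everything before the first match
            echo $(( consumed + ${#prefix} ))
            return 0
        fi
        # Discard everything except the last `overlap` characters.
        if (( ${#buf} > overlap )); then
            consumed=$(( consumed + ${#buf} - overlap ))
            buf=${buf:${#buf}-overlap}
        fi
    done < "$file"
    return 1                              # pattern not found
}
```

Called as count_before_match megarow_file 030, this should print 28 for the sample data in the question. The function stops consuming input after the chunk that contains the first match, so at most one chunk plus the short tail is held in memory at a time.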

Answer 2

A comparison of GNU awk and BSD awk, prompted by BlueMoon's comment:

$ time cat megaRow | awk '{print index($0, "fafafafa")-1}'
48584    
real    1m13.489s
user    1m11.608s
sys 0m4.685s

$ time cat megaRow | gawk '{print index($0, "fafafafa")-1}'
48584    
real    1m12.792s
user    1m8.845s
sys 0m4.933s

GNU awk is slightly faster here, but not significantly so: the difference lies within the measurement uncertainty.
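
Both runs above call index() on the whole record, so awk has to load the entire megarow before it can search it. One possible way to let awk stop early, sketched here under some assumptions, is to use the search pattern itself as the record separator: the first record is then exactly the text before the first match, and exit ends the run right after it. A multi-character RS is treated as a regular expression by gawk and mawk, but POSIX leaves this behaviour unspecified and older BSD awk versions may use only the first character, so this may not work everywhere. The function name count_before_match_awk is hypothetical, and the pattern must not contain regex metacharacters unless they are escaped.

```bash
#!/bin/bash
# Hypothetical helper; assumes an awk that accepts a multi-character RS
# (gawk and mawk do, POSIX does not guarantee it).
# Usage: count_before_match_awk FILE PATTERN
count_before_match_awk() {
    local file=$1 pattern=$2
    # The first record is everything before the first PATTERN; print its length and stop.
    awk -v RS="$pattern" 'NR == 1 { print length($0); exit }' "$file"
}
```

Called as count_before_match_awk megarow_file 030, this should print 28 for the sample data in the question. If the pattern never occurs, the first record is the whole input and its full length is printed instead; in gawk, testing whether RT is empty would be one way to detect that case.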

unix

count