Finding the numbered file whose range contains a given position - pattern matching - from command line

I have a folder of chunked files (>1000 files) whose names follow this format:

text_chr[A]_[numberB]_[numberC]_text.vcf.gz 

numberB to numberC is a range.

Example:

main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz 
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz 
..
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz 
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz 
etc 

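I also have a file (>20000 entries) with the chromosome in column 1, the coordinates in columns 2 and 3, and the gene name in column 4, with lines like the following: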

chr1 1000848 3959403 HAT1 
chr2 83523 85382 JKLP

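Format: A B C Gene_name, where B to C is also a range.

The information about each gene is in one of the files in the folder, so I need to match the gene position against the ranges in the file names. For example, I want to know which files contain the genes HAT1 and JKLP; the answers are main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz and main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz respectively. At the moment I am doing this manually. I need these file names as input to some downstream analysis.

So I need to match A and then, for each entry in the gene list, select the chunk whose range contains B to C.

Is there a way to do this from the command line?

Many thanks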

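Update: OP has updated the question with a more representative set of *.vcf.gz file names.

At this point I'm assuming the file names are of the format ...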

some_text_chrXXX_number1_number2_more_text.vcf.gz
          ^^^^^^^^^^^^^^^^^^^^^^

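... and we're interested in this portion of the file name: chrXXX_number1_number2.

In the proposed code (below) we'll loop through these file names, parse each name into chunks, and then process the chunks. The steps we'll use to parse a file name: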

f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=${f//*_chr/chr}                                    # strip off 'some_text_'
h=${g//.vcf.gz}                                      # strip off '.vcf.gz'
echo "f=${f}"
echo "g=${g}"
echo "h=${h}"

IFS='_' read -r cx n1 n2 stuff <<< "${h}"            # break $h into 4 variables

echo "cx=${cx}"
echo "n1=${n1}"
echo "n2=${n2}"
echo "stuff=${stuff}"                                # catch all for rest of '$h'

This generates:

f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=chrXXX_number1_number2_more_text.vcf.gz
h=chrXXX_number1_number2_more_text
cx=chrXXX
n1=number1
n2=number2
stuff=more_text
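
As an alternative to the parameter-expansion steps above, a regex match can pull out the same three pieces and also reject names that do not follow the assumed format. This is only a sketch: parse_chunk is a made-up helper name, and the pattern assumes number1 and number2 are plain digits followed by _more_text.vcf.gz.

```shell
#!/usr/bin/env bash
# Sketch: extract chrXXX, number1 and number2 from a file name with a regex.
# parse_chunk is a hypothetical helper, not part of gene.bash.
parse_chunk() {
    local re='_(chr[^_]+)_([0-9]+)_([0-9]+)_.*\.vcf\.gz$'
    [[ $1 =~ $re ]] || return 1        # name does not follow the assumed format
    printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}

parse_chunk main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
parse_chunk notes.txt || echo "skipped: notes.txt"
```

The first call prints chr1 1000001 987325982; the second prints skipped: notes.txt because the name does not match the pattern.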


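Assumptions:

  • the details scattered through the comments section of this question apply to this problem
  • a given gene may appear more than once in file1
  • a given numeric range from file1 may match more than one *.vcf.gz file name
  • all files of interest are in the current directory (OP can add the appropriate cd commands to the script as needed)

Sample data: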

$ cat file1
chr1 1000848 3959403 HAT1 
chr2 83523 85382 JKLP
chr3 20000 40000 STEV
chrX 23456 78901 WXYZ

$ ls -1v *vcf.gz
main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz  # match HAT1

main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz           # match JKLP
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz 

main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz            # match STEV
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz        # match STEV
main_programme_ver2_chr3_80001_100000_VEPannot.vcf.gz

main_programme_ver2_chrX_100000_999999_VEPannot.vcf.gz
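
To try this out, the sample data can be recreated in a scratch directory. A minimal sketch, assuming empty placeholder files are good enough (only the file names are used for matching) and that mktemp is available:

```shell
#!/usr/bin/env bash
# Sketch: recreate the sample data in a throwaway directory.
# The .vcf.gz files are empty placeholders; only their names matter here.
workdir=$(mktemp -d)
cd "${workdir}" || exit 1

cat > file1 <<'EOF'
chr1 1000848 3959403 HAT1
chr2 83523 85382 JKLP
chr3 20000 40000 STEV
chrX 23456 78901 WXYZ
EOF

touch main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz \
      main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz \
      main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz \
      main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz \
      main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz \
      main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz \
      main_programme_ver2_chr3_80001_100000_VEPannot.vcf.gz \
      main_programme_ver2_chrX_100000_999999_VEPannot.vcf.gz

echo "sample data created in ${workdir}"
```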

One idea using a couple of bash loops:

$ cat gene.bash
#!/usr/bin/env bash

read -r -p "Gene to search for: " gene

echo "+++++++++++ gene: ${gene}"

found=0

while read -r chr b c name                         # read fields from file1
do
    for f in *_"${chr}"_*.vcf.gz                   # for all files that match 'chr' string from file1 ...
    do
        [[ -e "${f}" ]] || continue                # no file matched; skip the unexpanded pattern

        g=${f//*_chr/chr}                          # strip off 'some_text_'
        h=${g//.vcf.gz}                            # strip off '.vcf.gz'

        IFS='_' read -r cx n1 n2 stuff <<< "${h}"  # break 'h' into chunks based on delimiter '_'

        # the range from file1 (b-c) overlaps the filename range (n1-n2) when it
        # starts at or before n2 and ends at or after n1; this also catches a
        # gene that spans an entire file range

        if [[ "${b}" -le "${n2}" && "${c}" -ge "${n1}" ]]
        then
            echo "${f}"
            found=1
        fi
    done

done < <(awk -v gene="${gene}" '$4 == gene' file1)  # rows of file1 whose 4th column is exactly 'gene'

[[ "${found}" -ne 1 ]] && echo "WARNING: no files found for gene = '${gene}'"

Test run:

$ ./gene.bash
Gene to search for: HAT1
+++++++++++ gene: HAT1
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz

$ ./gene.bash
Gene to search for: JKLP
+++++++++++ gene: JKLP
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz

$ ./gene.bash
Gene to search for: STEV
+++++++++++ gene: STEV
main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz

$ ./gene.bash
Gene to search for: WXYZ
+++++++++++ gene: WXYZ
WARNING: no files found for gene = 'WXYZ'

$ ./gene.bash
Gene to search for: ZZZZ
+++++++++++ gene: ZZZZ
WARNING: no files found for gene = 'ZZZZ'
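
With >20000 entries in the gene list, answering one prompt per gene may get tedious. A possible batch variant, sketched under the same assumptions about the file name format and the layout of file1: it reads the file names once and prints one gene/file pair per overlap. The function name map_genes, the script name gene_batch.bash and the tab-separated output are my own choices, not something from the question.

```shell
#!/usr/bin/env bash
# Sketch: print "gene<TAB>file" for every chunk file whose range overlaps the gene range.
# map_genes is a hypothetical name; arguments: gene list, directory holding the *.vcf.gz files.
map_genes() {
    local genes=$1 dir=${2:-.}
    local f
    for f in "${dir}"/*_chr*_*.vcf.gz; do
        [[ -e ${f} ]] && printf '%s\n' "${f##*/}"      # bare file name, one per line
    done |
    awk '
        kind == "names" {                              # first input: the file names
            name = $0
            sub(/^.*_chr/, "chr", name)                # same stripping as ${f//*_chr/chr}
            split(name, p, "_")                        # p[1]=chrXXX p[2]=number1 p[3]=number2
            n++; file[n] = $0; chr[n] = p[1]; lo[n] = p[2] + 0; hi[n] = p[3] + 0
            next
        }
        NF >= 4 {                                      # second input: rows of the gene list
            for (i = 1; i <= n; i++)
                if (chr[i] == $1 && $2 + 0 <= hi[i] && $3 + 0 >= lo[i])
                    print $4 "\t" file[i]
        }
    ' kind=names - kind=genes "${genes}"
}

# When saved as a script (say gene_batch.bash): ./gene_batch.bash file1 .
if [[ $# -ge 1 ]]; then map_genes "$@"; fi
```

Run against the sample data this should list one line for HAT1, one for JKLP, two for STEV and nothing for WXYZ. Genes with no matching file are simply absent from the output, so a downstream step would have to compare against file1 to find them.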