Finding the numbered file whose range contains a given position - pattern matching - from the command line
I have a folder containing chunked files (>1000 files) named in the format:
text_chr[A]_[numberB]_[numberC]_text.vcf.gz
numberB to numberC is a range.
Examples:
main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
..
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz
etc
I also have a file (>20000 entries) with the chromosome in column 1, the coordinates in columns 2 and 3, and the gene name in column 4, with lines like:
chr1 1000848 3959403 HAT1
chr2 83523 85382 JKLP
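Format: A B C Gene_name, where B to C is also a range.
The information about each gene is in one of the files in the folder described above, so I need to match the gene position against the ranges in the file names. For example, I want to know which files contain the genes HAT1 and JKLP; the answers are main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz and main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz, respectively. At the moment I am doing this manually. I need these file names as input for some downstream analysis.
So I need to match A and then select the chunk whose range contains B to C from the gene list.
Is there a way to do this from the command line?
Many thanks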
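Update: the OP has updated the question with a more representative set of *.vcf.gz file names.
At this point I'm assuming the file names are of the format ...
some_text_chrXXX_number1_number2_more_text.vcf.gz
          ^^^^^^^^^^^^^^^^^^^^^^
... where we're interested in this portion of the file name: chrXXX_number1_number2.
In the proposed code (below) we'll loop through these file names, parse each name into chunks, and then process the chunks. The steps we'll use to parse a file name: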
f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=${f//*_chr/chr} # strip off 'some_text_'
h=${g//.vcf.gz} # strip off '.vcf.gz'
echo "f=${f}"
echo "g=${g}"
echo "h=${h}"
IFS='_' read -r cx n1 n2 stuff <<< "${h}" # break $h into 4 variables
echo "cx=${cx}"
echo "n1=${n1}"
echo "n2=${n2}"
echo "stuff=${stuff}" # catch all for rest of '$h'
This generates:
f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=chrXXX_number1_number2_more_text.vcf.gz
h=chrXXX_number1_number2_more_text
cx=chrXXX
n1=number1
n2=number2
stuff=more_text
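As an alternative to the two substitutions plus read, the same three fields can probably be pulled out with a single regex match. This is only a sketch: it assumes bash (for [[ =~ ]] and BASH_REMATCH) and file names laid out as shown above with purely numeric range fields; the function name parse_chunk_name is made up for illustration.

```shell
#!/usr/bin/env bash
# parse_chunk_name (hypothetical helper): print "chrom start end" for a name
# such as main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
parse_chunk_name() {
    # assumed layout: <text>_chr<X>_<digits>_<digits>_<text>.vcf.gz
    local re='_(chr[^_]+)_([0-9]+)_([0-9]+)_.*\.vcf\.gz$'
    if [[ $1 =~ $re ]]; then
        printf '%s %s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        return 1    # name does not follow the assumed layout
    fi
}

parse_chunk_name main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
# prints: chr1 1000001 987325982
```

A non-zero return status makes it easy to skip files that do not follow the naming scheme instead of comparing against garbage values.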
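Assumptions:
- the details scattered through the comments section of this question also apply here
- a given gene may show up in file1 more than once
- a given number range from file1 may match more than 1 *.vcf.gz file name
- all files of interest are in the current directory (the OP can add the appropriate cd commands to the script as needed)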
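Because a range from file1 may match more than one file name, the comparison is really an interval-overlap test: a gene range b-c touches a chunk n1-n2 whenever b <= n2 and c >= n1. A small sketch of that test, assuming both ranges are inclusive with start <= end and plain decimal numbers without leading zeros; the function name ranges_overlap is hypothetical.

```shell
#!/usr/bin/env bash
# ranges_overlap (hypothetical helper): succeed if inclusive range b-c
# shares at least one position with inclusive range n1-n2
ranges_overlap() {
    local b=$1 c=$2 n1=$3 n2=$4
    (( b <= n2 && c >= n1 ))
}

# STEV (20000-40000) against two of the chr3 chunks from the question
ranges_overlap 20000 40000 1 25000      && echo "overlap"
ranges_overlap 20000 40000 80001 100000 || echo "no overlap"
```

Written this way, the test also succeeds when a gene is longer than a chunk and covers it completely, a case that checking only the two gene endpoints would miss.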
Sample data:
$ cat file1
chr1 1000848 3959403 HAT1
chr2 83523 85382 JKLP
chr3 20000 40000 STEV
chrX 23456 78901 WXYZ
$ ls -1v *vcf.gz
main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz # match HAT1
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz # match JKLP
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz
main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz # match STEV
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz # match STEV
main_programme_ver2_chr3_80001_100000_VEPannot.vcf.gz
main_programme_ver2_chrX_100000_999999_VEPannot.vcf.gz
One idea using a couple of bash loops:
$ cat gene.bash
#!/usr/bin/bash
read -r -p "Gene to search for: " gene
echo "+++++++++++ gene: ${gene}"
found=0
while read -r chr b c stuff                          # read fields from file1
do
    for f in *_"${chr}"_*                            # for all files that match 'chr' string from file1 ...
    do
        [[ -e "${f}" ]] || continue                  # glob matched nothing; skip the literal pattern
        g=${f//*_chr/chr}
        h=${g//.vcf.gz}
        IFS='_' read -r cx n1 n2 rest <<< "${h}"     # break 'h' into chunks based on delimiter '_'
        # the range from file1 (b-c) overlaps the filename range (n1-n2) if it starts at or
        # before n2 and ends at or after n1; this also covers a gene that spans a whole chunk
        [[ "${b}" -le "${n2}" ]] && [[ "${c}" -ge "${n1}" ]] && echo "${f}" && found=1
    done
done < <(awk -v gene="${gene}" '$4 == gene' file1)   # rows of file1 whose 4th column is exactly 'gene'
[[ "${found}" -ne 1 ]] && echo "WARNING: no files found for gene = '${gene}'"
Test runs:
$ ./gene.bash
Gene to search for: HAT1
+++++++++++ gene: HAT1
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
$ ./gene.bash
Gene to search for: JKLP
+++++++++++ gene: JKLP
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz
$ ./gene.bash
Gene to search for: STEV
+++++++++++ gene: STEV
main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz
$ ./gene.bash
Gene to search for: WXYZ
+++++++++++ gene: WXYZ
WARNING: no files found for gene = 'WXYZ'
$ ./gene.bash
Gene to search for: ZZZZ
+++++++++++ gene: ZZZZ
WARNING: no files found for gene = 'ZZZZ'
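The script above handles one gene per run. With >20000 entries in file1 it may be more convenient to map every gene in a single pass; one possible sketch hands the list of file names and the gene list to awk. The function name map_genes and its two arguments (gene list, directory) are assumptions made for illustration, as is the tab-separated "gene, file" output format.

```shell
#!/usr/bin/env bash
# map_genes (hypothetical helper): print "gene<TAB>file" for every *.vcf.gz
# file in directory $2 (default: current directory) whose chromosome matches
# and whose range overlaps a row of the gene list $1
map_genes() {
    local genes=$1 dir=${2:-.}
    ( cd "$dir" && printf '%s\n' *.vcf.gz ) |
    awk -v OFS='\t' '
        # first input (stdin): one file name per line
        FNR == NR {
            if (match($0, /_chr[^_]+_[0-9]+_[0-9]+_/)) {
                # drop the leading and trailing "_" and split chrX_start_end
                split(substr($0, RSTART + 1, RLENGTH - 2), p, "_")
                n++
                chrom[n] = p[1]; lo[n] = p[2] + 0; hi[n] = p[3] + 0; name[n] = $0
            }
            next
        }
        # second input: gene list with columns chrom, start, end, gene
        {
            for (i = 1; i <= n; i++)
                if (chrom[i] == $1 && $2 + 0 <= hi[i] && $3 + 0 >= lo[i])
                    print $4, name[i]
        }
    ' - "$genes"
}
```

Something like map_genes file1 . > gene_to_file.tsv would then produce a lookup table for the downstream analysis. Every gene row is compared against every parsed file name, which should still be manageable for roughly 20000 genes and 1000 files.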