How to cat similarly named sequence files from different directories into a single large fasta file

Question

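I am trying to do the following. I have about 40 directories, one per species, and each contains hundreds of sequence files with orthologous sequences. The sequence files are named the same way in every species directory. I want to concatenate the identically named files from the 40 species directories into a single sequence file with that same name.

My data looks like this, for example: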

directories: Species1 Species2 Species3 
  Within directory (similar for all): sequenceA.fasta sequenceB.fasta sequenceC.fasta

I want to get single files named: sequenceA.fasta sequenceB.fasta sequenceC.fasta 
where the content of the different files from the different species is concatenated.

I tried to solve this with a loop (but it never finished for me!):

ls . | while read FILE; do cat ./*/"$FILE" >> ./final/"$FILE"; done

This resulted in empty files and errors. I did try to find a solution elsewhere, e.g. (https://www.unix.com/unix-for-dummies-questions-and-answers/249952-cat-multiple-files-according-file-name.html, https://unix.stackexchange.com/questions/424204/how-to-combine-multiple-files-with-similar-names-in-different-folders-by-using-u), but I could not adapt them to my case.

Can anyone help me? Thanks!

Answer 1

From the root directory that contains your species directories, run the following:

$ mkdir output
$ find Species* -type f -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;

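This walks the species directories recursively and appends the contents of every file with the same basename to a file of that name in the output directory.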
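To see what the command does before running it on real data, here is a minimal sketch on a toy layout. The directory names, file names, and sequences are made up for the demo, and the path is passed to `sh` as `$1` (an adjustment assumed here so that names containing spaces survive).

```shell
# Build a throwaway layout in a scratch directory (all names are hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir Species1 Species2 output
printf '>sp1_A\nACGT\n' > Species1/sequenceA.fasta
printf '>sp2_A\nTTGA\n' > Species2/sequenceA.fasta
printf '>sp1_B\nGGCC\n' > Species1/sequenceB.fasta

# Append each *.fasta file to output/<its basename>.
find Species* -type f -name "*.fasta" \
    -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;

# Count the fasta headers that ended up in each merged file.
grep -c '>' output/sequenceA.fasta output/sequenceB.fasta
```

With this toy data, output/sequenceA.fasta should hold two records and output/sequenceB.fasta one. The order in which the species are appended follows the order in which find visits them, so do not rely on it. Because the command appends with `>>`, empty or delete output before running it a second time.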

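Edit: Although this is the accepted answer, the OP noted in the comments that the actual directories do not match the common pattern Species* shown in the original question. In that case, you can use: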

$ find . -type f -not -path "./output/*" -name "*.fasta" -exec sh -c 'cat "$1" >> "output/$(basename "$1")"' sh {} \;

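This way we do not specify a search pattern. Instead we explicitly exclude the output directory, so that data already written there is not appended again.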
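If you would rather keep a plain loop like the one in the question, the same idea can be sketched without find. The loop in the question most likely failed because `ls .` lists the species directories rather than the fasta file names, so `./*/"$FILE"` matched nothing useful. The sketch below globs over the fasta files themselves. It assumes the files sit exactly one level below the current directory, and the toy names are invented for the demo.

```shell
# Toy layout in a scratch directory (hypothetical names, one with a space).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir "Homo sapiens" Mus_musculus
printf '>hs_A\nACGT\n' > "Homo sapiens/sequenceA.fasta"
printf '>mm_A\nTTGA\n' > Mus_musculus/sequenceA.fasta
printf '>mm_B\nGGCC\n' > Mus_musculus/sequenceB.fasta

mkdir -p output
for f in ./*/*.fasta; do
    # Skip anything already in output/ so a merged file is never read back in.
    case "$f" in ./output/*) continue ;; esac
    cat "$f" >> "output/$(basename "$f")"
done

# Count the fasta headers in each merged file.
grep -c '>' output/sequenceA.fasta output/sequenceB.fasta
```

Unlike find, the glob `./*/*.fasta` does not descend into deeper subdirectories, so use the find version if your species directories are nested.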


unix

loops

concat