Apply different array to each column and print results from each array programmatically?

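I have a table with a lot of columns (fields). In the first field, I only need to keep the unique values. In each subsequent column, I need to count how many rows each first-column value originally had, but only counting rows where the value in that column is > 0.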

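I have managed to do this partially with awk, but my current attempt requires me to manually create an array for each column in the table and to manually type out each array in the print command. That is not really feasible.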

Any help/suggestions (and an explanation of how a potential solution works) would be greatly appreciated.

Here is a subset of the INPUT TABLE (already sorted on column 1):

ATP6          93.883156   55.84006
COX1          230.708456  63.109
COX2          179.993226  74.224269
COX3          169.945901  72.036519
CYTB          228.799722  87.575892
LOC111099029  0.926958    6.124982
LOC111099030  10.124096   5.024844
LOC111099031  0           0
LOC111099031  0           0
LOC111099031  2.279801    2.289838
LOC111099032  17.674714   12.796428
LOC111099033  5.259716    7.326938
LOC111099034  3.514635    2.858349
LOC111099035  0           0
LOC111099035  1.929607    4.409107
LOC111099036  0           0
LOC111099036  1.45196     7.58513
LOC111099037  21.520663   26.353308
LOC111099038  6.019084    5.311657
LOC111099039  12.858404   13.689644
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0           0
LOC111099040  0.354202    0.265986
LOC111099040  0.587969    0
LOC111099040  2.620288    1.077892
LOC111099040  4.290659    3.487692
LOC111099040  6.42671     6.906503
LOC111099041  0           0
LOC111099041  3.892818    4.934959
LOC111099042  0           0
LOC111099042  13.86859    14.319505
LOC111099043  0           0

Here is an example of the desired output:

LOC111099030  1  1
CYTB          1  1
LOC111099042  1  1
LOC111099037  1  1
LOC111099033  1  1
COX3          1  1
ATP6          1  1
LOC111099039  1  1
LOC111099036  1  1
LOC111099040  5  4
LOC111099035  1  1
LOC111099032  1  1
COX2          1  1
LOC111099038  1  1
LOC111099031  1  1
COX1          1  1
LOC111099029  1  1
LOC111099041  1  1
LOC111099034  1  1

Here is the code I ran to obtain the output above:

awk '{if ($2 > 0) gene_name[$1]++}; {if ($3 > 0) col3_arr[$1]++}; END{ for (var in gene_name) print var, "\t", gene_name[var], col3_arr[var]}' input_file.txt

P.S. I am also open to a solution in R, since this operation is part of a larger R Markdown notebook. I went the awk route because I am not particularly proficient in R.

In R, with dplyr:

library(dplyr)
desired_result = your_data %>%
  group_by(name_of_first_column) %>%
  summarize(across(everything(), ~sum(. > 0)))

In base R, we can use rowsum:

rowsum(+(df1[-1] > 0), df1[[1]])
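
For the awk route from the question, here is a minimal sketch that avoids one array per column by using a single array keyed on (gene, column index) and looping over the fields. It assumes whitespace-separated input with the key in field 1; the names `count_positive` and `sample_input.txt` are hypothetical, and the sample rows are copied from the input table above.

```shell
# Hypothetical sample file; the rows are taken from the input table above.
cat > sample_input.txt <<'EOF'
COX1          230.708456  63.109
LOC111099031  0           0
LOC111099031  2.279801    2.289838
LOC111099040  0.354202    0.265986
LOC111099040  0.587969    0
LOC111099043  0           0
EOF

# count_positive FILE: for each distinct value in field 1, count the rows
# in which each later field is > 0. Works for any number of columns.
count_positive() {
  awk '
    {
      # Remember each key the first time it appears, to keep input order.
      if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
      # Track the widest row so the END block knows how many columns to print.
      if (NF > maxnf) maxnf = NF
      # One counter per (key, column) pair instead of one array per column.
      for (i = 2; i <= NF; i++)
        if ($i + 0 > 0) count[$1, i]++
    }
    END {
      for (k = 1; k <= n; k++) {
        line = order[k]
        for (i = 2; i <= maxnf; i++)
          line = line "\t" (count[order[k], i] + 0)   # + 0 prints unset counters as 0
        print line
      }
    }
  ' "$1"
}

count_positive sample_input.txt
```

Unlike the `for (var in gene_name)` loop in the question, which visits keys in an unspecified order and only sees genes with at least one positive value in column 2, this sketch prints every gene in first-seen order, so a gene such as LOC111099043 appears with zeros. If such rows are unwanted, they could be filtered out afterwards.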