Get frequency based on two columns
A snippet of my large data frame looks like this:
MARKERS.IN.HAPLOTYPES BASE rs. alleles chrom pos GID marker trial
1A.12 C S1A_494392059 C/G 1A 494392059 GID7173723 2 ES26-38
1A.13 C S1A_497201550 C/T 1A 497201550 GID7173723 0 ES26-38
1A.14 T S1A_499864157 C/T 1A 499864157 GID7173723 2 ES26-38
1B.10 A S1B_566171302 G/A 1B 566171302 GID7173723 0 ES26-38
1B.20 G S1B_642616640 A/G 1B 642616640 GID7173723 2 ES26-38
2B.10 A S2B_24883552 A/G 2B 24883552 GID7173723 2 ES26-38
Here is the dput of that snippet:
structure(list(MARKERS.IN.HAPLOTYPES = c("1A.12", "1A.13", "1A.14",
"1B.10", "1B.20", "2B.10"), BASE = c("C", "C", "T", "A", "G",
"A"), rs. = c("S1A_494392059", "S1A_497201550", "S1A_499864157",
"S1B_566171302", "S1B_642616640", "S2B_24883552"), alleles = c("C/G",
"C/T", "C/T", "G/A", "A/G", "A/G"), chrom = c("1A", "1A", "1A",
"1B", "1B", "2B"), pos = c(494392059L, 497201550L, 499864157L,
566171302L, 642616640L, 24883552L), GID = c("GID7173723", "GID7173723",
"GID7173723", "GID7173723", "GID7173723", "GID7173723"), marker = c("2",
"0", "2", "0", "2", "2"), trial = c("ES26-38", "ES26-38", "ES26-38",
"ES26-38", "ES26-38", "ES26-38")), row.names = c(NA, 6L), class =
"data.frame")
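In the original data frame, the rs. column has 22 unique values and the trial column has six unique values. I want to compute the relative frequency of the different values of the marker column for each unique rs. and each unique trial. For example, for the first rs. entry, S1A_494392059, I want the frequencies of the marker values within trial ES26-38, and so on. Note that the marker column is a character vector, not numeric.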
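You can try this:
library(dplyr)

df %>%
  # number of rows per rs./trial combination
  add_count(rs., trial, name = "Total") %>%
  # number of rows per rs./trial/marker combination
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  # share of each marker value within its rs./trial group
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))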
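The name argument of add_count is new as of dplyr 0.8; it lets you choose the name of the count column (previously it defaulted to n or nn). If you have not updated the package, the code above will not run.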
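If updating is not an option, one possible workaround is to let add_count use its default column name n and rename it straight away. This is only a sketch: the four-row df below is made-up toy data, and res_old and has_name_arg are names I picked for illustration.

```r
library(dplyr)

# Toy data, invented for illustration (not the asker's data)
df <- data.frame(
  rs.    = c("S1", "S1", "S1", "S2"),
  trial  = c("T1", "T1", "T1", "T1"),
  marker = c("2", "2", "0", "0"),
  stringsAsFactors = FALSE
)

# TRUE if the installed dplyr already supports `name =`
has_name_arg <- packageVersion("dplyr") >= "0.8.0"

# Version-independent variant: count, then rename the default column `n`
res_old <- df %>%
  add_count(rs., trial) %>%
  rename(Total = n) %>%
  add_count(rs., trial, marker) %>%
  rename(MarkerTotal = n) %>%
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))
```

Renaming right after each count keeps the second add_count from colliding with an existing n column.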
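On your six-row sample the result is not particularly interesting, though: every rs./trial combination occurs only once, so all the relative frequencies are 1.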
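To see frequencies other than 1, here is a small sketch on made-up data in which some rs./trial combinations repeat. The toy data frame, its shortened rs. values and the second trial name ES27-01 are all hypothetical; only the column names come from the question.

```r
library(dplyr)

# Hypothetical data: the same rs./trial pair appears on several rows
toy <- data.frame(
  rs.    = c("S1A_1", "S1A_1", "S1A_1", "S1A_1", "S1B_2", "S1B_2"),
  trial  = c("ES26-38", "ES26-38", "ES26-38", "ES27-01", "ES26-38", "ES26-38"),
  marker = c("2", "2", "0", "2", "0", "0"),
  stringsAsFactors = FALSE
)

toy_freq <- toy %>%
  add_count(rs., trial, name = "Total") %>%
  add_count(rs., trial, marker, name = "MarkerTotal") %>%
  mutate(RelativeFreq = round(MarkerTotal / Total, 2))

toy_freq$RelativeFreq
# 0.67 0.67 0.33 1.00 1.00 1.00
```

Within S1A_1 / ES26-38, marker "2" covers two of three rows (0.67) and marker "0" one of three (0.33); the other two groups contain a single marker value, so they stay at 1.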
If you want a summarised data frame instead (one row per rs./trial/marker combination, with its count and RelativeFreq), you can do this:
df %>%
  count(rs., trial, marker, name = "MarkerTotal") %>%
  group_by(rs., trial) %>%
  mutate(RelativeFreq = round(MarkerTotal / sum(MarkerTotal), 2)) %>%
  ungroup()
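As a cross-check that does not depend on dplyr, the same per-group proportions can be sketched in base R with table and prop.table. This is an alternative technique, not the approach used above, and the toy data (including the trial name ES27-01) is made up for illustration.

```r
# Hypothetical data with repeated rs./trial pairs
toy <- data.frame(
  rs.    = c("S1A_1", "S1A_1", "S1A_1", "S1A_1", "S1B_2", "S1B_2"),
  trial  = c("ES26-38", "ES26-38", "ES26-38", "ES27-01", "ES26-38", "ES26-38"),
  marker = c("2", "2", "0", "2", "0", "0"),
  stringsAsFactors = FALSE
)

# One row per rs./trial combination, one column per marker value
counts <- table(paste(toy$rs., toy$trial, sep = " | "), toy$marker)

# margin = 1 divides each row by its own total, giving within-group shares
freqs <- round(prop.table(counts, margin = 1), 2)
freqs
```

Because marker is a character vector, table treats its values as categories, so no conversion to numeric is needed. The result is a wide table (groups by marker values) rather than the long data frame produced by dplyr.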