Comparing two columns in a data frame for matches and from this creating a new data frame that contains the matches

Question

Could you help me out again?

I have a data frame with 4 columns, which are either gene symbols or the ranks I have assigned to those gene symbols, like this:

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  1        BIRCA    1        MYCN
[2]  2        MYCN     2        MOB4
[3]  3        ATXN1    3        ABHD17C
[4]  4        ABHD17C  4        AEBP2
5 etc... for up to 6000 rows in some data sets.
The ts columns are usually a lot longer than the mb columns.

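I want to arrange the data so that non-duplicates are removed, leaving only the genes that appear in both gene columns of the data frame, e.g.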

     mb_rank  mb_gene  ts_rank  ts_gene
[1]  2        MYCN     1        MYCN
[2]  4        ABHD17C  3        ABHD17C

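In this example of the desired result, the non-duplicated genes have been removed, leaving only the genes that appeared in both lists at the start.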

I have tried a number of things, such as:

`df[df$mb_gene %in% df$ts_gene,]`

but it doesn't work; it seems to hit some genes and miss others. I also tried to write an IF function, but my skills are limited.

I hope I have described this well enough, but please ask if I can clarify anything; I'm really stuck. Thanks in advance!

Answer 1

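In a data.frame, a row is usually one complete observation, meaning all the data in it is (somehow) tied to the rest of the row. In a survey, one row is either one person (all questions) or one question for one person. In your data here, however, BIRCA and MYCN in the first row are completely separate, which means you want to be able to remove one without removing the other. From a "data-science-y" point of view, that suggests to me that your data is not shaped correctly.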
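To make that concrete, here is a minimal sketch of what the attempt from the question does on the four sample rows. The name `attempt` is only for illustration. The filter keeps or drops whole rows based on mb_gene alone, so whatever ts_gene happens to sit on the same row comes along with it, which is probably why the result looks hit-and-miss.

```r
# Sample data from the question, rebuilt so this block runs on its own
df <- data.frame(
  mb_rank = 1:4,
  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
  ts_rank = 1:4,
  ts_gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
  stringsAsFactors = FALSE
)

# Keeps every row whose mb_gene occurs anywhere in ts_gene;
# the ts_* values on those rows are not checked at all
attempt <- df[df$mb_gene %in% df$ts_gene, ]
attempt
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       2    MOB4
# 4       4 ABHD17C       4   AEBP2
```

The mb columns are filtered correctly, but the ts columns still hold MOB4 and AEBP2, neither of which is a shared gene.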

To get what you are asking for, we need to split them into separate frames.

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

df1 <- df[,1:2]
df2 <- df[,3:4]
df1
#   mb_rank mb_gene
# 1       1   BIRCA
# 2       2    MYCN
# 3       3   ATXN1
# 4       4 ABHD17C
df2
#   ts_rank ts_gene
# 1       1    MYCN
# 2       2    MOB4
# 3       3 ABHD17C
# 4       4   AEBP2

From here, we can use intersect to find the genes in common:

incommon <- intersect(df1$mb_gene, df2$ts_gene)
df1[df1$mb_gene %in% incommon,]
#   mb_rank mb_gene
# 2       2    MYCN
# 4       4 ABHD17C
df2[df2$ts_gene %in% incommon,]
#   ts_rank ts_gene
# 1       1    MYCN
# 3       3 ABHD17C

If you are 100% certain that the two filtered frames will always have the same number of rows, then you can just cbind them together:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[df2$ts_gene %in% incommon,]
)
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       1    MYCN
# 4       4 ABHD17C       3 ABHD17C
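One caveat worth sketching: cbind pairs rows purely by position, so it only lines up here because the shared genes happen to appear in the same order in both lists. The example below uses a made-up ts ordering (not the data from the question) and assumes each gene occurs at most once per list. It uses match() to reorder the ts rows so that each one sits next to its mb counterpart. The names `sub1`, `sub2` and `aligned` are illustrative.

```r
df1 <- data.frame(mb_rank = 1:4,
                  mb_gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
                  stringsAsFactors = FALSE)
# Hypothetical ts list: shared genes deliberately in the opposite order
df2 <- data.frame(ts_rank = 1:4,
                  ts_gene = c("ABHD17C", "MOB4", "MYCN", "AEBP2"),
                  stringsAsFactors = FALSE)

incommon <- intersect(df1$mb_gene, df2$ts_gene)
sub1 <- df1[df1$mb_gene %in% incommon, ]

# For each gene in sub1, match() returns the row of df2 holding that gene
# (first occurrence only, hence the uniqueness assumption)
sub2 <- df2[match(sub1$mb_gene, df2$ts_gene), ]

aligned <- cbind(sub1, sub2)
aligned
#   mb_rank mb_gene ts_rank ts_gene
# 2       2    MYCN       3    MYCN
# 4       4 ABHD17C       1 ABHD17C
```

Because sub2 is built from sub1's genes, both pieces always have the same number of rows under that assumption.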

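If there is any chance that the row counts differ, though, you will run into problems. If one count is a multiple of the other, the shorter frame is "recycled" with a warning, but you still get data back (which I consider a mistake):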

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2
)
# Warning in data.frame(..., check.names = FALSE) :
#   row names were found from a short variable and have been discarded
#   mb_rank mb_gene ts_rank ts_gene
# 1       2    MYCN       1    MYCN
# 2       4 ABHD17C       2    MOB4
# 3       2    MYCN       3 ABHD17C
# 4       4 ABHD17C       4   AEBP2

If it is not a multiple, though, you simply get an error:

cbind(
  df1[df1$mb_gene %in% incommon,],
  df2[1:3,]
)
# Error in data.frame(..., check.names = FALSE) : 
#   arguments imply differing number of rows: 2, 3

I suggest you rethink this storage structure, because I believe it breaks the assumptions that some tools make about the rows of a frame.
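As one possible alternative layout (a sketch, not the only reasonable shape), the two lists could be stacked into a single "long" frame where each row really is one observation: one gene, its rank, and the list it came from. The column names `source`, `rank` and `gene` and the variables `long`, `n_sources`, `shared` and `shared_rows` are assumptions made for this example.

```r
# One row per (list, gene) pair; lists of different lengths stack without padding
long <- rbind(
  data.frame(source = "mb", rank = 1:4,
             gene = c("BIRCA", "MYCN", "ATXN1", "ABHD17C"),
             stringsAsFactors = FALSE),
  data.frame(source = "ts", rank = 1:4,
             gene = c("MYCN", "MOB4", "ABHD17C", "AEBP2"),
             stringsAsFactors = FALSE)
)

# Count how many distinct lists each gene appears in
n_sources <- tapply(long$source, long$gene, function(s) length(unique(s)))
shared <- names(n_sources)[n_sources == 2]

# Keep only the rows for genes present in both lists
shared_rows <- long[long$gene %in% shared, ]
shared_rows
#   source rank    gene
# 2     mb    2    MYCN
# 4     mb    4 ABHD17C
# 5     ts    1    MYCN
# 7     ts    3 ABHD17C
```

In this shape, dropping a gene from one list never touches an unrelated gene from the other.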

Answer 2

Use the following, where df_new is your new data frame:

df_new = df[df['mb_gene'] == df['ts_gene']]

Answer 3

Without more details it is hard to know the edge cases. Either way, this sounds like a relational table join. Have you tried:

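library(dplyr)  # select() comes from dplyr
# Split the wide frame into one frame per gene list
d1 = select(df, c(mb_rank, mb_gene))
d2 = select(df, c(ts_rank, ts_gene))
# Inner join on gene symbol: keeps only genes present in both lists
merge(d1, d2, by.x="mb_gene", by.y="ts_gene")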
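For reference, here is a base-R-only sketch of the same join on the sample data, with the output it produces and one way to get back to the four-column layout from the question. The name `joined` and the final reordering step are illustrative additions, not part of the suggestion above.

```r
df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
mb_rank  mb_gene  ts_rank  ts_gene
1        BIRCA    1        MYCN
2        MYCN     2        MOB4
3        ATXN1    3        ABHD17C
4        ABHD17C  4        AEBP2")

# Same split as above, using plain column indexing instead of select()
d1 <- df[, c("mb_rank", "mb_gene")]
d2 <- df[, c("ts_rank", "ts_gene")]

# merge() defaults to an inner join and sorts the result by the key column
joined <- merge(d1, d2, by.x = "mb_gene", by.y = "ts_gene")
joined
#   mb_gene mb_rank ts_rank
# 1 ABHD17C       4       3
# 2    MYCN       2       1

# Rebuild the layout from the question: duplicate the key column,
# order by mb_rank, and put the columns back in their original order
joined$ts_gene <- joined$mb_gene
joined <- joined[order(joined$mb_rank),
                 c("mb_rank", "mb_gene", "ts_rank", "ts_gene")]
```

Unlike the cbind approach, the join matches rows by gene symbol, so it does not depend on the two lists having the same length or order.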
