Get an R dataframe with merged values from multiple MySQL tables

I have a MySQL database containing many large tables in the following format:

mysql> select * from Table1 limit 2;
+-------+----------+-------------+
| chrom | site     | methylation |
+-------+----------+-------------+
|     1 | 10003581 |          76 |
|     1 | 10003584 |           0 |
+-------+----------+-------------+

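I want to create one large merged table in R that contains every site covered by any table, along with the methylation value from each table. For example, if I have 4 MySQL tables, the R data frame would contain the following columns: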

chrom    site    table1    table2    table3    table4

So far I have:

library(RMySQL)
library(plyr)  # assumption: rename() below is plyr's, which takes c("old" = "new")

#Open database
mydb = dbConnect(MySQL(), user='root', password='', dbname='DataBase')

#Create function to get values
GetVal <- function(TableName, ColumnName){
  rs = dbSendQuery(mydb, paste("SELECT chrom, site, methylation FROM ", TableName))
  data = fetch(rs, n=-1)
  res <- rename(data, c("chrom" = "Chr", "site" = "start", "methylation" = ColumnName))
  return(res)
}

Table1 <- GetVal("Table1", "Table1")
Table2 <- GetVal("Table2", "Table2")
Table3 <- GetVal("Table3", "Table3")
Table4 <- GetVal("Table4", "Table4")

I would then merge all the tables together. However, I think there should be a faster, more efficient way to do this.
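For reference, a minimal sketch of that merge step. The small data frames below are made-up stand-ins for the results of GetVal(), and the three-table setup is an assumption for illustration only.

```r
## Toy data frames standing in for the results of GetVal() (made-up values)
Table1 <- data.frame(Chr = c(1, 1), start = c(10003581, 10003584), Table1 = c(76, 0))
Table2 <- data.frame(Chr = c(1, 1), start = c(10003584, 10003590), Table2 = c(12, 50))
Table3 <- data.frame(Chr = 1, start = 10003581, Table3 = 33)

## Full outer merge on Chr and start, applied pairwise across the list;
## all = TRUE keeps sites that are missing from some tables (filled with NA)
merged <- Reduce(function(x, y) merge(x, y, by = c("Chr", "start"), all = TRUE),
                 list(Table1, Table2, Table3))
print(merged)
```

Every site that appears in at least one table ends up as a row, which is the result I am after, but each table has to be pulled into R in full first.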

Try this:

## Create chrom_sites holding every unique (chrom, site) pair across the tables
dbSendQuery(mydb, 'create table chrom_sites as
select distinct chrom,site from table1
union
select distinct chrom,site from table2
union
select distinct chrom,site from table3
union
select distinct chrom,site from table4
union
select distinct chrom,site from table5')

## dbGetQuery returns a data frame directly.
## LEFT JOIN keeps sites that are missing from some tables (NULL, i.e. NA in R).
x <- dbGetQuery(mydb, 'select a.chrom,
a.site,
t1.methylation as table1,
t2.methylation as table2,
t3.methylation as table3,
t4.methylation as table4,
t5.methylation as table5
from chrom_sites as a
left join table1 as t1 on a.chrom = t1.chrom and a.site = t1.site
left join table2 as t2 on a.chrom = t2.chrom and a.site = t2.site
left join table3 as t3 on a.chrom = t3.chrom and a.site = t3.site
left join table4 as t4 on a.chrom = t4.chrom and a.site = t4.site
left join table5 as t5 on a.chrom = t5.chrom and a.site = t5.site')

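What this should do is create a chrom_sites table in MySQL containing the unique combinations of chrom and site.

It then uses that table as the starting point and fills in the table (data frame) the way you want.

There may be a better way to do the first part, but I'm not sure. If you have a lot of tables, it probably makes sense to write a function to do this.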
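As a rough sketch of such a function: the helper below only builds the two SQL strings for an arbitrary vector of table names. BuildQueries and its key_table argument are hypothetical names introduced here, and the table names are pasted into the SQL unescaped, so they are assumed to be trusted identifiers.

```r
## Hypothetical helper: builds the two SQL statements for any number of tables.
## Table names are assumed to be trusted (they are not escaped).
BuildQueries <- function(tables, key_table = "chrom_sites") {
  ## UNION already removes duplicate rows, so DISTINCT is not needed
  selects <- paste0("SELECT chrom, site FROM ", tables)
  create_sql <- paste0("CREATE TABLE ", key_table, " AS ",
                       paste(selects, collapse = " UNION "))

  ## One methylation column per table, named after the table
  cols <- paste0(tables, ".methylation AS ", tables, collapse = ", ")
  ## LEFT JOIN keeps every site; tables lacking a site contribute NULL
  joins <- paste0(" LEFT JOIN ", tables, " ON a.chrom = ", tables,
                  ".chrom AND a.site = ", tables, ".site", collapse = "")
  select_sql <- paste0("SELECT a.chrom, a.site, ", cols,
                       " FROM ", key_table, " AS a", joins)
  list(create = create_sql, select = select_sql)
}

q <- BuildQueries(c("Table1", "Table2"))
cat(q$create, "\n")
cat(q$select, "\n")
```

The strings could then be run against the open connection, for example by passing q$create to dbSendQuery and q$select to dbGetQuery. I have not tested this against a real database.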

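This is more general, assuming the number of tables you want to process is variable. It also renames the columns the way you wanted in your original function: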

library(RMySQL)

##  Open database:
mydb = dbConnect(MySQL(), user='root', password='', dbname='DataBase')

##  Create function to get values:
GetVals <- function(TableNames) {
    first <- TableNames[1]
    ##  Key columns come from the first table, renamed as in the question:
    query <- paste0("SELECT ", first, ".chrom AS Chr, ", first, ".site AS start, ")
    ##  One methylation column per table, named after the table:
    query <- paste0(query, paste0(TableNames, ".methylation AS ", TableNames, collapse=", "))
    ##  Inner JOIN: only sites present in every table are returned:
    query <- paste0(query, " FROM ", first, paste0(" JOIN ", TableNames[-1], " ON ", first, ".chrom=", TableNames[-1], ".chrom AND ", first, ".site=", TableNames[-1], ".site", collapse=""))

  rs <- dbSendQuery(mydb, query)
  data <- fetch(rs, n=-1)
  dbClearResult(rs)  # free the result set once all rows are fetched
  return(data)
}

Tables <- c("Table1", "Table2", "Table3", "Table4")

my_data <- GetVals(Tables)

This is the query generated for the Tables variable above:

> query
[1] "SELECT Table1.chrom AS Chr, Table1.site AS start, Table1.methylation AS Table1, Table2.methylation AS Table2, Table3.methylation AS Table3, Table4.methylation AS Table4 FROM Table1 JOIN Table2 ON Table1.chrom=Table2.chrom AND Table1.site=Table2.site JOIN Table3 ON Table1.chrom=Table3.chrom AND Table1.site=Table3.site JOIN Table4 ON Table1.chrom=Table4.chrom AND Table1.site=Table4.site"