R: reading and manipulating a strangely formatted file

Question

I have a somewhat strangely formatted file that looks like this:

Cluster 1       Score:3.96  
Category        Term        Count
GOTERM_BP_FAT   GO:0006412  34
KEGG_PATHWAY    hsa00970    9
GOTERM_BP_FAT   GO:0043038  9
GOTERM_BP_FAT   GO:0043039  9

Cluster 2       Score:3.94  
Category        Term        Count
GOTERM_BP_FAT   GO:0006414  21
KEGG_PATHWAY    hsa03010    20
GOTERM_BP_FAT   GO:0034660  16
GOTERM_BP_FAT   GO:0006399  11
GOTERM_BP_FAT   GO:0042254  10
GOTERM_BP_FAT   GO:0022613  12

... and so on, with more "sub-data frames" (including the blank line in between) and additional columns (omitted here) on the rows after each Cluster X row.

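What I want to do is somehow read in each separate cluster, get it as a data frame (i.e. with the names Category, Term, Count), do some manipulation on the data frame (mostly adding columns based on calculations), and then write both the manipulated data frame and the Cluster X row to a new file in exactly the same format it started in.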

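I've racked my brain for a clever way to do this, but I haven't come up with anything other than reading each line separately and doing different things depending on the type of line, like this: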

con  <- file('test.txt', open="r")

# Read the file line by line
while ( length(currentLine <- readLines(con, n=1, warn=FALSE)) > 0 ) {
  line = strsplit(currentLine, '\t')[[1]]

  # Save previous data, initiate new cluster name/score
  if ( grepl('Cluster', line[1]) ) {

    # Save previous data if available
    if ( exists('currentData') ) {
      ## save the current data somehow
    }

    # Initiate new
    clusterInfo = line
  } 
  # Initiate new, empty data frame
  else if ( grepl('Category', line[1]) ) {
    currentData = data.frame(t(rep(NA, length(line))))
    names(currentData) = line
  } 
  # Add data to data frame
  else if ( grepl('GOTERM', line[1]) || grepl('KEGG', line[1]) ) {
    currentData = rbind(currentData, line)

    # Drop the all-NA placeholder row once the first data row has been added
    if ( nrow(currentData) == 2 ) {
      currentData = na.omit(currentData)
    }
  }
}

close(con)

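The above is obviously unfinished (I don't know how to save clusterInfo and currentData in the same format), but I hope it gets the idea across. I don't much like this approach, though... it seems strange to me to build data frames line by line like this, and to try to save the data at the same moment you start the next chunk of data.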

Is there a better way?

Answer 1

You can try read.mtable from my GitHub-only "SOfun" package.

Usage would be something like:

library(SOfun)
read.mtable(x, "Cluster", header = TRUE) ## Replace "x" with your file name
# $`Cluster 1       Score:3.96`
#        Category       Term Count
# 1 GOTERM_BP_FAT GO:0006412    34
# 2  KEGG_PATHWAY   hsa00970     9
# 3 GOTERM_BP_FAT GO:0043038     9
# 4 GOTERM_BP_FAT GO:0043039     9
# 
# $`Cluster 2       Score:3.94`
#        Category       Term Count
# 1 GOTERM_BP_FAT GO:0006414    21
# 2  KEGG_PATHWAY   hsa03010    20
# 3 GOTERM_BP_FAT GO:0034660    16
# 4 GOTERM_BP_FAT GO:0006399    11
# 5 GOTERM_BP_FAT GO:0042254    10
# 6 GOTERM_BP_FAT GO:0022613    12

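As you can see, the "cluster" information is retained as the list names. You can therefore proceed with lapply for whatever calculations you need, and then rewrite the data in whatever form you need.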
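A minimal sketch of that lapply step and the write-back, using base R only. It assumes a named list of data frames shaped like the output above; the list `clusters` is built by hand here as a stand-in, and the `Fraction` column and the helper name `write_clusters` are made up for illustration. Tab is assumed as the separator, which may not match the original file.

```r
# Hypothetical stand-in for the list shown above: names are the cluster
# header lines, values are the per-cluster data frames.
clusters <- list(
  "Cluster 1\tScore:3.96" = data.frame(
    Category = c("GOTERM_BP_FAT", "KEGG_PATHWAY"),
    Term     = c("GO:0006412", "hsa00970"),
    Count    = c(34L, 9L),
    stringsAsFactors = FALSE
  ),
  "Cluster 2\tScore:3.94" = data.frame(
    Category = c("GOTERM_BP_FAT", "KEGG_PATHWAY"),
    Term     = c("GO:0006414", "hsa03010"),
    Count    = c(21L, 20L),
    stringsAsFactors = FALSE
  )
)

# Example manipulation (illustrative only): add a computed column.
clusters <- lapply(clusters, function(d) {
  d$Fraction <- round(d$Count / sum(d$Count), 3)
  d
})

# Write each cluster as: header line, column names, data rows, blank line.
write_clusters <- function(lst, path, sep = "\t") {
  con <- file(path, open = "w")
  on.exit(close(con))
  for (nm in names(lst)) {
    writeLines(nm, con)
    write.table(lst[[nm]], con, sep = sep, quote = FALSE, row.names = FALSE)
    writeLines("", con)
  }
  invisible(path)
}

out <- tempfile(fileext = ".txt")
write_clusters(clusters, out)
```

Because the header line is stored verbatim as the list name, it is written back unchanged; only the table part is regenerated by write.table.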

Reproducible sample data:

x <- tempfile()

writeLines("Cluster 1       Score:3.96  
Category        Term        Count
GOTERM_BP_FAT   GO:0006412  34
KEGG_PATHWAY    hsa00970    9
GOTERM_BP_FAT   GO:0043038  9
GOTERM_BP_FAT   GO:0043039  9

Cluster 2       Score:3.94  
Category        Term        Count
GOTERM_BP_FAT   GO:0006414  21
KEGG_PATHWAY    hsa03010    20
GOTERM_BP_FAT   GO:0034660  16
GOTERM_BP_FAT   GO:0006399  11
GOTERM_BP_FAT   GO:0042254  10
GOTERM_BP_FAT   GO:0022613  12", con = x, sep = "\n")

Answer 2

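You can read the file with readLines and split the lines using a numeric index ('indx') built from the lines that start with 'Cluster'. Read each list element with read.table, create two new columns ('Cluster' and 'Score'), and rbind the list elements into a single dataset.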

lines <- readLines('Clusterfile.txt')
indx <- cumsum(grepl('^Cluster', lines))
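res <- do.call(rbind, lapply(split(lines, indx), function(x) {
  # Everything after the first line of a block: column header plus data rows
  d1 <- read.table(text = x[-1], header = TRUE, stringsAsFactors = FALSE)
  # First line of a block: keep only the cluster number and the score
  d2 <- read.table(text = gsub('[^0-9.]+', ' ', x[1]),
                   col.names = c('Cluster', 'Score'))
  cbind(d1, d2)
}))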

row.names(res) <- NULL
head(res,3)
#       Category       Term Count Cluster Score
#1 GOTERM_BP_FAT GO:0006412    34       1  3.96
#2  KEGG_PATHWAY   hsa00970     9       1  3.96
#3 GOTERM_BP_FAT GO:0043038     9       1  3.96
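To make the grouping index concrete, and to get from the combined data frame back to the block layout, here is a rough sketch. The in-memory `lines` vector is a shortened stand-in for the file, `out_lines` is a name introduced here, and a tab separator is assumed. Note that the header line is rebuilt from the numeric columns, so a score such as 3.90 would come back as 3.9.

```r
# In-memory stand-in for the file contents (shortened).
lines <- c(
  "Cluster 1       Score:3.96",
  "Category        Term        Count",
  "GOTERM_BP_FAT   GO:0006412  34",
  "KEGG_PATHWAY    hsa00970    9",
  "",
  "Cluster 2       Score:3.94",
  "Category        Term        Count",
  "GOTERM_BP_FAT   GO:0006414  21"
)

# Every 'Cluster' line bumps the running count, so all lines of a block
# (including the trailing blank line) share that block's number.
indx <- cumsum(grepl("^Cluster", lines))
# indx is 1 1 1 1 1 2 2 2

# Same parsing as in the answer, applied to the in-memory lines.
res <- do.call(rbind, lapply(split(lines, indx), function(x) {
  d1 <- read.table(text = x[-1], header = TRUE, stringsAsFactors = FALSE)
  d2 <- read.table(text = gsub("[^0-9.]+", " ", x[1]),
                   col.names = c("Cluster", "Score"))
  cbind(d1, d2)
}))
row.names(res) <- NULL

# Rebuild the block layout: header line, column names, rows, blank line.
out_lines <- unlist(lapply(split(res, res$Cluster), function(d) {
  header <- sprintf("Cluster %s\tScore:%s", d$Cluster[1], d$Score[1])
  body <- d[, setdiff(names(d), c("Cluster", "Score")), drop = FALSE]
  rows <- do.call(paste, c(body, sep = "\t"))
  c(header, paste(names(body), collapse = "\t"), rows, "")
}), use.names = FALSE)

writeLines(out_lines)
```

Passing a file path as the second argument to writeLines would send the same lines to a new file instead of the console.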
