Reading a list of data frames from a comma-separated text file with comments, automated

Note: three updates have been made. Ideas are welcome.

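I have multiple text (.txt) files that are essentially lists of data frames, each holding the data for one of several specimens.
Each specimen's block of data begins with a pair of quotation marks ("") followed by a series of comments given as comma-separated strings ("string").
I need to separate out the block of data for each specimen, split it into individual columns, and add new columns holding the information provided in the comments.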

The values I want to extract are the file name "file2.ext" and the specimen number that follows the comment "Specimen Number".

A sample of the data follows.

    ""
    "Test Method","file1.ext"
    "Sample I. D.","file2.ext"
    "Specimen Number","1"

    "A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"

    0.744,0.300,-0.046,0.197,-0.004
    0.903,0.400,0.038,0.239,0.003
    1.096,0.500,0.123,0.290,0.011
    1.314,0.600,0.207,0.348,0.018
    1.532,0.700,0.289,0.406,0.025
    1.776,0.800,0.373,0.471,0.033
    2.029,0.900,0.457,0.538,0.040
    2.282,1.000,0.541,0.605,0.047
    2.533,1.100,0.623,0.671,0.054
    2.783,1.200,0.707,0.738,0.062
    3.044,1.300,0.792,0.807,0.069
    3.319,1.400,0.876,0.880,0.076
    3.587,1.500,0.958,0.951,0.084


    ""
    "Test Method","file1.ext"
    "Sample I. D.","file2.ext"
    "Specimen Number","2"

    "A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"

    0.755,0.300,-0.055,0.218,-0.005
    0.918,0.400,0.030,0.265,0.003
    1.137,0.500,0.114,0.328,0.010
    1.377,0.600,0.198,0.397,0.017
    1.626,0.700,0.282,0.469,0.024
    1.874,0.800,0.365,0.541,0.031
    2.136,0.900,0.450,0.616,0.038
    2.400,1.000,0.533,0.692,0.045
    2.667,1.100,0.615,0.770,0.051
    2.935,1.200,0.699,0.847,0.058
    3.221,1.300,0.784,0.930,0.066
    3.505,1.400,0.867,1.011,0.072
    3.804,1.500,0.949,1.098,0.079

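I have been able to build a usable data frame, but I need to know two things: 1 -- Is there a simpler way to read the file that gives me a list? 2 -- How do I automate reading the files and building the final data frame, including the new columns?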

  1. Reading the text file into R with scan() produces a character vector that includes all of the comments.

    character.vector <- scan("file_name.txt", "")
    
  2. Find where the comment "Test Method" occurs in 'character.vector' with grep() to identify each specimen.

    specimen.vector <- grep(pattern = "Test Method", character.vector)
    > specimen.vector
    [1]    1  382  764 1146 1528 1910 2292 2674 3056 3438 3820 4202
    
  3. Determine the indices needed to subset 'character.vector' and build a new data frame for a single specimen.

    > specimen.start.at <- specimen.vector + 24
    > specimen.start.at
    [1]   25  406  788 1170 1552 1934 2316 2698 3080 3462 3844 4226
    
    > specimen.stop.at <- specimen.vector + 381
    > specimen.stop.at
    [1]  382  763 1145 1527 1909 2291 2673 3055 3437 3819 4201 4583
    

There are 12 specimens, with indices given by the vectors 'specimen.start.at' and 'specimen.stop.at'. For example, the data for specimen 1 (excluding comments) span 25:382 in 'character.vector'.

  4. I have not figured out how to extract the data for each specimen automatically, so I entered the indices by hand as follows.

    start <- specimen.start.at[specimen_number]
    finish <- specimen.stop.at[specimen_number]
    specimen.dataframe <- character.vector[start:finish] %>% strsplit(split = ",", fixed = TRUE) %>% ldply %>% tbl_df
    

The output for each specimen is a data frame with 5 unlabeled columns.

          V1    V2    V3    V4    V5
    1  1.073 0.400 0.215 0.198 0.022
    2  1.315 0.500 0.299 0.242 0.031
    3  1.562 0.600 0.382 0.288 0.040
    4  1.840 0.700 0.466 0.339 0.049
    5  2.135 0.800 0.550 0.393 0.058
    6  2.438 0.900 0.634 0.449 0.066
    7  2.740 1.000 0.716 0.505 0.075
    8  3.046 1.100 0.800 0.561 0.084
    9  3.349 1.200 0.884 0.617 0.092
    10 3.660 1.300 0.969 0.674 0.101
    ..   ...   ...   ...   ...   ...
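
On the first question, one possibly simpler route than scan() is readLines(), which keeps each line of the file whole, so the numeric rows never have to be re-split by hand. The sketch below is only an illustration in base R. It assumes that every specimen block contains a "Test Method" line and that numeric rows begin with a digit or a minus sign, as in the sample; the names `is.data`, `is.marker`, `all.data`, `block`, and `by.block` are made up for the example.

```r
# Lines as readLines() would return them for a two-specimen file
# (values copied from the sample data; shortened for illustration).
lines <- c(
  '""',
  '"Test Method","file1.ext"',
  '"Sample I. D.","file2.ext"',
  '"Specimen Number","1"',
  '',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"',
  '',
  '0.744,0.300,-0.046,0.197,-0.004',
  '0.903,0.400,0.038,0.239,0.003',
  '',
  '""',
  '"Test Method","file1.ext"',
  '"Sample I. D.","file2.ext"',
  '"Specimen Number","2"',
  '',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"',
  '',
  '0.755,0.300,-0.055,0.218,-0.005'
)

# Numeric rows start with a digit or a minus sign; comment rows start
# with a double quote and blank rows are empty.
is.data   <- grepl("^-?[0-9]", lines)
is.marker <- grepl("Test Method", lines, fixed = TRUE)

# Parse every numeric row in a single call.
all.data <- read.csv(text = lines[is.data], header = FALSE)

# findInterval() counts how many marker lines precede each numeric row,
# which is the position of the specimen within the file.
all.data$block <- findInterval(which(is.data), which(is.marker))

# One data frame per specimen, if a list is preferred.
by.block <- split(all.data, all.data$block)
```

Because the block label is computed from line positions, nothing here depends on how many rows each specimen has.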

Here is the output of scan():

    [1] "Test Method"                      ",\"XXX"                          "YYY"                            
    [4] "test"                             "-"                               "Edit"                           
    [7] "ABB"                              "1-8-08.ext\""                    "Sample I. D."                   
    [10] ",\"1000"                         "gsm"                             "string"                       
    [13] "string"                          "ab"                              "20796-87.ext\""                 
    [16] "Specimen Number"                 ",\"1\""                          "A (unit1)"                       
    [19] ",\"B"                            "(unit2)\",\"C"                   "(unit3)\",\"D"                
    [22] "(unit4)\",\"E"                   "(%)\""                           "0.744,0.300,-0.046,0.197,-0.004"
    [25] "0.903,0.400,0.038,0.239,0.003"   "1.096,0.500,0.123,0.290,0.011"   "1.314,0.600,0.207,0.348,0.018"

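Once I have the data frames, I will add several columns, including the specimen number, combine them into one data frame for that particular file, rename the columns, and then build a list containing the data from each file. I think I may need to write a function that uses some member of the apply family. I would like to stay away from for loops.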
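
For the multi-file step, a minimal sketch with lapply() and Map() might look like the following. The temporary files, the reader `read_one`, and the added `file` column are all made up for the example; a real reader would also need to split each file into specimens and drop the comment lines.

```r
# Two small temporary files stand in for the real .txt files.
folder <- tempfile("specimens")
dir.create(folder)
writeLines(c("0.744,0.300", "0.903,0.400"), file.path(folder, "file_a.txt"))
writeLines("0.755,0.300", file.path(folder, "file_b.txt"))

# Hypothetical per-file reader, kept trivial on purpose.
read_one <- function(path) {
  read.csv(path, header = FALSE, col.names = c("A", "B"))
}

files <- list.files(folder, pattern = "\\.txt$", full.names = TRUE)

# lapply() returns one data frame per file; naming the list by file name
# keeps track of where each data frame came from.
per.file <- lapply(files, read_one)
names(per.file) <- basename(files)

# Map() walks the data frames and their names together to add a column.
per.file <- Map(function(d, f) { d$file <- f; d }, per.file, names(per.file))
```

The result is a named list with one data frame per file, built without an explicit for loop.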

RStudio was used.

Session info

    > sessionInfo()
    R version 3.2.1 (2015-06-18)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
    [5] LC_TIME=English_United States.1252    

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] plyr_1.8.3  tidyr_0.2.0 dplyr_0.4.2

    loaded via a namespace (and not attached):
    [1] magrittr_1.5   R6_2.0.1       assertthat_0.1 parallel_3.2.1 DBI_0.3.1      tools_3.2.1    Rcpp_0.11.6

Thank you all in advance for your help.

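Update: Further investigation showed that I can read the text file with read.csv(), which puts the data into a data frame in R.

When I use the code below

    df <- read.csv("file.txt", sep = c(",", "\n"), header = F, stringsAsFactors = F)

the resulting data frame looks like this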

                 V1                          V2             V3           V4         V5
    1       Test Method                   file1.ext                                       
    2      Sample I. D.                   file2.ext                                       
    3   Specimen Number                           1                                       
    4         A (unit1)                   B (unit2)      C (unit3)    D (unit4)      E (%)
    5            -0.150                       0.000         -0.198       -0.006    -14.671
    6            -0.147                       0.100         -0.198       -0.006    -14.671
    7            -0.190                       0.300         -0.194       -0.007    -14.383
    8            -0.177                       0.400         -0.191       -0.007    -14.135
    9            -0.163                       0.500         -0.188       -0.006    -13.891
    203     Test Method                   file1.ext                                       
    204    Sample I. D.                   file2.ext                                       
    205 Specimen Number                           2                                       
    206       A (unit1)                   B (unit2)      C (unit3)    D (unit4)      E (%)
    207          -0.206                       0.000         -0.162       -0.008    -11.967
    208          -0.201                       0.100         -0.162       -0.008    -11.967
    209          -0.242                       0.300         -0.158       -0.010    -11.679
    210          -0.223                       0.400         -0.154       -0.009    -11.435
    211          -0.222                       0.500         -0.151       -0.009    -11.187
    212          -0.216                       0.600         -0.148       -0.009    -10.939

This is only part of what I need. My earlier questions still stand. Again, thank you all in advance.

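Update -- clarification

I wanted to submit another update to clarify what kind of output I am after, so I decided to do the following by hand. As mentioned, I know where the data for each specimen start. The indices are stored in a vector.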

    specimen.record.start <- grep(pattern = "Test Method", data.file1$V1)
    > specimen.record.start
    [1]    1  363  726 1089 1452 1815 2178 2541 2904 3267 3630 3993

I use slice() for each of the 12 specimens, taking the indices in specimen.record.start to choose the correct start and end point of each slice.

    spec1.dfa <- data.file1 %>% slice(1:362)
    spec2.dfa <- data.file1 %>% slice(363:725)
    spec3.dfa <- data.file1 %>% slice(726:1088)
    spec4.dfa <- data.file1 %>% slice(1089:1451)
    spec5.dfa <- data.file1 %>% slice(1452:1814)
    spec6.dfa <- data.file1 %>% slice(1815:2177)
    spec7.dfa <- data.file1 %>% slice(2178:2540)
    spec8.dfa <- data.file1 %>% slice(2541:2903)
    spec9.dfa <- data.file1 %>% slice(2904:3266)
    spec10.dfa <- data.file1 %>% slice(3267:3629)
    spec11.dfa <- data.file1 %>% slice(3630:3992)
    spec12.dfa <- data.file1 %>% slice(3993:4355)

Then I built the data frame I want as follows:

    > spec1.dfa %>% filter(row_number() > 4) %>% rename(A = V1, B = V2, C = V3, D = V4, E = V5) %>% mutate(specimen = 1, F = 1000, G = "cross", H = TRUE)
    Source: local data frame [358 x 9]

           A     B         C      D      E specimen    F         G      H
    1  0.744 0.300    -0.046  0.197 -0.004        1 1000     cross   TRUE
    2  0.903 0.400     0.038  0.239  0.003        1 1000     cross   TRUE
    3  1.096 0.500     0.123  0.290  0.011        1 1000     cross   TRUE
    4  1.314 0.600     0.207  0.348  0.018        1 1000     cross   TRUE
    5  1.532 0.700     0.289  0.406  0.025        1 1000     cross   TRUE
    6  1.776 0.800     0.373  0.471  0.033        1 1000     cross   TRUE
    7  2.029 0.900     0.457  0.538  0.040        1 1000     cross   TRUE
    8  2.282 1.000     0.541  0.605  0.047        1 1000     cross   TRUE
    9  2.533 1.100     0.623  0.671  0.054        1 1000     cross   TRUE
    10 2.783 1.200     0.707  0.738  0.062        1 1000     cross   TRUE
    ..   ...   ...       ...    ...    ...      ...  ...       ...    ...

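Again, I would like to use something that fills in the index numbers automatically for each specimen record in the file. Please also keep in mind that the text files I have contain multiple records, each specific to one specimen. In this case there are twelve specimens, but other files may have more or fewer.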
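
One way to avoid typing the indices, sketched here under the assumption that each record runs from its "Test Method" row to the row just before the next one, is to derive the stop positions from the start positions and let Map() pair them up. The data frame below is a short stand-in with made-up rows, and the names `starts`, `stops`, and `records` are illustrative.

```r
# Stand-in for data.file1 with three short specimen records.
data.file1 <- data.frame(
  V1 = c("Test Method", "0.744", "0.903",
         "Test Method", "0.755",
         "Test Method", "0.760", "0.921", "1.140"),
  stringsAsFactors = FALSE
)

# Start of each record, found the same way as in the question.
starts <- grep("Test Method", data.file1$V1)

# Each record stops one row before the next start; the last record stops
# at the final row, so the number of specimens never has to be typed in.
stops <- c(starts[-1] - 1L, nrow(data.file1))

# Map() calls the function once per (start, stop) pair.
records <- Map(function(s, e) data.file1[s:e, , drop = FALSE], starts, stops)
```

Since `stops` is computed from `starts`, the same lines work for a file with twelve specimens or with any other number.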

Again, thank you all in advance.

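Final update -- simplified, perhaps

I wanted to include one last update showing a version of the code I previously ran by hand, which I would really like to replace with a single function call. As mentioned, two vectors hold the indices, as integers, that mark where each specimen's data start and stop in the original file.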

    # > specimen.record.start
    # [1]    1  363  726 1089 1452 1815 2178 2541 2904 3267 3630 3993
    # > specimen.record.stop
    # [1]  362  725 1088 1451 1814 2177 2540 2903 3266 3629 3992 4355
    # > class(specimen.record.start)
    # [1] "integer"
    # > class(specimen.record.stop)
    # [1] "integer"

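The previous update did the same thing, except that the index numbers were typed into the slice function by hand. Below I replace the literal index numbers with bracket indexing into the vectors. Ideally, I would like to call a single function of a few lines that iterates over the slices. I gave each sliced data frame its own name, but I think they could all be fed into one empty data frame. I am just not sure how to do it.

    # Again to illustrate create the data frames manually.
    # The following is a set of data frames sliced from the original data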

    # > spec1.dfb <- data.file1 %>% slice(specimen.record.start[1] : specimen.record.stop[1])
    # > spec2.dfb <- data.file1 %>% slice(specimen.record.start[2] : specimen.record.stop[2])
    # > spec3.dfb <- data.file1 %>% slice(specimen.record.start[3] : specimen.record.stop[3])
    # > spec4.dfb <- data.file1 %>% slice(specimen.record.start[4] : specimen.record.stop[4])
    # > spec5.dfb <- data.file1 %>% slice(specimen.record.start[5] : specimen.record.stop[5])
    # > spec6.dfb <- data.file1 %>% slice(specimen.record.start[6] : specimen.record.stop[6])
    # > spec7.dfb <- data.file1 %>% slice(specimen.record.start[7] : specimen.record.stop[7])
    # > spec8.dfb <- data.file1 %>% slice(specimen.record.start[8] : specimen.record.stop[8])
    # > spec9.dfb <- data.file1 %>% slice(specimen.record.start[9] : specimen.record.stop[9])
    # > spec10.dfb <- data.file1 %>% slice(specimen.record.start[10] : specimen.record.stop[10])
    # > spec11.dfb <- data.file1 %>% slice(specimen.record.start[11] : specimen.record.stop[11])
    # > spec12.dfb <- data.file1 %>% slice(specimen.record.start[12] : specimen.record.stop[12])

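The sliced data frames are then filtered using the pipe operator %>% to keep only the data and exclude the comments found at the start of each new block of specimen data. I would also mutate these data frames to add some extra columns, as shown in the previous update, and rename the columns labeled V1 to V5. For simplicity, though, I show only the filter below. Note that row_number() > 4 marks where the comments stop in a sliced data frame. Again, ideally I would like to filter each data frame (or data set) iteratively.

    # The following is a set of data frames filtered from the sliced data frames to exclude comment lines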

    # > spec1.dfc <- spec1.dfb %>% filter(row_number() > 4)
    # > spec2.dfc <- spec2.dfb %>% filter(row_number() > 4)
    # > spec3.dfc <- spec3.dfb %>% filter(row_number() > 4)
    # > spec4.dfc <- spec4.dfb %>% filter(row_number() > 4)
    # > spec5.dfc <- spec5.dfb %>% filter(row_number() > 4)
    # > spec6.dfc <- spec6.dfb %>% filter(row_number() > 4)
    # > spec7.dfc <- spec7.dfb %>% filter(row_number() > 4)
    # > spec8.dfc <- spec8.dfb %>% filter(row_number() > 4)
    # > spec9.dfc <- spec9.dfb %>% filter(row_number() > 4)
    # > spec10.dfc <- spec10.dfb %>% filter(row_number() > 4)
    # > spec11.dfc <- spec11.dfb %>% filter(row_number() > 4)
    # > spec12.dfc <- spec12.dfb %>% filter(row_number() > 4)

Finally, all of the sliced and filtered data frames are row-bound to create a final data frame containing all of the data.

    all.dfc <- rbind(spec1.dfc, spec2.dfc, spec3.dfc,
                     spec4.dfc, spec5.dfc, spec6.dfc,
                     spec7.dfc, spec8.dfc, spec9.dfc,
                     spec10.dfc, spec11.dfc, spec12.dfc)
    # > all.dfc
    # Source: local data frame [4,307 x 5]
    # 
    #       V1    V2     V3    V4     V5
    # 1  0.744 0.300 -0.046 0.197 -0.004
    # 2  0.903 0.400  0.038 0.239  0.003
    # 3  1.096 0.500  0.123 0.290  0.011
    # 4  1.314 0.600  0.207 0.348  0.018
    # 5  1.532 0.700  0.289 0.406  0.025
    # 6  1.776 0.800  0.373 0.471  0.033
    # 7  2.029 0.900  0.457 0.538  0.040
    # 8  2.282 1.000  0.541 0.605  0.047
    # 9  2.533 1.100  0.623 0.671  0.054
    # 10 2.783 1.200  0.707 0.738  0.062
    # ..   ...   ...    ...   ...    ...
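
Once the slices live in a list rather than in twelve separately named objects, the repeated filter calls and the long rbind() call could be collapsed into a few lines. This is a minimal base R sketch, assuming the first four rows of every slice are comment rows as stated above; `make_slice` and `slices` are hand-built stand-ins, and `specimen` is an illustrative added column.

```r
# Stand-in list of two sliced data frames: four comment rows, then data.
make_slice <- function(values) {
  data.frame(V1 = c("Test Method", "Sample I. D.", "Specimen Number",
                    "A (unit1)", values),
             stringsAsFactors = FALSE)
}
slices <- list(make_slice(c("0.744", "0.903")), make_slice("0.755"))

# tail(x, -4) keeps everything except the first four rows.
filtered <- lapply(slices, tail, -4)

# Tag each piece with its position in the list before stacking.
filtered <- Map(function(d, i) { d$specimen <- i; d },
                filtered, seq_along(filtered))

# do.call() passes every list element to rbind() as a separate argument.
all.dfc <- do.call(rbind, filtered)
all.dfc$V1 <- as.numeric(all.dfc$V1)
```

The do.call(rbind, ...) idiom works for a list of any length, so the number of specimens does not need to be known in advance.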

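To summarize, the data need to be read into R, and the file then needs to be split (sliced) into sections corresponding to the individual specimens. Each block of data is specific to a particular test on a particular specimen. The blocks (slices) then need to be filtered and combined into a single data frame, with new columns added. I have tried several of the looping functions in the apply family, but they all seem to elude me. I am thinking of a function that does something like the following.

Note: the following is for illustration only and is not actual code.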

    my_function <- function {
        my_data <- read(my_files)[i]
        my_final_data_frame <- my_data %>% slice(my_data) %>% filter(my_data) %>% mutate(my_data) %>% rename(my_data)
        repeat

    }

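The my_function code is pseudocode, given only to illustrate the concept; it does not reflect any understanding of how to code it. I am not sure exactly how it would be written. If anyone has any ideas, I welcome them.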
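
A possible concrete form of that idea is sketched below in base R. It is only one way to write it, and it assumes that each specimen's block opens with a "Test Method" line, that the three comment lines are "key","value" pairs, and that the numeric rows have five fields; `read_specimens` and the output column names are hypothetical.

```r
# Hypothetical replacement for my_function: read one file and return a
# single data frame with one row per measurement.
read_specimens <- function(path) {
  lines <- readLines(path)
  lines <- lines[!(lines %in% c("", '""'))]      # drop blanks and "" rows

  # Label each line with its block and split into one vector per specimen.
  blocks <- split(lines, cumsum(grepl("Test Method", lines, fixed = TRUE)))

  pieces <- lapply(blocks, function(block) {
    # Lines 1-3: comments, line 4: column header, the rest: numbers.
    meta <- read.csv(text = block[1:3], header = FALSE,
                     colClasses = "character")
    out <- read.csv(text = block[-(1:4)], header = FALSE,
                    col.names = c("A", "B", "C", "D", "E"))
    out$specimen <- as.integer(meta$V2[meta$V1 == "Specimen Number"])
    out$sample   <- meta$V2[meta$V1 == "Sample I. D."]
    out
  })

  do.call(rbind, unname(pieces))
}

# Try it on a small temporary file laid out like the sample data.
path <- tempfile(fileext = ".txt")
writeLines(c(
  '""', '"Test Method","file1.ext"', '"Sample I. D.","file2.ext"',
  '"Specimen Number","1"', '',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"', '',
  '0.744,0.300,-0.046,0.197,-0.004', '0.903,0.400,0.038,0.239,0.003', '',
  '""', '"Test Method","file1.ext"', '"Sample I. D.","file2.ext"',
  '"Specimen Number","2"', '',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"', '',
  '0.755,0.300,-0.055,0.218,-0.005'
), path)

result <- read_specimens(path)

# The same function could then be applied to every file in a folder:
# all.files <- lapply(list.files(pattern = "\\.txt$"), read_specimens)
```

Working on whole lines sidesteps the ragged rows that the comment lines would otherwise create in a single read.csv() call.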

Thanks again.

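I found that the code below solves the problem. See the information provided in the question for details. The code extracts data from multiple forms, or blocks, corresponding to different data sets and places them in separate data frames. First, the files are found and read into a list, then combined into one master data frame. Second, the slice points (indices) for the start and end of each slice are determined. Third, the slicing is performed at those indices by iterating over them, and the slices are placed in a temporary list. Finally, the list of slices is manipulated to add columns and rename columns.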

    # load the required packages (plyr is attached before dplyr)
      library(plyr)
      library(dplyr)
      library(tidyr)

    # find file names ending in .txt
      file.names <- list.files(pattern = "\\.txt$")

    # read files into a list, then stack them into one master data frame;
    # lines are already split on newlines, so only the field separator is needed
      read.files <- lapply(file.names, read.csv, sep = ",", header = FALSE, stringsAsFactors = FALSE)
      read.files.df <- ldply(read.files)

    # find indices: each block starts at a "Test Method" row and stops
    # one row before the next block starts (or at the last row)
      index.slice.start <- grep(pattern = "Test Method", read.files.df$V1)
      index.slice.stop <- c(index.slice.start[-1] - 1L, nrow(read.files.df))

    # slice dataframes
      tmp <- vector(mode = "list", length = length(index.slice.start))

      for (i in seq_along(index.slice.start)) {
        tmp[[i]] <- slice(read.files.df, index.slice.start[i]:index.slice.stop[i])
      }

    # Construct the dataframe
    all.df <- lapply(tmp, mutate, specimen = V2[3], sample = V2[2]) %>%
        lapply(filter, row_number() > 4) %>%
        lapply(rename, Variable1 = V1, Variable2 = V2, Variable3 = V3, Variable4 = V4, Variable5 = V5) %>%
        ldply %>%
        separate(sample, into = c("A", "B", "C", "D", "E", "F", "G"), sep = " ", extra = "merge") %>%
        rename(weight = A, gsm = B, null = C, agent = D, condition = E, direction = F, id = G) %>%
        select(-null, -gsm)