Reading a list of data frames from a comma-separated text file with comments, automated
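Note: Three updates have been made. Ideas are welcome.
I have multiple text (.txt) files that are essentially lists of data frames, holding the data for each of several specimens.
Each specimen's data set begins with an empty quoted string (""), followed by a series of comments in the form of comma-separated quoted strings ("string").
I need to separate out the data set for each specimen, split it into individual columns, and add new columns with the information provided in the comments.
The data I want to extract are the file name "file2.ext" and the specimen number that follows the "Specimen Number" comment.
A sample of the data follows.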
""
"Test Method","file1.ext"
"Sample I. D.","file2.ext"
"Specimen Number","1"
"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"
0.744,0.300,-0.046,0.197,-0.004
0.903,0.400,0.038,0.239,0.003
1.096,0.500,0.123,0.290,0.011
1.314,0.600,0.207,0.348,0.018
1.532,0.700,0.289,0.406,0.025
1.776,0.800,0.373,0.471,0.033
2.029,0.900,0.457,0.538,0.040
2.282,1.000,0.541,0.605,0.047
2.533,1.100,0.623,0.671,0.054
2.783,1.200,0.707,0.738,0.062
3.044,1.300,0.792,0.807,0.069
3.319,1.400,0.876,0.880,0.076
3.587,1.500,0.958,0.951,0.084
""
"Test Method","file1.ext"
"Sample I. D.","file2.ext"
"Specimen Number","2"
"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"
0.755,0.300,-0.055,0.218,-0.005
0.918,0.400,0.030,0.265,0.003
1.137,0.500,0.114,0.328,0.010
1.377,0.600,0.198,0.397,0.017
1.626,0.700,0.282,0.469,0.024
1.874,0.800,0.365,0.541,0.031
2.136,0.900,0.450,0.616,0.038
2.400,1.000,0.533,0.692,0.045
2.667,1.100,0.615,0.770,0.051
2.935,1.200,0.699,0.847,0.058
3.221,1.300,0.784,0.930,0.066
3.505,1.400,0.867,1.011,0.072
3.804,1.500,0.949,1.098,0.079
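I have been able to build usable data frames, but I need to know two things:
1 -- Is there a simpler way to read the files into a list?
2 -- How can I automate reading the files and building the final data frame, including the new columns?
Reading the text file into R with scan() produces a character vector that includes all the comments.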
character.vector <- scan("file_name.txt", "")
Find where the comment "Test Method" occurs in 'character.vector', using grep() to identify each specimen.
specimen.vector <- grep(pattern = "Test Method", character.vector)
> specimen.vector
[1] 1 382 764 1146 1528 1910 2292 2674 3056 3438 3820 4202
Determine the indices needed to subset 'character.vector' when building a new data frame for a single specimen.
> specimen.start.at <- specimen.vector + 24
> specimen.start.at
[1] 25 406 788 1170 1552 1934 2316 2698 3080 3462 3844 4226
> specimen.stop.at <- specimen.vector + 381
> specimen.stop.at
[1] 382 763 1145 1527 1909 2291 2673 3055 3437 3819 4201 4583
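There are 12 specimens, with indices given by the vectors 'specimen.start.at' and 'specimen.stop.at'.
For example, the data for specimen 1 (excluding comments) span 25:382 in 'character.vector'.
I have not figured out how to extract the data for each specimen automatically, so I entered the index manually, as follows.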
start <- specimen.start.at[specimen_number]
finish <- specimen.stop.at[specimen_number]
specimen.dataframe <- character.vector[start:finish] %>% strsplit(split = ",", fixed = TRUE) %>% ldply %>% tbl_df
The output for each specimen is a data frame with 5 unlabeled columns.
V1 V2 V3 V4 V5
1 1.073 0.400 0.215 0.198 0.022
2 1.315 0.500 0.299 0.242 0.031
3 1.562 0.600 0.382 0.288 0.040
4 1.840 0.700 0.466 0.339 0.049
5 2.135 0.800 0.550 0.393 0.058
6 2.438 0.900 0.634 0.449 0.066
7 2.740 1.000 0.716 0.505 0.075
8 3.046 1.100 0.800 0.561 0.084
9 3.349 1.200 0.884 0.617 0.092
10 3.660 1.300 0.969 0.674 0.101
.. ... ... ... ... ...
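The manual indexing above could be automated by deriving each stop index from the next start index and looping over the pairs with Map(). This is only a minimal sketch: the small 'character.vector' and the header count 'n.header' below are hypothetical stand-ins, not values from the real files.

```r
# Hypothetical stand-in for 'character.vector': two header elements
# followed by comma-separated data strings, for two specimens.
character.vector <- c("Test Method", "header",
                      "0.744,0.300", "0.903,0.400",
                      "Test Method", "header",
                      "0.755,0.300", "0.918,0.400", "1.137,0.500")

specimen.vector <- grep("Test Method", character.vector, fixed = TRUE)
n.header <- 2  # assumed number of header elements per specimen

specimen.start.at <- specimen.vector + n.header
# Each block ends just before the next one starts; the last ends at the end.
specimen.stop.at <- c(specimen.vector[-1] - 1, length(character.vector))

# Build one data frame per specimen without typing indices by hand.
specimen.list <- Map(function(start, finish) {
  fields <- strsplit(character.vector[start:finish], split = ",", fixed = TRUE)
  as.data.frame(do.call(rbind, lapply(fields, as.numeric)))
}, specimen.start.at, specimen.stop.at)
```

Because the stop indices are computed from the start indices, the same lines should work for files with any number of specimens, and for blocks of unequal length.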
Here is the output of scan():
[1] "Test Method" ",\"XXX" "YYY"
[4] "test" "-" "Edit"
[7] "ABB" "1-8-08.ext\"" "Sample I. D."
[10] ",\"1000" "gsm" "string"
[13] "string" "ab" "20796-87.ext\""
[16] "Specimen Number" ",\"1\"" "A (unit1)"
[19] ",\"B" "(unit2)\",\"C" "(unit3)\",\"D"
[22] "(unit4)\",\"E" "(%)\"" "0.744,0.300,-0.046,0.197,-0.004"
[25] "0.903,0.400,0.038,0.239,0.003" "1.096,0.500,0.123,0.290,0.011" "1.314,0.600,0.207,0.348,0.018"
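The fragmented headers above appear to come from scan() splitting on whitespace by default, so a header such as "B (unit2)" is broken into pieces. One possible alternative is readLines(), which returns one element per line. The sketch below assumes the file layout shown in the question; 'sample.text' and the temporary file are hypothetical stand-ins for a real file.

```r
# Hypothetical sample text standing in for one of the .txt files.
sample.text <- c(
  '""',
  '"Test Method","file1.ext"',
  '"Sample I. D.","file2.ext"',
  '"Specimen Number","1"',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"',
  '0.744,0.300,-0.046,0.197,-0.004',
  '0.903,0.400,0.038,0.239,0.003'
)
tmp.file <- tempfile(fileext = ".txt")
writeLines(sample.text, tmp.file)

# One element per line, so every record keeps its fields together.
file.lines <- readLines(tmp.file)

# Positions of the lines that open each specimen block.
block.start <- grep("Test Method", file.lines, fixed = TRUE)
```

With one element per line, the header of every block occupies the same number of elements, which makes the offsets to the first data line easier to reason about.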
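Once I have the data frames, I will add several columns, including the specimen number, combine them into one data frame for that particular file, rename the columns, and then build a list containing the data from every file. I think I may need to write a function that uses some member of the apply family. I would like to stay away from for loops.
RStudio was used.
Session info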
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] plyr_1.8.3 tidyr_0.2.0 dplyr_0.4.2
loaded via a namespace (and not attached):
[1] magrittr_1.5 R6_2.0.1 assertthat_0.1 parallel_3.2.1 DBI_0.3.1 tools_3.2.1 Rcpp_0.11.6
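Thank you all in advance for your help.
Update
Further investigation showed that I can use read.csv() to read the text file and place the data in a data frame in R.
When I use the code below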
df <- read.csv("file.txt", sep = c(",", "\n"), header = F, stringsAsFactors= F)
the resulting data frame looks like this
V1 V2 V3 V4 V5
1 Test Method file1.ext
2 Sample I. D. file2.ext
3 Specimen Number 1
4 A (unit1) B (unit2) C (unit3) D (unit4) E (%)
5 -0.150 0.000 -0.198 -0.006 -14.671
6 -0.147 0.100 -0.198 -0.006 -14.671
7 -0.190 0.300 -0.194 -0.007 -14.383
8 -0.177 0.400 -0.191 -0.007 -14.135
9 -0.163 0.500 -0.188 -0.006 -13.891
203 Test Method file1.ext
204 Sample I. D. file2.ext
205 Specimen Number 2
206 A (unit1) B (unit2) C (unit3) D (unit4) E (%)
207 -0.206 0.000 -0.162 -0.008 -11.967
208 -0.201 0.100 -0.162 -0.008 -11.967
209 -0.242 0.300 -0.158 -0.010 -11.679
210 -0.223 0.400 -0.154 -0.009 -11.435
211 -0.222 0.500 -0.151 -0.009 -11.187
212 -0.216 0.600 -0.148 -0.009 -10.939
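This is only part of what I need. My earlier questions still stand. Again, thank you all in advance.
Update -- Clarification
I wanted to submit another update to clarify what kind of output I am after, so I did the following by hand. As noted earlier, I know where the data for each specimen start. The indices are stored in a vector.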
specimen.record.start <- grep(pattern = "Test Method", data.file1$V1)
> specimen.record.start
[1] 1 363 726 1089 1452 1815 2178 2541 2904 3267 3630 3993
I used slice() for each of the 12 specimens, taking the indices in specimen.record.start to choose the correct start and end points for each slice.
spec1.dfa <- data.file1 %>% slice(1:362)
spec2.dfa <- data.file1 %>% slice(363:725)
spec3.dfa <- data.file1 %>% slice(726:1088)
spec4.dfa <- data.file1 %>% slice(1089:1451)
spec5.dfa <- data.file1 %>% slice(1452:1814)
spec6.dfa <- data.file1 %>% slice(1815:2177)
spec7.dfa <- data.file1 %>% slice(2178:2540)
spec8.dfa <- data.file1 %>% slice(2541:2903)
spec9.dfa <- data.file1 %>% slice(2904:3266)
spec10.dfa <- data.file1 %>% slice(3267:3629)
spec11.dfa <- data.file1 %>% slice(3630:3992)
spec12.dfa <- data.file1 %>% slice(3993:4355)
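The twelve hand-written slices could probably be replaced by a grouping vector and split(). This is a sketch only, assuming every block opens with a "Test Method" row in V1; the two-column 'data.file1' below is a hypothetical miniature of the real data frame.

```r
# Hypothetical miniature of 'data.file1' as read.csv() would return it.
data.file1 <- data.frame(
  V1 = c("Test Method", "Sample I. D.", "Specimen Number", "A (unit1)",
         "0.744", "0.903",
         "Test Method", "Sample I. D.", "Specimen Number", "A (unit1)",
         "0.755", "0.918", "1.137"),
  V2 = c("file1.ext", "file2.ext", "1", "B (unit2)", "0.300", "0.400",
         "file1.ext", "file2.ext", "2", "B (unit2)", "0.300", "0.400", "0.500"),
  stringsAsFactors = FALSE
)

# The counter increases by one at every "Test Method" row, so each row
# gets the number of the block it belongs to.
block.id <- cumsum(data.file1$V1 == "Test Method")

# One data frame per specimen, however many specimens the file holds.
spec.list <- split(data.file1, block.id)
```

The result is a named list ("1", "2", ...), so no start or stop indices have to be stored or typed in.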
I then built the data frame I want as follows:
> spec1.dfa %>% filter(row_number() > 4) %>% rename(A = V1, B = V2, C = V3, D = V4, E = V5) %>% mutate(specimen = 1, F = 1000, G = "cross", H = TRUE)
Source: local data frame [358 x 9]
A B C D E specimen F G H
1 0.744 0.300 -0.046 0.197 -0.004 1 1000 cross TRUE
2 0.903 0.400 0.038 0.239 0.003 1 1000 cross TRUE
3 1.096 0.500 0.123 0.290 0.011 1 1000 cross TRUE
4 1.314 0.600 0.207 0.348 0.018 1 1000 cross TRUE
5 1.532 0.700 0.289 0.406 0.025 1 1000 cross TRUE
6 1.776 0.800 0.373 0.471 0.033 1 1000 cross TRUE
7 2.029 0.900 0.457 0.538 0.040 1 1000 cross TRUE
8 2.282 1.000 0.541 0.605 0.047 1 1000 cross TRUE
9 2.533 1.100 0.623 0.671 0.054 1 1000 cross TRUE
10 2.783 1.200 0.707 0.738 0.062 1 1000 cross TRUE
.. ... ... ... ... ... ... ... ... ...
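Instead of typing the specimen number into mutate(), it could be read from the header rows of each block. The helper below is a hypothetical sketch in base R, assuming four header rows with the specimen number in row 3 of column V2, as in the read.csv() output shown earlier; 'tidy_specimen' and the two-column example block are not from the original code.

```r
# Hypothetical helper: turns one specimen block into a clean data frame.
# Assumes four header rows, with the specimen number in row 3, column V2.
tidy_specimen <- function(block, n.header = 4) {
  specimen <- as.integer(block$V2[3])
  values <- block[-seq_len(n.header), , drop = FALSE]
  values[] <- lapply(values, as.numeric)  # columns arrive as character
  names(values) <- LETTERS[seq_along(values)]
  values$specimen <- specimen
  rownames(values) <- NULL
  values
}

# Hypothetical block laid out like 'spec1.dfa' above, with two columns.
spec1.dfa <- data.frame(
  V1 = c("Test Method", "Sample I. D.", "Specimen Number", "A (unit1)",
         "0.744", "0.903"),
  V2 = c("file1.ext", "file2.ext", "1", "B (unit2)", "0.300", "0.400"),
  stringsAsFactors = FALSE
)
spec1 <- tidy_specimen(spec1.dfa)
```

The conversion with as.numeric() matters because the header rows force read.csv() to read every column as character.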
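Again, I would like something that automatically supplies the index numbers for each specimen record in a file. Keep in mind that the text files I have contain multiple records, each specific to one specimen. In this case there are twelve specimens, but other files may have more or fewer.
Again, thank you all in advance.
Final update -- Simplified, perhaps
I want to include one last update showing a version of the code I previously implemented by hand, which I would really like to replace with a single function call. As noted earlier, two vectors hold the indices, as integers, of where each specimen's data start and stop in the original file.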
# > specimen.record.start
# [1] 1 363 726 1089 1452 1815 2178 2541 2904 3267 3630 3993
# > specimen.record.stop
# [1] 362 725 1088 1451 1814 2177 2540 2903 3266 3629 3992 4355
# > class(specimen.record.start)
# [1] "integer"
# > class(specimen.record.stop)
# [1] "integer"
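The previous update accomplished the same thing, except that the index numbers were typed into the slice function by hand. Below I replace the index numbers with bracketed selections from the vectors. Ideally, I would like to call a single function of a few lines to iterate over the slices. I gave each sliced data frame its own name, but I think they could all be fed into one empty data frame. I am just not sure how to do that.
# Again, to illustrate, create the data frames manually.
# The following is a set of data frames sliced from the original data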
# > spec1.dfb <- data.file1 %>% slice(specimen.record.start[1] : specimen.record.stop[1])
# > spec2.dfb <- data.file1 %>% slice(specimen.record.start[2] : specimen.record.stop[2])
# > spec3.dfb <- data.file1 %>% slice(specimen.record.start[3] : specimen.record.stop[3])
# > spec4.dfb <- data.file1 %>% slice(specimen.record.start[4] : specimen.record.stop[4])
# > spec5.dfb <- data.file1 %>% slice(specimen.record.start[5] : specimen.record.stop[5])
# > spec6.dfb <- data.file1 %>% slice(specimen.record.start[6] : specimen.record.stop[6])
# > spec7.dfb <- data.file1 %>% slice(specimen.record.start[7] : specimen.record.stop[7])
# > spec8.dfb <- data.file1 %>% slice(specimen.record.start[8] : specimen.record.stop[8])
# > spec9.dfb <- data.file1 %>% slice(specimen.record.start[9] : specimen.record.stop[9])
# > spec10.dfb <- data.file1 %>% slice(specimen.record.start[10] : specimen.record.stop[10])
# > spec11.dfb <- data.file1 %>% slice(specimen.record.start[11] : specimen.record.stop[11])
# > spec12.dfb <- data.file1 %>% slice(specimen.record.start[12] : specimen.record.stop[12])
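The sliced data frames are then filtered, using the pipe operator %>%, to extract only the data and exclude the comments found at the start of each new specimen's data set. I would also mutate these data frames to add some extra columns, as shown in the previous update, and rename the columns labeled V1 to V5. For simplicity, though, I show only the filter step below. Note that row_number() > 4 marks where the comments stop in a sliced data frame. Again, ideally I would like to filter each data frame (or data set) iteratively.
# The following is a set of data frames filtered from the sliced data frames to exclude comment lines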
# > spec1.dfc <- spec1.dfb %>% filter(row_number() > 4)
# > spec2.dfc <- spec2.dfb %>% filter(row_number() > 4)
# > spec3.dfc <- spec3.dfb %>% filter(row_number() > 4)
# > spec4.dfc <- spec4.dfb %>% filter(row_number() > 4)
# > spec5.dfc <- spec5.dfb %>% filter(row_number() > 4)
# > spec6.dfc <- spec6.dfb %>% filter(row_number() > 4)
# > spec7.dfc <- spec7.dfb %>% filter(row_number() > 4)
# > spec8.dfc <- spec8.dfb %>% filter(row_number() > 4)
# > spec9.dfc <- spec9.dfb %>% filter(row_number() > 4)
# > spec10.dfc <- spec10.dfb %>% filter(row_number() > 4)
# > spec11.dfc <- spec11.dfb %>% filter(row_number() > 4)
# > spec12.dfc <- spec12.dfb %>% filter(row_number() > 4)
Finally, all the sliced and filtered data frames are row-bound to create a final data frame containing all the data.
all.dfc <- rbind(spec1.dfc, spec2.dfc, spec3.dfc,
spec4.dfc, spec5.dfc, spec6.dfc,
spec7.dfc, spec8.dfc, spec9.dfc,
spec10.dfc, spec11.dfc, spec12.dfc)
# > all.dfc
# Source: local data frame [4,307 x 5]
#
# V1 V2 V3 V4 V5
# 1 0.744 0.300 -0.046 0.197 -0.004
# 2 0.903 0.400 0.038 0.239 0.003
# 3 1.096 0.500 0.123 0.290 0.011
# 4 1.314 0.600 0.207 0.348 0.018
# 5 1.532 0.700 0.289 0.406 0.025
# 6 1.776 0.800 0.373 0.471 0.033
# 7 2.029 0.900 0.457 0.538 0.040
# 8 2.282 1.000 0.541 0.605 0.047
# 9 2.533 1.100 0.623 0.671 0.054
# 10 2.783 1.200 0.707 0.738 0.062
# .. ... ... ... ... ...
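If the slices are kept in a list rather than in twelve named objects, both the filtering and the row-binding can be written once. A minimal sketch in base R; the list 'sliced' and its single header row are hypothetical and much smaller than the real data.

```r
# Hypothetical list of sliced data frames, each with one header row here.
sliced <- list(
  data.frame(V1 = c("A (unit1)", "0.744", "0.903"), stringsAsFactors = FALSE),
  data.frame(V1 = c("A (unit1)", "0.755"), stringsAsFactors = FALSE)
)
n.header <- 1  # the real files have 4 header rows per specimen

# Drop the header rows from every slice in one call.
filtered <- lapply(sliced, function(d) d[-seq_len(n.header), , drop = FALSE])

# Row-bind any number of data frames without naming each one.
all.dfc <- do.call(rbind, filtered)
```

do.call(rbind, ...) passes every element of the list to rbind() as a separate argument, so the call does not change when the number of specimens does.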
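To summarize, the data need to be read into R, and the file then needs to be split (sliced) into the sections that correspond to each individual specimen. Each block holds the data from a specific test on a specific specimen. The blocks (slices) then need to be filtered and combined into one data frame, with new columns added. I have tried several of the looping functions in the apply family, but they all seem to elude me. I am thinking of a function that does something like the following.
Note: the following is for illustration only and is not actual code.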
my_function <- function {
my_data <- read(my_files)[i]
my_final_data_frame <- my_data %>% slice(my_data) %>% filter(my_data) %>% mutate(my_data) %>% rename(my_data)
repeat
}
my_function
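The code is pseudocode, given only to illustrate the concept, and does not reflect any real understanding of how to write it. I am not sure exactly how it would be written. If anyone has any ideas, I would welcome them.
Thanks again.
I found that the code below solves the problem. See the information provided in the question for details. The code extracts data from multiple forms, or blocks, that correspond to different data sets, and places them in separate data frames. First, the files are found and read into a list, then combined into one complete master data frame. Second, the slice points (indices) that mark the start and end of each slice are determined. Third, the slices are taken at those indices by looping over them, and each slice is placed in a temporary list. Finally, the list of slices is manipulated to add and rename columns.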
# load the required packages; plyr goes first so that dplyr's mutate() and rename() take precedence
library(plyr)
library(dplyr)
library(tidyr)
# find the names of the .txt files in the working directory
file.names <- list.files(pattern = "\\.txt$")
# read every file into a list, then stack the list into one master data frame
read.files <- lapply(file.names, read.csv, sep = ",", header = FALSE, stringsAsFactors = FALSE)
read.files.df <- ldply(read.files)
# find indices: a block starts at each "Test Method" row and stops one row before the next
index.slice.start <- grep(pattern = "Test Method", read.files.df$V1)
index.slice.stop <- c(index.slice.start[-1] - 1L, nrow(read.files.df))
# slice dataframes
tmp <- vector(mode = "list", length = length(index.slice.start))
for (i in seq_along(index.slice.start)) {
  tmp[[i]] <- slice(read.files.df, index.slice.start[i]:index.slice.stop[i])
}
# construct the dataframe: copy the specimen number and sample ID from the header rows,
# drop the four header rows, then rename the columns and split the sample ID
all.df <- lapply(tmp, mutate, specimen = V2[3], sample = V2[2]) %>%
  lapply(filter, row_number() > 4) %>%
  lapply(rename, Variable1 = V1, Variable2 = V2, Variable3 = V3, Variable4 = V4, Variable5 = V5) %>%
  ldply %>%
  separate(sample, into = c("A", "B", "C", "D", "E", "F", "G"), sep = " ", extra = "merge") %>%
  rename(weight = A, gsm = B, null = C, agent = D, condition = E, direction = F, id = G) %>%
  select(-null, -gsm)
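For comparison, the same steps might also be written without a for loop and in base R only, which is closer to the single function call asked for in the question. The following is a sketch under stated assumptions, not a drop-in replacement: 'read_specimens' is a hypothetical name, the separator lines are assumed to be exactly "", each block is assumed to have four header rows, and the sample ID is left unsplit.

```r
# Hypothetical function: reads one file and returns a single data frame
# with 'sample' and 'specimen' columns taken from the header rows.
read_specimens <- function(path, n.header = 4) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines) & lines != '""']  # drop separator lines
  raw <- read.csv(text = lines, header = FALSE, stringsAsFactors = FALSE)
  # One block per "Test Method" row.
  blocks <- split(raw, cumsum(raw$V1 == "Test Method"))
  tidy <- lapply(blocks, function(b) {
    values <- b[-seq_len(n.header), , drop = FALSE]
    values[] <- lapply(values, as.numeric)  # columns arrive as character
    values$sample <- b$V2[2]
    values$specimen <- as.integer(b$V2[3])
    values
  })
  out <- do.call(rbind, tidy)
  rownames(out) <- NULL
  out
}

# Hypothetical two-specimen file in the layout shown in the question.
sample.text <- c(
  '""', '"Test Method","file1.ext"', '"Sample I. D.","file2.ext"',
  '"Specimen Number","1"',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"',
  '0.744,0.300,-0.046,0.197,-0.004',
  '0.903,0.400,0.038,0.239,0.003',
  '""', '"Test Method","file1.ext"', '"Sample I. D.","file2.ext"',
  '"Specimen Number","2"',
  '"A (unit1)","B (unit2)","C (unit3)","D (unit4)","E (%)"',
  '0.755,0.300,-0.055,0.218,-0.005'
)
tmp.file <- tempfile(fileext = ".txt")
writeLines(sample.text, tmp.file)

# lapply() over a vector of file names handles any number of files.
all.files <- lapply(tmp.file, read_specimens)
all.df <- do.call(rbind, all.files)
```

In real use, the vector of file names returned by list.files() would take the place of 'tmp.file'.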