R melting/gathering a file with 2 heading rows into attributes
I need to read in a data file with two levels of headers. The data looks like this:
| | Jone Doe | | | | | | | Jane Doe | | | | | | |
|----------|----------|------|------|------|------|------|------|----------|------|------|------|------|------|------|
| Date | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |
| 1-Jul-13 | 49 | 42 | 20 | 18 | 23 | 16 | 29 | 48 | 33 | 24 | 10 | 43 | 13 | 43 |
| 2-Jul-13 | 17 | 16 | 43 | 33 | 37 | 37 | 10 | 7 | 45 | 19 | 4 | 41 | 41 | 20 |
| 3-Jul-13 | 35 | 39 | 42 | 35 | 5 | 12 | 22 | 3 | 28 | 23 | 10 | 12 | 5 | 8 |
I need it to look like this:
| Date | Name | Col1 | Col2 | Col3 | Col4 | Col5 | Col6 | Col7 |
|----------|----------|------|------|------|------|------|------|------|
| 1-Jul-13 | Jone Doe | 49 | 42 | 20 | 18 | 23 | 16 | 29 |
| 2-Jul-13 | Jone Doe | 17 | 16 | 43 | 33 | 37 | 37 | 10 |
| 3-Jul-13 | Jone Doe | 35 | 39 | 42 | 35 | 5 | 12 | 22 |
| 1-Jul-13 | Jane Doe | 48 | 33 | 24 | 10 | 43 | 13 | 43 |
| 2-Jul-13 | Jane Doe | 7 | 45 | 19 | 4 | 41 | 41 | 20 |
| 3-Jul-13 | Jane Doe | 3 | 28 | 23 | 10 | 12 | 5 | 8 |
Any idea how to do this without hard-coding? I have been trying melt() and gather() without any luck.
Edit:
Sample data: https://drive.google.com/open?id=1T4KkAk5D55_nXsHlr1Aozed6d49qFM_8
Output of lst1:
Output of nm1:
[1] "John Doe" "John Doe" "John Doe" "John Doe" "John Doe" "John Doe" "John Doe" "Jane Doe"
[9] "Jane Doe" "Jane Doe" "Jane Doe" "Jane Doe" "Jane Doe" "Jane Doe" "Jose Doe" "Jose Doe"
[17] "Jose Doe" "Jose Doe" "Jose Doe" "Jose Doe" "Jose Doe" "Jacob Doe" "Jacob Doe" "Jacob Doe"
[25] "Jacob Doe" "Jacob Doe" "Jacob Doe" "Jacob Doe"
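One option is to read the dataset with the skip argument so that the first line is skipped. We can then split the data into a list based on the repeating column names, add a 'Name' column to each list element based on the first row, and rbind the list elements to create a single data.frame.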
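# read the file, skipping the first header row (the names)
dat1 <- read.csv("file.csv", header = TRUE, skip = 1,
                 stringsAsFactors = FALSE, na.strings = "N/A")
# one name per block of columns, in file order; a new block starts at each "Col1"
nm1 <- c("John Doe", "Jane Doe")[cumsum(grepl("Col1", names(dat1)[-1]))]
# strip the ".1", ".2", ... suffixes that read.csv adds to duplicated column names
nm2 <- unique(sub("\\.\\d+$", "", names(dat1)[-1]))
# split the columns by name; the factor levels keep the blocks in file order
lst1 <- split.default(dat1[-1], factor(nm1, levels = unique(nm1)))
# add a 'Name' column to each block, stack the blocks, and recycle 'Date'
dat2 <- cbind(dat1['Date'], do.call(rbind, Map(cbind, Name = names(lst1), lapply(lst1, setNames, nm2))))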
row.names(dat2) <- NULL
head(dat2, 5)
# Date Name Col1 Col2 Col3 Col4 Col5 Col6 Col7
#1 1-Jul-13 John Doe 52 6 NA NA 7 20 25
#2 2-Jul-13 John Doe 43 7 NA NA NA 25 17
#3 3-Jul-13 John Doe 55 5 NA NA 4 23 28
#4 4-Jul-13 John Doe 42 6 NA NA 7 21 14
#5 5-Jul-13 John Doe 64 3 NA NA 5 36 22
dim(dat2)
#[1] 140 9
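To make the steps above easy to try without the original file, here is a minimal sketch that writes a small CSV to a temporary file and runs the same split/rbind approach on it. The file contents are an assumption: they are typed in from the three sample rows in the question, with the names spelled as in the answer's code. The slightly stricter pattern in grepl is also my own choice, so that a column such as "Col10" would not start a new block.

```r
# Hypothetical sample file reconstructed from the table in the question
csv_lines <- c(
  ",John Doe,,,,,,,Jane Doe,,,,,,",
  "Date,Col1,Col2,Col3,Col4,Col5,Col6,Col7,Col1,Col2,Col3,Col4,Col5,Col6,Col7",
  "1-Jul-13,49,42,20,18,23,16,29,48,33,24,10,43,13,43",
  "2-Jul-13,17,16,43,33,37,37,10,7,45,19,4,41,41,20",
  "3-Jul-13,35,39,42,35,5,12,22,3,28,23,10,12,5,8"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Skip the row of names; read.csv renames the repeated columns Col1.1, Col2.1, ...
dat1 <- read.csv(tmp, header = TRUE, skip = 1, stringsAsFactors = FALSE)

# A new block starts at every column called Col1 (with or without a numeric suffix)
block_id <- cumsum(grepl("^Col1(\\.\\d+)?$", names(dat1)[-1]))
nm1 <- c("John Doe", "Jane Doe")[block_id]

# Column names shared by every block, with the numeric suffixes removed
nm2 <- unique(sub("\\.\\d+$", "", names(dat1)[-1]))

# Split the value columns into one data.frame per person, keeping file order
lst1 <- split.default(dat1[-1], factor(nm1, levels = unique(nm1)))

# Add the Name column to each block and stack the blocks
long <- do.call(rbind, Map(cbind, Name = names(lst1), lapply(lst1, setNames, nm2)))

# Date has 3 rows and is recycled to match the 6 stacked rows
dat2 <- cbind(dat1["Date"], long)
row.names(dat2) <- NULL
dat2
```

The result should have one row per date and person, in the layout requested in the question.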
Note: if the number of column blocks is large, one option is to read the first line with readLines
v1 <- readLines("file.csv", n = 1)
v2 <- scan(text = gsub(",{2,}", ",", trimws(v1)), sep=",", what = "", quiet = TRUE)
v3 <- v2[nzchar(v2)]
and feed it to the cumsum step
nm1 <- v3[cumsum(grepl("Col1", names(dat1)[-1]))]
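Putting the two pieces together, the whole thing can be wrapped in one function that takes the names from the first line of the file instead of hard-coding them. This is only a sketch under some assumptions: read_two_header_csv is a hypothetical helper name, the first column is assumed to be the id column, every block is assumed to have the same columns in the same order, and a new block is assumed to start wherever the first value column's name repeats. The three-person sample file is made up for illustration.

```r
# Hypothetical helper: reshape a CSV with a names row above the column header row
read_two_header_csv <- function(path) {
  # First line: one name per block, with blank cells in between
  v1 <- readLines(path, n = 1)
  v2 <- scan(text = gsub(",{2,}", ",", trimws(v1)), sep = ",", what = "", quiet = TRUE)
  group_names <- v2[nzchar(v2)]

  # Remaining lines: the real header and the data
  dat <- read.csv(path, header = TRUE, skip = 1, stringsAsFactors = FALSE)

  # Remove the .1, .2, ... suffixes added to duplicated column names
  value_names <- sub("\\.\\d+$", "", names(dat)[-1])

  # Assumption: a block starts each time the first value column name reappears
  block_id <- cumsum(value_names == value_names[1])
  stopifnot(max(block_id) == length(group_names))

  # One data.frame per block, in file order
  blocks <- split.default(dat[-1], block_id)
  nm2 <- unique(value_names)

  # Label each block with its name and stack the blocks
  long <- do.call(rbind, Map(
    function(name, block) cbind(Name = name, setNames(block, nm2)),
    group_names, blocks
  ))

  # The id column is recycled to the length of the stacked result
  out <- cbind(dat[1], long)
  row.names(out) <- NULL
  out
}

# Usage with a made-up file holding three people and two columns each
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  ",John Doe,,Jane Doe,,Jose Doe,",
  "Date,Col1,Col2,Col1,Col2,Col1,Col2",
  "1-Jul-13,1,2,3,4,5,6",
  "2-Jul-13,7,8,9,10,11,12"
), tmp)
res <- read_two_header_csv(tmp)
res
```

The stopifnot check is there so that a mismatch between the number of names found in the first line and the number of detected blocks fails loudly instead of silently producing NA names.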