Tidy and Cast Data With Headers Stuck in Rows
demodf <- data.frame(
name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"))
name Field Values
1 Mike EDUCATION EDUCATION
2 Mike Degree Masters
3 Mike Title Student
4 Mike WORK WORK
5 Mike Title VP Sales
6 Joe EDUCATION EDUCATION
7 Joe Degree Bachelors
8 Joe Title Student
9 Joe WORK WORK
10 Joe Title Analyst
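I would like to use tidyr::spread or reshape2::dcast to convert this to wide format, with Field becoming the column headers. The code would look like dcast(demodf, name ~ Values) or demodf %>% spread(Field, Values). However, dcast coerces the values to numbers, and spread throws an error.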
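The problem is that "Title" is repeated. As you can see, because of a quirk in the data, EDUCATION and WORK appear in the data as "false" headers. Is it possible to tag each Field entry with its all-caps header so that dcast works (i.e. Title_EDUCATION and Title_WORK)? Even better would be to apply this transformation to the whole Field column, so that "EDUCATION" and "WORK" disappear altogether and we are left with only Degree_EDUCATION, Title_EDUCATION, and so on.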
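Note that the real data has many more headers, so it would be best to identify the "false headers" as the all-caps entries, or as the entries where Field == Values.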
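The two ways of spotting the false headers mentioned above can be checked against each other. This is only a minimal sketch, assuming both columns are character vectors; `is_caps` and `is_self` are names made up for illustration.

```r
# Toy copy of Mike's rows; is_caps and is_self are hypothetical names
Field  <- c("EDUCATION", "Degree", "Title", "WORK", "Title")
Values <- c("EDUCATION", "Masters", "Student", "WORK", "VP Sales")

# Rule 1: the entry is written entirely in upper case
is_caps <- Field == toupper(Field)

# Rule 2: the entry repeats itself in the Values column
is_self <- Field == Values

# Check whether the two rules flag the same rows on this sample
identical(is_caps, is_self)
```

The rules can diverge on real data: an all-caps field name (a hypothetical "GPA", say) would satisfy rule 1 but not rule 2, so it is worth running this comparison on the full data before picking one.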
Desired output:
output <- data.frame(
Name=c("Mike", "Joe"),
Degree_EDUCATION =c("Masters", "Bachelors"),
Title_EDUCATION = c("Student", "Student"),
Title_WORK= c("VP Sales", "Analyst"))
Name Degree_EDUCATION Title_EDUCATION Title_WORK
1 Mike Masters Student VP Sales
2 Joe Bachelors Student Analyst
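The key is to turn the repeated category rows into a new column; after that the data is easy to work with.
First, add stringsAsFactors=FALSE so that Field and Values are character vectors and can be compared (from R 4.0.0 onward this is already the default):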
demodf <- data.frame(
name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"),
stringsAsFactors=FALSE)
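Now use tidyr and dplyr to add two columns: one flagging whether the row is a category row, and one holding the category name for those rows. Then fill the missing category values downward, and finally drop the extra rows and the helper column.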
library(tidyr)
library(dplyr)
d2 <- demodf %>% mutate(IsCategory=Field==Values,
Category=ifelse(IsCategory, Field, NA)) %>%
fill(Category) %>% subset(!IsCategory, select=-IsCategory)
d2
## name Field Values Category
## 2 Mike Degree Masters EDUCATION
## 3 Mike Title Student EDUCATION
## 5 Mike Title VP Sales WORK
## 7 Joe Degree Bachelors EDUCATION
## 8 Joe Title Student EDUCATION
## 10 Joe Title Analyst WORK
Now dcast will work as you want!
library(reshape2)
dcast(d2, name ~ Field+Category, value.var="Values")
## name Degree_EDUCATION Title_EDUCATION Title_WORK
## 1 Joe Bachelors Student Analyst
## 2 Mike Masters Student VP Sales
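The same reshaping can probably be done without reshape2. The sketch below is not part of the answer above; it assumes tidyr 1.0.0 or later, where `pivot_wider` supersedes `spread`, and the name `wide` is made up for illustration.

```r
library(dplyr)
library(tidyr)

# Same demo data as above, built with rep() for brevity
demodf <- data.frame(
  name   = rep(c("Mike", "Joe"), each = 5),
  Field  = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE
)

wide <- demodf %>%
  mutate(Category = ifelse(Field == Values, Field, NA)) %>%  # label header rows
  fill(Category) %>%                                          # carry the label down
  filter(Field != Values) %>%                                 # drop the header rows
  pivot_wider(id_cols = name,
              names_from = c(Field, Category),                # joined with "_" by default
              values_from = Values)
wide
```

Unlike dcast, which sorts the result by name, this version should keep the rows in order of first appearance (Mike, then Joe), matching the desired output.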
Here is an attempt with data.table. It requires stringsAsFactors=FALSE.
library(data.table)
# get groupings by titles (all caps)
setDT(demodf)[, head := cumsum(Field == toupper(Field))]
# merge titles onto full dataset and paste title to Field
demodf[demodf[Field == toupper(Field), .(Field, head)], on="head",
Field := paste(Field, i.Field, sep="_"), by=.EACHI]
# now reshape wide
dcast(demodf[Values != toupper(Values),], name~Field, value.var="Values")
This returns
name Degree_EDUCATION Title_EDUCATION Title_WORK
1: Joe Bachelors Student Analyst
2: Mike Masters Student VP Sales
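The grouping step in the data.table code rests on a cumsum trick that is easier to see on a plain vector. A base R sketch, using Mike's rows only; `is_header`, `head_id`, `header_of_row` and `tagged` are illustrative names, not part of the answer.

```r
# Mike's Field entries, in their original order
Field <- c("EDUCATION", "Degree", "Title", "WORK", "Title")

# TRUE wherever the entry is an all-caps (false) header
is_header <- Field == toupper(Field)

# Each header bumps the counter, so rows under it share its id
head_id <- cumsum(is_header)              # 1 1 1 2 2

# Look up the header text that belongs to each row
header_of_row <- Field[is_header][head_id]

# Tag the real fields with their header and drop the header rows
tagged <- paste(Field, header_of_row, sep = "_")[!is_header]
tagged
```

This relies on the first row being a header: otherwise head_id starts at 0, and indexing with 0 silently drops those rows from the lookup.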
Data
demodf <-
structure(list(name = c("Mike", "Mike", "Mike", "Mike", "Mike",
"Joe", "Joe", "Joe", "Joe", "Joe"), Field = c("EDUCATION", "Degree",
"Title", "WORK", "Title", "EDUCATION", "Degree", "Title", "WORK",
"Title"), Values = c("EDUCATION", "Masters", "Student", "WORK",
"VP Sales", "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"
)), .Names = c("name", "Field", "Values"), row.names = c(NA,
-10L), class = "data.frame")