Tidy and Cast Data With Headers Stuck in Rows

demodf <- data.frame(
  name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
  Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"))

   name     Field    Values
1  Mike EDUCATION EDUCATION
2  Mike    Degree   Masters
3  Mike     Title   Student
4  Mike      WORK      WORK
5  Mike     Title  VP Sales
6   Joe EDUCATION EDUCATION
7   Joe    Degree Bachelors
8   Joe     Title   Student
9   Joe      WORK      WORK
10  Joe     Title   Analyst

I would like to use tidyr::spread or reshape2::dcast to convert this to wide format, with the entries of Field becoming the column headers.

The code would look like dcast(demodf, name ~ Values) or demodf %>% spread(Field, Values). However, dcast coerces the result to numeric, and spread throws an error.
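A quick way to see why the reshape cannot work as-is is to look for (name, Field) pairs that occur more than once. The following is a minimal base R sketch; the name dup_rows is only illustrative.

```r
# Example data from the question, built with character columns
demodf <- data.frame(
  name = rep(c("Mike", "Joe"), each = 5),
  Field = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), times = 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE)

# Rows whose (name, Field) combination has already appeared earlier
dup_rows <- demodf[duplicated(demodf[c("name", "Field")]), ]
dup_rows
```

Only the second "Title" row of each person is flagged, so "Title" is the one field that does not identify a single cell per name.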

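The problem is that "Title" is duplicated. As you can see, because of a quirk in the data, EDUCATION and WORK appear in the data as "false" headers. Is it possible to tag each Field entry with the all-caps header above it so that dcast works (i.e. Title_EDUCATION, Title_WORK)? It would be even better to apply this transformation to the whole of Field, so that "EDUCATION" and "WORK" disappear altogether and we are left with only Degree_EDUCATION, Title_EDUCATION, and so on.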

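Note that the real data has many more headers, so it would be best to identify the "false headers" as the all-caps entries, or as the entries where Field == Values.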
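The two ways of identifying the false headers can be compared directly. This is a small base R sketch on the example data; is_caps and is_same are hypothetical names, and whether both rules agree on the real data is an assumption that would need checking.

```r
# Example data from the question, built with character columns
demodf <- data.frame(
  name = rep(c("Mike", "Joe"), each = 5),
  Field = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), times = 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE)

# Rule 1: the Field entry is written entirely in capitals
is_caps <- demodf$Field == toupper(demodf$Field)

# Rule 2: the Field entry is repeated in the Values column
is_same <- demodf$Field == demodf$Values

# Both rules should flag the same rows on this example
identical(is_caps, is_same)
which(is_caps)
```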

Desired output:

output <- data.frame(
 Name=c("Mike", "Joe"),
 Degree_EDUCATION =c("Masters", "Bachelors"),
 Title_EDUCATION = c("Student", "Student"),
 Title_WORK= c("VP Sales", "Analyst"))

  Name Degree_EDUCATION Title_EDUCATION Title_WORK
1 Mike          Masters         Student   VP Sales
2  Joe        Bachelors         Student    Analyst

The key is to turn the repeated category rows into a new column; after that the data is easy to reshape.

First, add stringsAsFactors=FALSE so that Field and Values can be compared:

demodf <- data.frame(
  name = c("Mike","Mike","Mike","Mike","Mike","Joe","Joe","Joe","Joe","Joe"),
  Field = c("EDUCATION","Degree","Title","WORK", "Title", "EDUCATION","Degree","Title", "WORK","Title"),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales", "EDUCATION", "Bachelors","Student", "WORK", "Analyst"),
  stringsAsFactors=FALSE)

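Now use tidyr and dplyr to add two columns, one flagging whether the row is a category and one holding that category's name. Then fill in the missing category values, and finally drop the extra rows and columns.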

library(tidyr)
library(dplyr)
d2 <- demodf %>% mutate(IsCategory=Field==Values,
                        Category=ifelse(IsCategory, Field, NA)) %>%
  fill(Category) %>% subset(!IsCategory, select=-IsCategory)
d2
##    name  Field    Values  Category
## 2  Mike Degree   Masters EDUCATION
## 3  Mike  Title   Student EDUCATION
## 5  Mike  Title  VP Sales      WORK
## 7   Joe Degree Bachelors EDUCATION
## 8   Joe  Title   Student EDUCATION
## 10  Joe  Title   Analyst      WORK
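For readers who want to see what the fill step is doing, the same carry-forward can be sketched in base R with cumsum. This is an illustrative alternative, not the code above; it assumes the first row of the data is a header row, and is_header, category and d2_base are hypothetical names.

```r
# Example data from the question, built with character columns
demodf <- data.frame(
  name = rep(c("Mike", "Joe"), each = 5),
  Field = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), times = 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE)

# Header rows are the ones where Field is repeated in Values
is_header <- demodf$Field == demodf$Values

# cumsum() numbers each block of rows by the header that opens it;
# indexing the header labels with that number carries each label down
category <- demodf$Field[is_header][cumsum(is_header)]

# Keep only the data rows and attach the carried-forward category
d2_base <- demodf[!is_header, ]
d2_base$Category <- category[!is_header]
d2_base
```

On the example data this gives the same rows and Category values as d2.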

dcast now works as you wanted!

library(reshape2)    
dcast(d2, name ~ Field+Category, value.var="Values")
##   name Degree_EDUCATION Title_EDUCATION Title_WORK
## 1  Joe        Bachelors         Student    Analyst
## 2 Mike          Masters         Student   VP Sales
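If you would rather stay within tidyr, pivot_wider (available from tidyr 1.0.0, where it supersedes spread) can combine two columns into the new header names. The following is a sketch under that version assumption; it rebuilds the intermediate table without dplyr so that it is self-contained, and wide is a hypothetical name.

```r
library(tidyr)

# Example data from the question, built with character columns
demodf <- data.frame(
  name = rep(c("Mike", "Joe"), each = 5),
  Field = rep(c("EDUCATION", "Degree", "Title", "WORK", "Title"), times = 2),
  Values = c("EDUCATION", "Masters", "Student", "WORK", "VP Sales",
             "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"),
  stringsAsFactors = FALSE)

# Label header rows with their own name, leave the rest NA, then fill down
demodf$Category <- ifelse(demodf$Field == demodf$Values, demodf$Field, NA)
d2 <- fill(demodf, Category)

# Drop the header rows themselves
d2 <- d2[d2$Field != d2$Values, ]

# Build the column names from Field and Category, joined by "_" by default
wide <- as.data.frame(
  pivot_wider(d2, names_from = c(Field, Category), values_from = Values))
wide
```

Unlike dcast, pivot_wider keeps the rows in order of first appearance, so Mike comes before Joe here.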

Here is an attempt with data.table. It requires stringsAsFactors=FALSE.

library(data.table)
# get groupings by titles (all caps)
setDT(demodf)[, head := cumsum(Field == toupper(Field))]
# merge titles onto full dataset and paste title to Field
demodf[demodf[Field == toupper(Field), .(Field, head)], on="head",
       Field := paste(Field, i.Field, sep="_"), by=.EACHI]
# now reshape wide
dcast(demodf[Values != toupper(Values),], name~Field, value.var="Values")

This returns

   name Degree_EDUCATION Title_EDUCATION Title_WORK
1:  Joe        Bachelors         Student    Analyst
2: Mike          Masters         Student   VP Sales

Data

demodf <-
structure(list(name = c("Mike", "Mike", "Mike", "Mike", "Mike", 
"Joe", "Joe", "Joe", "Joe", "Joe"), Field = c("EDUCATION", "Degree", 
"Title", "WORK", "Title", "EDUCATION", "Degree", "Title", "WORK", 
"Title"), Values = c("EDUCATION", "Masters", "Student", "WORK", 
"VP Sales", "EDUCATION", "Bachelors", "Student", "WORK", "Analyst"
)), .Names = c("name", "Field", "Values"), row.names = c(NA, 
-10L), class = "data.frame")