How to separate one column into multiple columns (complex column)

I am trying to separate the column "Grade" into multiple columns, one per subject, each holding the corresponding grade.

    grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",sep=";")

    # Rename the column names

    names(grade)<-c("Student_ID","Name","Venue","Grade")

    head(grade)

    # Separate `Grade` into `subject` variables and corresponding `Grade` columns
    library(tidyverse)


    df<- grade %>% separate(Grade,paste("V",1:7,sep="_"),sep=":")

    head(df)

    # It is still not separating `subject` and `grade` independently

    # Here is what I want it to look like

    new_df<-df[c(1:5),c(1:4)]

    new_df<-data.frame(new_df, V2=c(1:5)) # the same for V2, 4, 5, 6, 7 to separate subject and grade

    new_df 

I have tried using dplyr and stringr, but could not produce the result I expected.

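Here is an attempt using the tidyverse package. After converting everything to character (i.e. grade[] <- lapply(grade, as.character)), we create a custom function that returns the sorted subject: grade pairs for each Student_ID. We then make the data long with unnest and split it into two columns, Subject and Grade, with separate. Finally, we spread to get one column per subject.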

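    library(tidyverse)

    # This function could definitely be more elegant or even avoided,
    # but this is as far as my regex knowledge allows me to go

    mysplit <- function(x){
      # split on "colon + whitespace" or on whitespace alone
      # (backslashes must be doubled inside an R string)
      y <- strsplit(x, ':\\s+|\\s+')[[1]]
      # odd positions are subjects, even positions are grades
      z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
      # sort the pairs by subject name
      return(z[order(sub(':.*', '', z))])
    }

    grade %>% 
      mutate(Grade = lapply(Grade, mysplit)) %>% 
      unnest(Grade) %>% 
      separate(Grade, into = c('Subject', 'Grade'), sep = ': ') %>% 
      spread(Subject, Grade)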

which gives:

    ...     Biology Chemitry English Geography History Literature Math Physics
    ...   1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
    ...   2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
    ...   3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
    ...   4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
    ...   5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
    ...   6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40
    .
    .

To better understand the function, it helps to break it down. Suppose x is as follows:

    x
    #[1] "Math:   4.25   Literature:   7.50   Physics:   6.80   Chemitry:   6.00   Biology:   6.00"

Split it on every colon followed by whitespace, or on whitespace alone, to get the following vector:

    y <- strsplit(x, ':\\s+|\\s+')[[1]]
    y
     #[1] "Math"       "4.25"       "Literature" "7.50"       "Physics"    "6.80"       "Chemitry"   "6.00"       "Biology"    "6.00"

Paste it back together, taking every first element (i.e. the subjects, y[c(TRUE, FALSE)]) and every second element (i.e. the grades, y[c(FALSE, TRUE)]), with ': ' as the separator:

    z <- paste0(y[c(TRUE, FALSE)], ': ', y[c(FALSE, TRUE)])
    z
    #[1] "Math: 4.25"       "Literature: 7.50" "Physics: 6.80"    "Chemitry: 6.00"   "Biology: 6.00"   

Finally, it returns the vector sorted by subject name (the part extracted with sub(':.*', '', z)):

    z[order(sub(':.*', '', z))]
    #[1] "Biology: 6.00"    "Chemitry: 6.00"   "Literature: 7.50" "Math: 4.25"       "Physics: 6.80"
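The recycling of c(TRUE, FALSE) along y is what does the work here. As a small self-contained sketch in base R (the shortened sample string and the names subjects, grades and named are mine, not part of the question), the same split can also be turned into a named numeric vector, which makes the subject/grade pairing explicit:

```r
# Shortened sample string in the same layout as x above (illustrative only)
x <- "Math:   4.25   Literature:   7.50   Physics:   6.80"

# Split on "colon + whitespace" or on whitespace alone
y <- strsplit(x, ":\\s+|\\s+")[[1]]

# c(TRUE, FALSE) is recycled along y, so it keeps positions 1, 3, 5, ...
subjects <- y[c(TRUE, FALSE)]
# c(FALSE, TRUE) keeps positions 2, 4, 6, ...
grades <- as.numeric(y[c(FALSE, TRUE)])

# Pair each grade with its subject and sort by subject name
named <- setNames(grades, subjects)
sorted <- named[order(names(named))]
print(sorted)
```

This only works when subjects and grades strictly alternate, which is an assumption about the data rather than something the code checks.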

As @rosscova points out, the strings do not need to be sorted, which simplifies things a lot (no function is needed at all), i.e.

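    grade %>% 
      # split on the whitespace that follows a digit; the lookbehind keeps the
      # last digit of each grade, which a plain '[0-9]\\s+' split would consume
      mutate(Grade = strsplit(Grade, '(?<=[0-9])\\s+', perl = TRUE)) %>% 
      unnest(Grade) %>% 
      # ':\\s+' tolerates one or more spaces after the colon
      separate(Grade, into = c('Subject', 'Grade'), sep = ':\\s+') %>% 
      spread(Subject, Grade)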
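For anyone on a recent tidyr, here is a hedged sketch of the same idea on a toy data frame, using pivot_wider in place of spread (spread is superseded in current tidyr). The toy data frame and its values are made up to mimic the renamed grade data, and the grades are converted to numeric, which the pipeline above does not do:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same layout as the renamed `grade` data frame
toy <- data.frame(
  Student_ID = c(1, 2),
  Grade = c("Math:   4.25   Literature:   7.50",
            "Math:   5.75   Chemitry:   6.00"),
  stringsAsFactors = FALSE
)

wide <- toy %>%
  # one list element per student, one "Subject: grade" string per pair
  mutate(Grade = strsplit(Grade, "(?<=[0-9])\\s+", perl = TRUE)) %>%
  unnest(Grade) %>%
  # split each pair into its subject and its grade
  separate(Grade, into = c("Subject", "Grade"), sep = ":\\s+") %>%
  mutate(Grade = as.numeric(Grade)) %>%
  # one column per subject; missing subjects become NA
  pivot_wider(names_from = Subject, values_from = Grade)

print(wide)
```

Subjects a student did not take come out as NA, just as in the spread version.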

In my solution below I use functions from the tidyverse and rebus packages. The rebus package builds regular expressions piece by piece using human-readable code.

    library(tidyverse)
    library(rebus)
    grade<-read.csv("https://raw.githubusercontent.com/tuyenhavan/Statistics/Dataset/High_school_Grade.csv",
                    sep = ";", stringsAsFactors = FALSE)

    grade_new <- grade %>%
      mutate(DIEM_THI2 = str_replace_all(DIEM_THI, pattern = ":" %R% one_or_more(SPC), "-")) %>%
      separate_rows(DIEM_THI2, sep = one_or_more(SPC)) %>%
      separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
      spread(SUBJECT,GRADE)

The resulting data frame looks like this:

    head(grade_new[,5:12])
    #   Biology Chemitry English Geography History Literature Math Physics
    # 1    6.00     6.00    <NA>      <NA>    <NA>       7.50 4.25    6.80
    # 2    5.80     6.00    <NA>      <NA>    <NA>       6.00 5.75    <NA>
    # 3    <NA>     <NA>    <NA>      8.00    4.50       7.75 2.25    <NA>
    # 4    <NA>     <NA>    <NA>      7.25    7.50       7.75 3.25    <NA>
    # 5    <NA>     <NA>    <NA>      7.75    4.50       8.25 1.75    <NA>
    # 6    <NA>     6.60    6.78      <NA>    <NA>       7.00 8.75    8.40

The code can be read as follows:

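  1. Every colon-plus-whitespace substring is replaced with a hyphen, i.e. "Math: 4.25 Literature: 7.50" becomes "Math-4.25 Literature-7.50". This is done with the str_replace_all function. Let's call the new variable DIEM_THI2.
  2. The separate_rows function splits the space-separated column DIEM_THI2 into separate rows, i.e. "Math-4.25" and "Literature-7.50" end up on two different rows.
  3. The DIEM_THI2 column is split into two columns, SUBJECT and GRADE, where the former holds values such as "Math" and "Literature" and the latter holds values such as "4.25" and "7.50".
  4. The key-value pairs, i.e. the SUBJECT-GRADE pairs, are spread across multiple columns, one per subject.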
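The four steps can be tried out on a toy data frame without downloading the CSV. This is only a sketch: the data frame toy, its ID column and its values are made up, and the plain regexes ":\\s+" and "\\s+" are what I believe to be the equivalents of the rebus expressions, written out by hand so that rebus is not needed here.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical toy data; only the DIEM_THI column name is taken from the CSV
toy <- data.frame(
  ID = c(1, 2),
  DIEM_THI = c("Math: 4.25 Literature: 7.50", "Math: 5.75 Chemitry: 6.00"),
  stringsAsFactors = FALSE
)

toy_wide <- toy %>%
  # step 1: "Math: 4.25" -> "Math-4.25"
  mutate(DIEM_THI2 = str_replace_all(DIEM_THI, ":\\s+", "-")) %>%
  # step 2: one row per "SUBJECT-GRADE" pair
  separate_rows(DIEM_THI2, sep = "\\s+") %>%
  # step 3: split each pair into two columns
  separate(DIEM_THI2, c("SUBJECT", "GRADE"), sep = "-") %>%
  # step 4: one column per subject
  spread(SUBJECT, GRADE)

print(toy_wide)
```

Note that the grades stay character strings here, as in the original pipeline; wrap GRADE in as.numeric before spreading if numbers are needed.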