How can I sort the columns of a large data frame based on specific criteria?
I want to sort the columns of a large data frame (approximately 14,000 variables) according to a specific scheme.
The column names have the following structure (Condition_Sleepstage_Parameter_Electrode_Nightpart):
[1] "Adapt_N2_negLengthLoc_C3_firstHour" "Adapt_N3_negLengthLoc_C3_firstHour"
[3] "Adapt_NREM_negLengthLoc_C3_firstHour" "Book_N2_negLengthLoc_C3_firstHour"
[5] "Book_N3_negLengthLoc_C3_firstHour" "Book_NREM_negLengthLoc_C3_firstHour"
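The columns in R are arranged in purely alphabetical order, but I would like them arranged according to the following scheme:
First, the variables should be presented in blocks, one block per parameter (order: "negLengthLoc", "posLength", "wholeLength", "negPeak", "nbnegPeaks", "initialMeannegSlope", "finalMeannegSlope", "initialMaxnegslope", "finalMaxnegslope", "posPeak", "nbposPeaks", "initialMeannposSlope", "finalMeanposSlope", "initialMaxposSlope", "PeaktoPeak", "Number", "Density").
Within these blocks, the highest level of the hierarchy should be the condition (order: "Adapt", "NoFilter", "Filter", "Book").
The next level of the hierarchy should be defined by the electrode (order: "F3", "Fz", "F4", "C3", "Cz", "C4", "P3", "Pz", "P4", "O1", "O2").
After that comes the night part (order: "firstHour", "firstQuarter", "secondQuarter", "thirdQuarter", "fourthQuarter", "wholeNight"), and finally the sleep stage (order: "N2", "N3", "NREM").
The resulting order should look like this: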
[1] "Adapt_N2_negLengthLoc_F3_firstHour" "Adapt_N3_negLengthLoc_F3_firstHour"
[3] "Adapt_NREM_negLengthLoc_F3_firstHour" "Adapt_N2_negLengthLoc_F3_firstQuarter"
[5] "Adapt_N3_negLengthLoc_F3_firstQuarter" "Adapt_NREM_negLengthLoc_F3_firstQuarter"
[7] "Adapt_N2_negLengthLoc_F3_secondQuarter" "Adapt_N3_negLengthLoc_F3_secondQuarter"
[9] "Adapt_NREM_negLengthLoc_F3_secondQuarter" "Adapt_N2_negLengthLoc_F3_thirdQuarter"
[11] "Adapt_N3_negLengthLoc_F3_thirdQuarter" "Adapt_NREM_negLengthLoc_F3_thirdQuarter"
[13] "Adapt_N2_negLengthLoc_F3_fourthQuarter" "Adapt_N3_negLengthLoc_F3_fourthQuarter"
[15] "Adapt_NREM_negLengthLoc_F3_fourthQuarter" "Adapt_N2_negLengthLoc_F3_wholeNight"
[17] "Adapt_N3_negLengthLoc_F3_wholeNight" "Adapt_NREM_negLengthLoc_F3_wholeNight"
[19] "Adapt_N2_negLengthLoc_Fz_firstHour" "Adapt_N3_negLengthLoc_Fz_firstHour"
...
I hope someone can help me. If anything is unclear, I am of course happy to provide more information!
Thanks in advance!
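Using the mtcars data to illustrate, the columns of a data frame can be reordered by creating a vector of column names in the desired order and using it as the column specification in the [ form of the extract operator.
First, we'll use colnames() to extract the original column order and print it: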
theNames <- colnames(mtcars)
theNames
> theNames
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"
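Next, we'll move the integer-valued columns cyl, vs, am, gear, and carb to the left side of the data frame by creating a reorderedNames vector and using it with [. Note that hp is not included in the vector, so it is dropped from the result.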
reorderedNames <- c("cyl" , "vs" , "am" , "gear" ,"carb","disp" ,
"drat", "wt" , "qsec", "mpg")
mtcars[,reorderedNames]
...and the first few rows of the output:
> mtcars[,reorderedNames]
cyl vs am gear carb disp drat wt qsec mpg
Mazda RX4 6 0 1 4 4 160.0 3.90 2.620 16.46 21.0
Mazda RX4 Wag 6 0 1 4 4 160.0 3.90 2.875 17.02 21.0
Datsun 710 4 1 1 4 1 108.0 3.85 2.320 18.61 22.8
Hornet 4 Drive 6 1 0 3 1 258.0 3.08 3.215 19.44 21.4
Hornet Sportabout 8 0 0 3 2 360.0 3.15 3.440 17.02 18.7
Valiant 6 1 0 3 1 225.0 2.76 3.460 20.22 18.1
Duster 360 8 0 0 3 4 360.0 3.21 3.570 15.84 14.3
Merc 240D 4 1 0 4 2 146.7 3.69 3.190 20.00 24.4
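The [ operator returns only the columns named in the vector. A minimal sketch, assuming you want to move a few columns to the front while keeping all the others in their existing order (the names frontCols, keepAll, and reordered are illustrative and not part of the original answer):

```r
# Columns to move to the left (illustrative choice)
frontCols <- c("cyl", "vs", "am", "gear", "carb")

# setdiff() returns the remaining column names in their original order,
# so no column is lost when the two vectors are combined
keepAll <- c(frontCols, setdiff(colnames(mtcars), frontCols))

reordered <- mtcars[, keepAll]
colnames(reordered)
```

With this pattern the result still has all 11 columns of mtcars, which matters for a data frame with thousands of columns where listing every name by hand is not practical.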
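Automating this process for a large data frame
In the OP, the question refers to a data frame with a large number of columns. To scale this process up and sort the columns automatically, there are at least two main approaches:
- process the column names of the data frame in a way that allows them to be sorted in the desired order, or
- create a narrow-format tidy data set by using pivot_longer() to split the column names into the desired grouping variables.
We'll use the data from the OP to illustrate approach 1.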
columnName <- c("Adapt_N2_negLengthLoc_C3_firstHour","Adapt_N3_negLengthLoc_C3_firstHour",
"Adapt_NREM_negLengthLoc_C3_firstHour","Book_N2_negLengthLoc_C3_firstHour",
"Book_N3_negLengthLoc_C3_firstHour","Book_NREM_negLengthLoc_C3_firstHour")
splitCols <- strsplit(columnName,"_")
# build a one-row data frame of name parts for each column name
results <- lapply(splitCols,function(x){
     parameter <- x[3]
     condition <- x[1]
     electrode <- x[4]
     nightpart <- x[5]
     sleepstage <- x[2]
     data.frame(parameter,condition,electrode,nightpart,sleepstage)
})
colsData <- do.call(rbind,results)
# add original column names back into data
colsData <- cbind(columnName,colsData)
# convert to factors, specifying the factor order for sorting
conditionOrder <- c("Adapt", "NoFilter", "Filter", "Book")
parameterOrder <- c("negLengthLoc", "posLength", "wholeLength", "negPeak", "nbnegPeaks",
"initialMeannegSlope", "finalMeannegSlope", "initialMaxnegslope",
"finalMaxnegslope", "posPeak", "nbposPeaks", "initialMeannposSlope",
"finalMeanposSlope", "initialMaxposSlope", "PeaktoPeak", "Number", "Density")
electrodeOrder <- c("F3", "Fz", "F4", "C3", "Cz", "C4", "P3", "Pz", "P4", "O1", "O2")
nightpartOrder <- c("firstHour", "firstQuarter", "secondQuarter", "thirdQuarter", "fourthQuarter", "wholeNight")
sleepstageOrder <- c("N2", "N3", "NREM")
colsData$condition <- factor(colsData$condition,levels = conditionOrder,ordered = TRUE)
colsData$parameter <- factor(colsData$parameter,levels = parameterOrder,ordered = TRUE)
colsData$electrode <- factor(colsData$electrode,levels = electrodeOrder,ordered = TRUE)
colsData$nightpart <- factor(colsData$nightpart,levels = nightpartOrder,ordered = TRUE)
colsData$sleepstage <- factor(colsData$sleepstage,levels = sleepstageOrder,ordered = TRUE)
# finally, sort by factors (parameter first, as requested in the question)
# & create a vector for column number
library(dplyr)
colsData <- arrange(colsData,parameter,condition,electrode,nightpart,sleepstage)
colsData$colId <- seq_len(nrow(colsData))
colsData
...and the output:
> colsData
columnName parameter condition electrode nightpart
1 Adapt_N2_negLengthLoc_C3_firstHour negLengthLoc Adapt C3 firstHour
2 Adapt_N3_negLengthLoc_C3_firstHour negLengthLoc Adapt C3 firstHour
3 Adapt_NREM_negLengthLoc_C3_firstHour negLengthLoc Adapt C3 firstHour
4 Book_N2_negLengthLoc_C3_firstHour negLengthLoc Book C3 firstHour
5 Book_N3_negLengthLoc_C3_firstHour negLengthLoc Book C3 firstHour
6 Book_NREM_negLengthLoc_C3_firstHour negLengthLoc Book C3 firstHour
sleepstage colId
1 N2 1
2 N3 2
3 NREM 3
4 N2 4
5 N3 5
6 NREM 6
>
At this point we can use colsData$columnName to reorder the columns of the original data frame.
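As a minimal sketch of that last step in base R, assuming a toy one-row data frame df whose columns carry names from the question (df, parts, ord, and dfSorted are illustrative names, and parameterOrder is shortened to three values for brevity), order() combined with match() yields the column permutation directly, with parameter as the first sort key:

```r
# Toy data frame with one row; column names are deliberately out of order
columnName <- c("Book_N2_negLengthLoc_C3_firstHour",
                "Adapt_NREM_negLengthLoc_C3_firstHour",
                "Adapt_N2_negLengthLoc_Fz_firstHour",
                "Adapt_N2_negLengthLoc_C3_firstHour")
df <- as.data.frame(matrix(seq_along(columnName), nrow = 1))
colnames(df) <- columnName

# Desired orders (parameterOrder is shortened here; use the full vector in practice)
parameterOrder  <- c("negLengthLoc", "posLength", "wholeLength")
conditionOrder  <- c("Adapt", "NoFilter", "Filter", "Book")
electrodeOrder  <- c("F3", "Fz", "F4", "C3", "Cz", "C4", "P3", "Pz", "P4", "O1", "O2")
nightpartOrder  <- c("firstHour", "firstQuarter", "secondQuarter",
                     "thirdQuarter", "fourthQuarter", "wholeNight")
sleepstageOrder <- c("N2", "N3", "NREM")

# One matrix row per column name, one matrix column per name part
parts <- do.call(rbind, strsplit(colnames(df), "_"))

# match() converts each part to its rank in the desired order;
# order() sorts by parameter, condition, electrode, nightpart, sleepstage
ord <- order(match(parts[, 3], parameterOrder),
             match(parts[, 1], conditionOrder),
             match(parts[, 4], electrodeOrder),
             match(parts[, 5], nightpartOrder),
             match(parts[, 2], sleepstageOrder))

dfSorted <- df[, ord, drop = FALSE]
colnames(dfSorted)
```

A name part that does not appear in its ordering vector gets NA from match(), and order() places such columns last by default, so checking for NA values first is a cheap way to catch misspelled column names.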
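You have to split each column name into the parts it contains. This is done with str_split from the stringr package. It produces a list with one entry per column name, and each entry is a character vector containing the different parts.
To create new columns containing the different parts, I use map_chr from the purrr package to access the corresponding entry for each column name. Then the rows of this table are arranged. To get the order you want, convert the characters to a factor and specify the order with levels. The new order of the columns is given by the column rowid: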
old_order <- data.frame(col_names = c("Adapt_N2_negLengthLoc_C3_firstHour", "Adapt_N3_negLengthLoc_C3_firstHour",
"Adapt_NREM_negLengthLoc_C3_firstHour", "Book_N2_negLengthLoc_C3_firstHour",
"Book_N3_negLengthLoc_C3_firstHour", "Book_NREM_negLengthLoc_C3_firstHour",
"Adapt_N2_negLengthLoc_Fz_firstHour", "Adapt_N3_negLengthLoc_Fz_firstHour"))
library(dplyr)
library(stringr)
splitted_names <- str_split(old_order$col_names, "_")
new_order <- old_order %>%
  tibble::rowid_to_column() %>%
  # one new column per part of the column name
  mutate(Condition = purrr::map_chr(splitted_names, `[`, 1),
         Sleepstage = purrr::map_chr(splitted_names, `[`, 2),
         Parameter = purrr::map_chr(splitted_names, `[`, 3),
         Electrode = purrr::map_chr(splitted_names, `[`, 4),
         Nightpart = purrr::map_chr(splitted_names, `[`, 5)) %>%
  # sort keys in order of priority; factor levels define the order within each key
  arrange(factor(Parameter, levels = c("negLengthLoc", "posLength", "wholeLength", "negPeak", "nbnegPeaks",
                                       "initialMeannegSlope", "finalMeannegSlope", "initialMaxnegslope",
                                       "finalMaxnegslope", "posPeak", "nbposPeaks", "initialMeannposSlope",
                                       "finalMeanposSlope", "initialMaxposSlope", "PeaktoPeak", "Number", "Density")),
          factor(Condition, levels = c("Adapt", "NoFilter", "Filter", "Book")),
          factor(Electrode, levels = c("F3", "Fz", "F4", "C3", "Cz", "C4", "P3", "Pz", "P4", "O1", "O2")),
          factor(Nightpart, levels = c("firstHour", "firstQuarter", "secondQuarter", "thirdQuarter", "fourthQuarter", "wholeNight")),
          factor(Sleepstage, levels = c("N2", "N3", "NREM"))) %>%
  pull(rowid)
old_order$col_names[new_order]
[1] Adapt_N2_negLengthLoc_Fz_firstHour Adapt_N3_negLengthLoc_Fz_firstHour Adapt_N2_negLengthLoc_C3_firstHour
[4] Adapt_N3_negLengthLoc_C3_firstHour Adapt_NREM_negLengthLoc_C3_firstHour Book_N2_negLengthLoc_C3_firstHour
[7] Book_N3_negLengthLoc_C3_firstHour Book_NREM_negLengthLoc_C3_firstHour
8 Levels: Adapt_N2_negLengthLoc_C3_firstHour ... Book_NREM_negLengthLoc_C3_firstHour
Now that you have the information separated into different columns, I suggest putting the complete dataset into tidy (long) format.
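A minimal base-R sketch of the long format, assuming a small wide data frame with a subject identifier and two of the measurement columns from the question (the data values and the names wide, long, measureCols, and parts are made up for illustration). With the tidyverse, tidyr's pivot_longer() with its names_sep argument is the usual tool for the same reshaping:

```r
# Toy wide data: two subjects, two measurement columns (values are made up)
wide <- data.frame(
  subject = c("s1", "s2"),
  Adapt_N2_negLengthLoc_C3_firstHour = c(0.41, 0.39),
  Book_N3_negLengthLoc_Fz_firstHour  = c(0.52, 0.47),
  stringsAsFactors = FALSE
)

measureCols <- setdiff(colnames(wide), "subject")

# stack() puts all measurement columns into one 'values' column and
# records the source column name in 'ind', one column block after another
long <- data.frame(
  subject = rep(wide$subject, times = length(measureCols)),
  stack(wide[measureCols]),
  stringsAsFactors = FALSE
)

# Split each original column name into its five parts
parts <- do.call(rbind, strsplit(as.character(long$ind), "_"))
colnames(parts) <- c("Condition", "Sleepstage", "Parameter", "Electrode", "Nightpart")

long <- cbind(long[c("subject", "values")],
              as.data.frame(parts, stringsAsFactors = FALSE))
long
```

In this shape each name part is an ordinary column, so sorting, filtering, and grouping by condition or electrode no longer depends on the order of 14,000 columns.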