Transpose after separating only the variable name

I'm new to R, but I'm hooked on mastering it! I'm working on a project for work and I'm completely stumped. Any help is much appreciated!

I need to convert this data frame...

   Brand       UK__Sales__YA   UK__Sales__MAT  CN__Sales__YA  CN__Sales__MAT
1  Snickers    100             110            90             95
2  Twix        50              60             30             35
3  Skittles    75              80             105            130

...into this:

   Brand       Country     Year      Sales
1  Snickers    UK          YA        100
2  Snickers    UK          MAT       110
3  Snickers    CN          YA        90
4  Snickers    CN          MAT       95
5  Twix        UK          YA        50
6  Twix        UK          MAT       60
7  Twix        CN          YA        30
8  Twix        CN          MAT       35
9  Skittles    UK          YA        75
10 Skittles    UK          MAT       80
11 Skittles    CN          YA        105
12 Skittles    CN          MAT       130

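As you can see, I need to split off the first and last parts of the sales variable names and turn them into their own stacked columns. My dataset has other countries and other metrics as well, but I think that if you can help me solve this one, I can work out the rest. Thanks!! :-)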

Take a look at the tidyr package. In fact, all of the packages in the tidyverse are helpful for this kind of data manipulation:

library(tidyr)
library(dplyr)

df %>%
  gather(key, Sales, -Brand) %>%
  separate(key, c("Country", "delete", "Year"), sep = "__") %>%
  select(-delete) %>%
  arrange(Brand)

#       Brand Country Year Sales
# 1  Skittles      UK   YA    75
# 2  Skittles      UK  MAT    80
# 3  Skittles      CN   YA   105
# 4  Skittles      CN  MAT   130
# 5  Snickers      UK   YA   100
# 6  Snickers      UK  MAT   110
# 7  Snickers      CN   YA    90
# 8  Snickers      CN  MAT    95
# 9      Twix      UK   YA    50
# 10     Twix      UK  MAT    60
# 11     Twix      CN   YA    30
# 12     Twix      CN  MAT    35

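To understand what is going on, run each step of the pipe (%>%) on its own. For example, look at the output of df %>% gather(key, Sales, -Brand) to see what it does, then add the separate step and see how the result changes.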
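As a rough sketch of that step-by-step inspection, the snippet below rebuilds the question's data frame and stores each intermediate result. The names step1, step2 and step3 are only illustrative and do not appear in the answer above.

```r
library(tidyr)
library(dplyr)

# Rebuild the question's data frame so the snippet runs on its own
df <- data.frame(
  Brand = c("Snickers", "Twix", "Skittles"),
  UK__Sales__YA  = c(100, 50, 75),
  UK__Sales__MAT = c(110, 60, 80),
  CN__Sales__YA  = c(90, 30, 105),
  CN__Sales__MAT = c(95, 35, 130),
  stringsAsFactors = FALSE
)

# Step 1: stack every column except Brand into key/value pairs
step1 <- df %>% gather(key, Sales, -Brand)
str(step1)  # columns Brand, key, Sales

# Step 2: split the old column names on the double underscore
step2 <- step1 %>%
  separate(key, c("Country", "delete", "Year"), sep = "__")
str(step2)  # key is replaced by Country, delete, Year

# Step 3: drop the middle piece, which is always "Sales" here
step3 <- step2 %>% select(-delete)
str(step3)
```

Printing each object (or using head) shows which step is responsible for which part of the final shape.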

Here is one option with the tidyverse. We gather the data into 'long' format and then extract the 'Var' column into 'Country' and 'Year'.

library(tidyr)
library(dplyr)
gather(df1, Var, Sales, -Brand) %>%
    extract(Var, into = c("Country", "Year"), "(\\w+)__\\w+__(\\w+)")
#      Brand Country Year Sales
#1  Snickers      UK   YA   100
#2      Twix      UK   YA    50
#3  Skittles      UK   YA    75
#4  Snickers      UK  MAT   110
#5      Twix      UK  MAT    60
#6  Skittles      UK  MAT    80
#7  Snickers      CN   YA    90
#8      Twix      CN   YA    30
#9  Skittles      CN   YA   105
#10 Snickers      CN  MAT    95
#11     Twix      CN  MAT    35
#12 Skittles      CN  MAT   130

The corresponding option with data.table is

library(data.table)
melt(setDT(df1), id.vars = "Brand", value.name = "Sales")[, 
 c("Country", "Year") := tstrsplit(variable, "__")[-2]][, variable := NULL][]
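To see what the tstrsplit(...)[-2] part contributes, it can be tried on a couple of the column names by themselves. This is only a small sketch; the vector vars and the objects parts and kept are made up for illustration.

```r
library(data.table)

# Two of the original column names, as they appear in the melted 'variable' column
vars <- c("UK__Sales__YA", "CN__Sales__MAT")

# tstrsplit splits each string on "__" and transposes the result,
# giving one list element per piece: country, metric, period
parts <- tstrsplit(vars, "__")
length(parts)

# Dropping the second element leaves only the country and period pieces,
# which is what gets assigned to Country and Year
kept <- parts[-2]
kept
```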

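1) dplyr/tidyr Using the data shown reproducibly in the Note at the end, gather the data frame from wide to long form and then separate the new column into its parts. Then spread the new Variable column out, using the Value column for its values (this gives Sales for DF, and Price and Sales for DF2), and finally sort. The last line of code can be omitted if order does not matter.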

library(dplyr)
library(tidyr)

DF %>% 
  gather(new, Value, -Brand) %>%
  separate(new, c("Country", "Variable", "Year"), sep = "__") %>%
  spread(Variable, Value) %>%
  arrange(Brand, desc(Country), desc(Year))

giving:

      Brand Country Year Sales
1  Skittles      UK   YA    75
2  Skittles      UK  MAT    80
3  Skittles      CN   YA   105
4  Skittles      CN  MAT   130
5  Snickers      UK   YA   100
6  Snickers      UK  MAT   110
7  Snickers      CN   YA    90
8  Snickers      CN  MAT    95
9      Twix      UK   YA    50
10     Twix      UK  MAT    60
11     Twix      CN   YA    30
12     Twix      CN  MAT    35

Note that the above also works with DF2, which is also defined in the Note below.

1a) This slightly shorter alternative also works, but only for DF and not for DF2. Again, the arrange line can be omitted if order does not matter.

DF %>% 
  gather(new, Sales, -Brand) %>%
  separate(new, c("Country", "Year"), sep = "__Sales__") %>%
  arrange(Brand, desc(Country), desc(Year))
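In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. The following is a sketch of the same reshaping with pivot_longer, assuming that version of tidyr is installed; it is not part of the original answer, and the result name long2 is made up. The special ".value" entry in names_to tells pivot_longer to use the middle piece of each name (Sales or Price) as the output column name, so the same call handles both DF and DF2.

```r
library(tidyr)

# Same data as DF and DF2 in the Note, rebuilt here so the snippet is self-contained
DF <- data.frame(
  Brand = c("Snickers", "Twix", "Skittles"),
  UK__Sales__YA  = c(100, 50, 75),
  UK__Sales__MAT = c(110, 60, 80),
  CN__Sales__YA  = c(90, 30, 105),
  CN__Sales__MAT = c(95, 35, 130)
)
DF2 <- cbind(DF, setNames(10 * DF[2:5], sub("Sales", "Price", names(DF)[2:5])))

# Split each name on "__": first piece -> Country, middle piece -> name of
# the value column (".value"), last piece -> Year
long2 <- pivot_longer(
  DF2,
  cols = -Brand,
  names_to = c("Country", ".value", "Year"),
  names_sep = "__"
)
long2
```

Rows come out grouped by Brand in the original row order, so an arrange step is only needed if a different ordering is wanted.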

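2) This alternative uses no packages; it relies on base R's reshape to go from wide to long form. Everything from the rownames(out) <- NULL statement onward can be omitted if row names and order do not matter. This code also works with DF2.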

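# group the measure columns by the middle piece of their names (e.g. "Sales")
varying <- split(names(DF)[-1], sub(".*__(.*)__.*", "\\1", names(DF)[-1]))
# times labels each stacked block as Country__Year, which is parsed further down
long <- reshape(DF, dir = "long", idvar = "Brand", varying = varying, 
   v.names = names(varying), times = sub("__.*__", "__", varying[[1]]))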
out <- transform(long, Country = sub("__.*", "", time), Year = sub(".*__", "", time), 
   time = NULL)
rownames(out) <- NULL
o <- with(out, order(Brand, -xtfrm(Country), -xtfrm(Year)))
out <- out[o, ]
out

giving:

      Brand Sales Country Year
3  Skittles    75      UK   YA
6  Skittles    80      UK  MAT
9  Skittles   105      CN   YA
12 Skittles   130      CN  MAT
1  Snickers   100      UK   YA
4  Snickers   110      UK  MAT
7  Snickers    90      CN   YA
10 Snickers    95      CN  MAT
2      Twix    50      UK   YA
5      Twix    60      UK  MAT
8      Twix    30      CN   YA
11     Twix    35      CN  MAT
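To illustrate why the base R approach carries over to DF2, it may help to look at the varying list it builds. The sketch below rebuilds DF and DF2 from the Note with data.frame so that it runs on its own, and then only inspects the grouping; it does not perform the reshape.

```r
# Same values as DF and DF2 in the Note
DF <- data.frame(
  Brand = c("Snickers", "Twix", "Skittles"),
  UK__Sales__YA  = c(100, 50, 75),
  UK__Sales__MAT = c(110, 60, 80),
  CN__Sales__YA  = c(90, 30, 105),
  CN__Sales__MAT = c(95, 35, 130)
)
DF2 <- cbind(DF, setNames(10 * DF[2:5], sub("Sales", "Price", names(DF)[2:5])))

# Group the measure columns by the middle piece of their names
varying <- split(names(DF2)[-1], sub(".*__(.*)__.*", "\\1", names(DF2)[-1]))

# One list element per metric, each holding that metric's four column names
str(varying)
```

Each list element becomes one value column in the long data frame, which is why adding a Price metric needs no change to the code.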

Note

Lines <- "   Brand       UK__Sales__YA   UK__Sales__MAT  CN__Sales__YA  CN__Sales__MAT
1  Snickers    100             110            90             95
2  Twix        50              60             30             35
3  Skittles    75              80             105            130"

DF <- read.table(text = Lines)

# same as DF but with additional columns for Price
DF2 <- cbind(DF, setNames(10 * DF[2:5], sub("Sales", "Price", names(DF)[2:5])))

Here is a solution using the package reshape2.

new <- reshape2::melt(data, id.vars = "Brand")
new$Country <- sub("(^[^_]*)_.*$", "\\1", new$variable)
new$Year <- sub("^.*_([[:alpha:]]*$)", "\\1", new$variable)
new <- new[, c(1, 4, 5, 3)]
names(new)[4] <- "Sales"

head(new)
#     Brand Country Year Sales
#1 Snickers      UK   YA   100
#2     Twix      UK   YA    50
#3 Skittles      UK   YA    75
#4 Snickers      UK  MAT   110
#5     Twix      UK  MAT    60
#6 Skittles      UK  MAT    80
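The two sub() patterns can be checked on their own against the column names, without melting anything. This is a small base R sketch; the vector vars is made up for illustration.

```r
# The four measure column names from the question
vars <- c("UK__Sales__YA", "UK__Sales__MAT", "CN__Sales__YA", "CN__Sales__MAT")

# Keep everything before the first underscore
country <- sub("(^[^_]*)_.*$", "\\1", vars)

# Keep the run of letters after the last underscore
year <- sub("^.*_([[:alpha:]]*$)", "\\1", vars)

data.frame(vars, country, year)
```

Note that the second pattern only matches a trailing piece made of letters. A label containing digits would not match, and sub() would return the name unchanged, so the pattern would need adjusting for such columns.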

Data

data <-
structure(list(Brand = c("Snickers", "Twix", "Skittles"), UK__Sales__YA = c(100L, 
50L, 75L), UK__Sales__MAT = c(110L, 60L, 80L), CN__Sales__YA = c(90L, 
30L, 105L), CN__Sales__MAT = c(95L, 35L, 130L)), .Names = c("Brand", 
"UK__Sales__YA", "UK__Sales__MAT", "CN__Sales__YA", "CN__Sales__MAT"
), class = "data.frame", row.names = c("1", "2", "3"))