How to reshape a horizontal row into multiple rows across a large dataset
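We have a large dataset that we want to edit and analyse, but before we can start we need to convert the data into a more practical format for statistical analysis.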
Incorrect format dataframe:

```
library(tidyverse)
data <-
  tribble(~id, ~date, ~start, ~end, ~start, ~end, ~start, ~end,
          1001, "01/07/2019", "04:00", "08:00", "10:00", "15:00", "16:00", "20:00",
          1001, "02/07/2019", "04:30", "05:30", "09:00", "14:00", "17:00", "21:00",
          1009, "05/07/2019", "03:00", "05:00", "07:00", "14:00", "15:00", "19:00",
          1009, "07/07/2019", "03:30", "04:30", "08:20", "15:20", "16:30", "20:30")
```

Correct format dataframe:

```
# id   date       start end
# 1001 01/07/2019 04:00 08:00
# 1001 01/07/2019 10:00 15:00
# 1001 01/07/2019 16:00 20:00
# 1001 02/07/2019 04:30 05:30
# 1001 02/07/2019 09:00 14:00
# 1001 02/07/2019 17:00 21:00
# 1009 05/07/2019 03:00 05:00
# 1009 05/07/2019 07:00 14:00
# 1009 05/07/2019 15:00 19:00
# 1009 07/07/2019 03:30 04:30
# 1009 07/07/2019 08:20 15:20
# 1009 07/07/2019 16:30 20:30
```
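I can reshape my data by hand, but I can't work out how to automate it. The actual dataset has 32 columns and 10,000 rows. Edit: I tried concatenating id and date onto each value and then sorting, but that approach produced errors.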
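Next time, it would help if you posted a reproducible example of your data (like the one in my code below).
It looks like you want to convert your data from wide format to a long format. The duplicated column names cause some trouble, but the code below should do the trick. You need the tidyverse package installed for this: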
library(tidyverse)

data <-
  tribble(~id, ~date, ~start, ~end, ~start, ~end, ~start, ~end,
          1001, "01/07/2019", "04:00", "08:00", "10:00", "15:00", "16:00", "20:00",
          1001, "02/07/2019", "04:30", "05:30", "09:00", "14:00", "17:00", "21:00",
          1009, "05/07/2019", "03:00", "05:00", "07:00", "14:00", "15:00", "19:00",
          1009, "07/07/2019", "03:30", "04:30", "08:20", "15:20", "16:30", "20:30")

# make column names unique by appending the column position (start_3, end_4, ...)
names(data) <-
  ifelse(names(data) %in% c("start", "end"),
         paste0(names(data), "_", seq_along(names(data))),
         names(data))

# turn data into long format
data %>%
  gather(key, value, -id, -date) %>%
  arrange(id, date) %>%
  # get rid of the column suffixes (note the doubled backslash in the R string)
  mutate(key = str_replace_all(key, pattern = c("_\\d+" = ""))) %>%
  # number the start/end pairs within each id and date
  group_by(id, date, key) %>%
  mutate(obs_id = row_number()) %>%
  # put start and end back into their own columns
  spread(key, value) %>%
  ungroup() %>%
  select(id, date, start, end)
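
If your tidyr is version 1.0.0 or later, the same reshape can probably be written more compactly with `pivot_longer()`, which has superseded `gather()`/`spread()`. The following is only a sketch under a few assumptions: the names `wide`, `long`, `occurrence` and `slot` are made up for illustration, the sample is built as a base `data.frame` so that the duplicated names can be assigned directly, and every repeated column is assumed to be called exactly `start` or `end`.

```r
library(tidyr)
library(dplyr)

# Hypothetical stand-in for the real data. A base data.frame is used because
# base R allows the duplicated column names to be assigned afterwards.
wide <- data.frame(
  id   = c(1001, 1001, 1009, 1009),
  date = c("01/07/2019", "02/07/2019", "05/07/2019", "07/07/2019"),
  s1 = c("04:00", "04:30", "03:00", "03:30"),
  e1 = c("08:00", "05:30", "05:00", "04:30"),
  s2 = c("10:00", "09:00", "07:00", "08:20"),
  e2 = c("15:00", "14:00", "14:00", "15:20"),
  s3 = c("16:00", "17:00", "15:00", "16:30"),
  e3 = c("20:00", "21:00", "19:00", "20:30"),
  stringsAsFactors = FALSE
)
names(wide) <- c("id", "date", rep(c("start", "end"), times = 3))

# Make the repeated names unique by appending their occurrence number,
# giving start_1, end_1, start_2, end_2, ...
nm <- names(wide)
occurrence <- ave(seq_along(nm), nm, FUN = seq_along)
repeated <- nm %in% c("start", "end")
nm[repeated] <- paste0(nm[repeated], "_", occurrence[repeated])
names(wide) <- nm

# ".value" tells pivot_longer() to keep the part before "_" as the output
# column name (start / end); the part after "_" goes into a helper column.
long <- wide %>%
  pivot_longer(cols = -c(id, date),
               names_to = c(".value", "slot"),
               names_sep = "_") %>%
  select(-slot)

print(long)
```

Because the suffix is the occurrence number rather than the column position, each `start` lines up with its matching `end`, so this should carry over to the 32-column case as long as the columns come in start/end pairs.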