Tibble columns of class tibble instead of class data frame
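What would be a tidy way of having tibble columns of class tibble (instead of class list or data.frame)?
It is clearly possible to have columns of class data.frame in a tibble (see the example below), but none of the "tidy ways of data manipulation" (i.e. dplyr::mutate() or purrr::map*_df()) seem to work for me when trying to cast the columns to tibble instead of data.frame.
Current output of jsonlite::fromJSON()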
# 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Desired result
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
Why having data.frame columns can be very misleading
https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/
Related
- Recursively ensuring tibbles instead of data frames when parsing/manipulating nested JSON
- Ensure that data frames become tibbles when reading MongoDB data with {mongolite}
Example
Example data
library(magrittr)
json <- '[
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 1,
"z": true
},
{
"x": "B",
"y": 2,
"z": false
}
]
}
},
"schema": "0.0.1"
},
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 10,
"z": false
},
{
"x": "B",
"y": 20,
"z": true
}
]
}
},
"schema": "0.0.1"
}
]'
When visualized, you can see the objects (which map to data.frames) and the arrays (which map to lists):
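The mapping can also be checked directly in code. The following is a minimal sketch, assuming the jsonlite package is installed; json_small and parsed are hypothetical names, and the JSON is a compacted copy of the example data with the z field left out.

```r
# Compact JSON with the same shape as the example data (json_small is a
# hypothetical name; the "z" field is omitted for brevity).
json_small <- '[
  {"labels": ["label-a", "label-b"],
   "levelOne": {"levelTwo": {"levelThree": [{"x": "A", "y": 1}, {"x": "B", "y": 2}]}},
   "schema": "0.0.1"},
  {"labels": ["label-a", "label-b"],
   "levelOne": {"levelTwo": {"levelThree": [{"x": "A", "y": 10}, {"x": "B", "y": 20}]}},
   "schema": "0.0.1"}
]'

parsed <- jsonlite::fromJSON(json_small)

# JSON objects come back as (nested) data frames ...
objects_are_df <- c(
  levelOne = is.data.frame(parsed$levelOne),
  levelTwo = is.data.frame(parsed$levelOne$levelTwo)
)

# ... while JSON arrays inside a record come back as plain lists,
# with one list element per record.
level_three <- parsed$levelOne$levelTwo$levelThree
arrays_are_list <- c(
  labels     = is.list(parsed$labels) && !is.data.frame(parsed$labels),
  levelThree = is.list(level_three) && !is.data.frame(level_three)
)

# Each element of levelThree is an array of objects, i.e. a data frame again.
elements_are_df <- vapply(level_three, is.data.frame, logical(1))

print(objects_are_df)
print(arrays_are_list)
print(elements_are_df)
```

This only inspects classes; it does not change the parsed structure.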
Parsing JSON and converting to tibble
x <- json %>%
jsonlite::fromJSON() %>%
tibble::as_tibble()
x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
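So it is clearly possible to have columns of class data.frame.
Converting data.frame to tibble columns: "the bad way"
But I want tibbles instead of data frames, so let's try the only thing that I got to work: explicitly re-assigning the respective list levels (or data frame/tibble columns, to be more precise):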
# Make a copy so we don't mess with the initial state of `x`
y <- x
y$levelOne <- y$levelOne %>%
tibble::as_tibble()
y$levelOne$levelTwo <- y$levelOne$levelTwo %>%
tibble::as_tibble()
y$levelOne$levelTwo$levelThree <- y$levelOne$levelTwo$levelThree %>%
purrr::map(tibble::as_tibble)
y %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
This works, but it does not fit with "tidy data manipulation pipes".
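The manual re-assignment generalizes to a small recursive helper, so the levels do not have to be spelled out by hand. This is only a sketch and not something from the original post: as_tibble_recursive and json_small are hypothetical names, it swaps in plain recursion instead of a pipe-based approach, and it assumes jsonlite and tibble are installed.

```r
# Hypothetical helper: walk a parsed JSON structure and turn every data frame
# it finds (at any level) into a tibble. Plain lists are traversed, everything
# else is returned unchanged.
as_tibble_recursive <- function(x) {
  if (is.data.frame(x)) {
    x <- tibble::as_tibble(x)
    # Same idea as the explicit `y$levelOne <- ...` re-assignments,
    # applied to every column in turn.
    for (nm in names(x)) {
      x[[nm]] <- as_tibble_recursive(x[[nm]])
    }
    x
  } else if (is.list(x)) {
    lapply(x, as_tibble_recursive)
  } else {
    x
  }
}

# Compacted copy of the example data (the "z" field is omitted for brevity).
json_small <- '[
  {"labels": ["label-a", "label-b"],
   "levelOne": {"levelTwo": {"levelThree": [{"x": "A", "y": 1}, {"x": "B", "y": 2}]}},
   "schema": "0.0.1"},
  {"labels": ["label-a", "label-b"],
   "levelOne": {"levelTwo": {"levelThree": [{"x": "A", "y": 10}, {"x": "B", "y": 20}]}},
   "schema": "0.0.1"}
]'

result <- as_tibble_recursive(jsonlite::fromJSON(json_small))
str(result)
```

The helper can sit at the end of a pipe, e.g. json %>% jsonlite::fromJSON() %>% as_tibble_recursive(), although it is still an explicit loop under the hood.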
Converting data.frame to tibble columns: "the better way" (tried and failed)
# Yet another copy so we can compare:
z <- x
# Just to check that this works
z$levelOne %>%
tibble::as_tibble()
# # A tibble: 2 x 1
# levelTwo$levelThree
# <list>
# 1 <df[,3] [2 × 3]>
# 2 <df[,3] [2 × 3]>
# Trying to get this to work with `dplyr::mutate()` fails:
z %>%
dplyr::mutate(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
dplyr::transmute(levelOne = levelOne %>%
tibble::as_tibble()
)
# Error: Column `levelOne` is of unsupported class data.frame
# Same goes for `{purrr}`:
z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map_df(tibble::as_tibble)
)
# Error: Column `levelOne` is of unsupported class data.frame
z %>%
tibble::add_column(levelOne = z$levelOne %>% tibble::as_tibble())
# Error: Can't add duplicate columns with `add_column()`:
# * Column `levelOne` already exists in `.data`.
# Works, but not what I want:
z %>%
tibble::add_column(test = z$levelOne %>% tibble::as_tibble()) %>%
str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 4 variables:
# [...]
# $ test :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# ..$ levelTwo:'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
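The "unsupported class data.frame" errors above look like the behaviour of older dplyr releases. As far as I know, dplyr 1.0.0 and later accept data frame columns in mutate(), so the first attempt may simply work there. The sketch below rests on that assumption (dplyr >= 1.0.0, plus jsonlite and tibble); json_small and z_fixed are hypothetical names.

```r
# Compacted copy of the example data (the "z" field is omitted for brevity).
json_small <- '[
  {"labels": ["label-a", "label-b"],
   "levelOne": {"levelTwo": {"levelThree": [{"x": "A", "y": 1}, {"x": "B", "y": 2}]}},
   "schema": "0.0.1"},
  {"labels": ["label-a", "label-b"],
   "levelOne": {"levelTwo": {"levelThree": [{"x": "A", "y": 10}, {"x": "B", "y": 20}]}},
   "schema": "0.0.1"}
]'

z <- tibble::as_tibble(jsonlite::fromJSON(json_small))

# Assumption: dplyr >= 1.0.0, where a named data frame result becomes a
# data frame column instead of raising an error.
z_fixed <- dplyr::mutate(z, levelOne = tibble::as_tibble(levelOne))

# Only the top-level column is converted; as_tibble() is not recursive,
# so the levels below levelOne are still plain data frames.
top_is_tibble    <- inherits(z_fixed$levelOne, "tbl_df")
nested_is_tibble <- inherits(z_fixed$levelOne$levelTwo, "tbl_df")
print(c(top = top_is_tibble, nested = nested_is_tibble))
```

Even where this runs, it only fixes one level per mutate() call, so the deeper levels still need their own treatment.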
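The only thing that works (but is not what we want)
Wrapping tibble::as_tibble() with purrr::map() seems to work, but the result is clearly not what we want, as we duplicate everything below levelOne (compare with the desired output above)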
# Works, but not what I want:
z_new <- z %>%
dplyr::mutate(levelOne = levelOne %>%
purrr::map(tibble::as_tibble)
)
z_new %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 1 variable:
# .. ..$ levelThree:List of 2
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
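One likely reason for the duplication, offered as an assumption rather than something confirmed in the post: purrr::map() treats a data frame as a list of columns, not of rows. Mapping over levelOne therefore yields one element per column (just levelTwo), which then has to be stretched to fit the two rows. The sketch below only demonstrates the column-wise iteration; outer and mapped are hypothetical names.

```r
# A data frame with three rows and two columns, one of which is itself a
# data frame (similar in spirit to `levelOne` holding `levelTwo`).
outer <- data.frame(id = 1:3)
outer$inner <- data.frame(v = c("a", "b", "c"))

# map() visits each *column* once, regardless of the number of rows.
mapped <- purrr::map(outer, class)

print(names(mapped))   # one entry per column
print(length(mapped))  # equals ncol(outer), not nrow(outer)
```

With a single data frame column, the mapped result has length one, which is consistent with the same content showing up once per row in the output above.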
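Edit (follow-up investigation)
Solved with Hendrik's help!
Still, IMO the topic raises some interesting follow-up questions about whether one should, or even could, do it this way if the main goal is to end up with tidy nested tibbles that work with tidyr::unnest() and tidyr::nest() (see the comments on Hendrik's answer below).
Regarding the approach proposed in https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/: I might be overlooking something obvious, but I think it only works for JSON docs that contain a single document.
First, let's modify df_to_tibble() (see Hendrik's answer below) so that it only turns "leaf" data frames into tibbles, while turning "branch" data frames into lists: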
leaf_df_to_tibble <- function(x) {
if (is.data.frame(x)) {
if (!any(purrr::map_lgl(x, is.list))) {
# Only captures "leaf" DFs:
tibble::as_tibble(x)
} else {
as.list(x)
}
} else {
x
}
}
This gives us a result that is in line with the way suggested in the blog post, but only for "single object" JSON docs, as illustrated below
df <- json %>% jsonlite::fromJSON()
# Only take the first object from the parsed JSON:
df_subset <- df[1, ]
Converting df_subset:
df_subset_tibble <- purrr::reduce(
0:purrr::vec_depth(df_subset),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df_subset
) %>%
tibble::as_tibble()
df_subset_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 3 variables:
# $ labels :List of 1
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 1
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 1
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# $ schema : chr "0.0.1"
Converting df:
df_tibble <- purrr::reduce(
0:purrr::vec_depth(df),
function(x, depth) {
purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
},
.init = df
) %>%
tibble::as_tibble()
df_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# $ labels :List of 2
# ..$ : chr "label-a" "label-b"
# ..$ : chr "label-a" "label-b"
# $ levelOne:List of 2
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# ..$ levelTwo:List of 1
# .. ..$ levelThree:List of 2
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 1 2
# .. .. .. ..$ z: logi TRUE FALSE
# .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of 3 variables:
# .. .. .. ..$ x: chr "A" "B"
# .. .. .. ..$ y: int 10 20
# .. .. .. ..$ z: logi FALSE TRUE
# $ schema : chr "0.0.1" "0.0.1"
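As we can see, "listifying" nested JSON structures can actually lead to duplicated "leafs". It does not jump out at you as long as n = 1 (the number of JSON docs), but it strikes you as soon as n > 1.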
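Background
The comments above raise some valid points. Nevertheless, I do believe there is a way to achieve what you are after (whether it is a particularly good idea is less clear) by combining three functions from the purrr package:
- purrr::vec_depth lets us get the (nested) depth of a given list,
- purrr::modify_depth lets us apply a function at a specified level of depth of a list, and
- purrr::reduce lets us apply a function iteratively, passing the result of each iteration as input to the next iteration.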
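To see how these pieces fit together before applying them to the JSON data, here is a toy sketch on a small nested list. It assumes purrr is installed; nested, double_if_numeric and result are hypothetical names. The depths are written out as 1:2 instead of being computed, partly because purrr 1.0.0 and later deprecate vec_depth() in favour of pluck_depth().

```r
# A small nested list: two branches, numeric leaves at depth 2.
nested <- list(a = list(b = 1, c = 2), d = list(e = 3))

# Hypothetical helper: only touches numeric values, leaves everything else
# (e.g. the intermediate lists) unchanged.
double_if_numeric <- function(x) if (is.numeric(x)) x * 2 else x

# reduce() carries the accumulated list from one depth to the next;
# modify_depth() applies the helper at the given depth only.
result <- purrr::reduce(
  1:2,
  function(acc, depth) {
    purrr::modify_depth(acc, depth, double_if_numeric, .ragged = TRUE)
  },
  .init = nested
)

str(result)
```

At depth 1 the helper sees the inner lists and leaves them alone; at depth 2 it sees the numbers and doubles them, while the result of the first pass is kept as the input to the second.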
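General approach
Essentially, we want to convert any data.frame found at any level of the list into a tibble. This is easy to do with a few rounds of purrr::modify_depth, where we simply change the depth according to the level of the list we want to target. Crucially, though, we want to do this in such a way that changes made to level 1 are preserved when we move on to target level 2, changes made to levels 1 and 2 are preserved when we move on to level 3, and so on. This is where purrr::reduce comes in: each time we apply purrr::modify_depth to convert data.frames to tibbles, we make sure that the resulting output is passed as input to the next iteration. This is illustrated in the MWE below.
MWE
Start with the basic setup of libraries and data structure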
#> Load libraries ----
library(tidyverse)
json <- '[
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 1,
"z": true
},
{
"x": "B",
"y": 2,
"z": false
}
]
}
},
"schema": "0.0.1"
},
{
"labels": ["label-a", "label-b"],
"levelOne": {
"levelTwo": {
"levelThree": [
{
"x": "A",
"y": 10,
"z": false
},
{
"x": "B",
"y": 20,
"z": true
}
]
}
},
"schema": "0.0.1"
}
]'
# convert json to a nested data.frame
df <- jsonlite::fromJSON(json)
Now we create a simple helper function that conditionally converts a data.frame to a tibble
# define a simple function to convert data.frame to tibble
df_to_tibble <- function(x) {
if (is.data.frame(x)) as_tibble(x) else x
}
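Now for the key routine: with df as the initial starting point (.init = df), apply the df_to_tibble function at every level of depth of df (0:purrr::vec_depth(df)) using purrr::modify_depth. Using purrr::reduce ensures that the result of each iteration is passed as input to the subsequent iteration.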
# create df_tibble by reducing the result of applying df_to_tibble to each level
# of df via purrr's modify_depth function %>% lastly, ensure that the top level
# data.frame is also converted to a tibble
df_tibble <- purrr::reduce(
0:purrr::vec_depth(df),
function(x, depth) {
purrr::modify_depth(x, depth, df_to_tibble, .ragged = TRUE)
},
.init = df
) %>%
as_tibble()
# show the structure of df_tibble
str(df_tibble)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> $ labels :List of 2
#> ..$ : chr "label-a" "label-b"
#> ..$ : chr "label-a" "label-b"
#> $ levelOne:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 1 variable:
#> ..$ levelTwo:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 1 variable:
#> .. ..$ levelThree:List of 2
#> .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> .. .. .. ..$ x: chr "A" "B"
#> .. .. .. ..$ y: int 1 2
#> .. .. .. ..$ z: logi TRUE FALSE
#> .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 3 variables:
#> .. .. .. ..$ x: chr "A" "B"
#> .. .. .. ..$ y: int 10 20
#> .. .. .. ..$ z: logi FALSE TRUE
#> $ schema : chr "0.0.1" "0.0.1"