Tibble columns of class tibble instead of class data frame

What would be a tidy way of having tibble columns of class tibble (instead of class list or data.frame)?

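It is clearly possible to have columns of class data.frame in tibbles (see the example below), but none of the "tidy ways of data manipulation" (i.e. dplyr::mutate() or purrr::map*_df()) seem to work for me when trying to cast the column to tibble instead of data.frame.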

Current output of jsonlite::fromJSON()

# 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

Desired result

# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

Why having data.frame columns can be very misleading

https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/

Example

Example data

library(magrittr)

json <- '[
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
          ]
      }
    },
    "schema": "0.0.1"
  },
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 10,
            "z": false
          },
          {
            "x": "B",
            "y": 20,
            "z": true
          }
          ]
      }
    },
    "schema": "0.0.1"
  }
]'

The JSON above combines objects (which jsonlite maps to data.frames) and arrays (which it maps to lists).
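To make the object/array mapping concrete, here is a minimal sketch on a much smaller document. The JSON string and the names toy_json, tags and meta are made up for illustration and are not part of the data above.

```r
library(jsonlite)

# Two records: "tags" is a JSON array, "meta" is a JSON object
toy_json <- '[
  {"tags": ["a", "b"], "meta": {"version": 1}},
  {"tags": ["c"],      "meta": {"version": 2}}
]'

parsed <- fromJSON(toy_json)

class(parsed)       # the outer array of objects is simplified to a data.frame
class(parsed$tags)  # the arrays inside the records end up in a list column
class(parsed$meta)  # the objects inside the records end up in a data.frame column
```

The same pattern explains the str() output in this post: labels and levelThree are lists, while levelOne and levelTwo are data.frame columns.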

Parsing the JSON and converting to tibble

x <- json %>% 
  jsonlite::fromJSON() %>% 
  tibble::as_tibble()

x %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

So apparently it is possible to have columns of class data.frame.
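The same situation can be reproduced without any JSON. The sketch below assumes a reasonably recent tibble version in which a named data frame argument to tibble() is kept as a single data-frame column; inner and outer are illustrative names only.

```r
library(tibble)

inner <- data.frame(a = 1:2, b = c("u", "v"))

# A named data frame argument is kept as one data-frame column
outer <- tibble(id = c("r1", "r2"), inner = inner)

ncol(outer)                 # `inner` counts as a single column
is.data.frame(outer$inner)  # the column itself is still a data frame
names(outer$inner)          # with its own columns nested inside
```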

Casting data.frame columns to tibble columns: "the bad way"

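But I would like tibbles instead of data frames, so let's try the only thing I got to work: explicitly re-assigning the respective list levels or, more precisely, the data frame/tibble columns: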

# Make a copy so we don't mess with the initial state of `x`
y <- x

y$levelOne <- y$levelOne %>% 
  tibble::as_tibble()
y$levelOne$levelTwo <- y$levelOne$levelTwo %>% 
  tibble::as_tibble()
y$levelOne$levelTwo$levelThree <- y$levelOne$levelTwo$levelThree %>% 
  purrr::map(tibble::as_tibble)

# Inspect the modified copy `y` (not the untouched `x`)
y %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

Works, but is not in line with "tidy data manipulation pipes".

Casting data.frame columns to tibble columns: "the better way" (tried and failed)

# Yet another copy so we can compare:
z <- x

# Just to check that this works
z$levelOne %>% 
    tibble::as_tibble()
# # A tibble: 2 x 1
#   levelTwo$levelThree
#   <list>             
# 1 <df[,3] [2 × 3]>   
# 2 <df[,3] [2 × 3]>   

# Trying to get this to work with `dplyr::mutate()` fails:
z %>% 
  dplyr::mutate(levelOne = levelOne %>% 
    tibble::as_tibble()
  )
# Error: Column `levelOne` is of unsupported class data.frame

z %>% 
  dplyr::transmute(levelOne = levelOne %>% 
    tibble::as_tibble()
  )
# Error: Column `levelOne` is of unsupported class data.frame

# Same goes for `{purrr}`:
z %>% 
  dplyr::mutate(levelOne = levelOne %>% 
    purrr::map_df(tibble::as_tibble)
  )
# Error: Column `levelOne` is of unsupported class data.frame

z %>% 
  tibble::add_column(levelOne = z$levelOne %>% tibble::as_tibble())
# Error: Can't add duplicate columns with `add_column()`:
# * Column `levelOne` already exists in `.data`.

# Works, but not what I want:
z %>% 
  tibble::add_column(test = z$levelOne %>% tibble::as_tibble()) %>% 
  str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  4 variables:
#  [...]
#  $ test    :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  1 variable:
#   ..$ levelTwo:'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
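The "unsupported class data.frame" errors above look like they come from the dplyr release in use when this was written. As far as I know, dplyr 1.0.0 and later accept data-frame columns in mutate(), so the direct conversion may simply work there. A minimal sketch under that assumption, on made-up data (toy and nested are illustrative names):

```r
library(dplyr)
library(tibble)

# A tibble with one data-frame column (assumes dplyr >= 1.0.0)
toy <- tibble(id = 1:2, nested = data.frame(value = c(10, 20)))

converted <- toy %>%
  mutate(nested = as_tibble(nested))

# Check whether the column is now a tibble
inherits(converted$nested, "tbl_df")
```

This only converts one level; deeper levels would still need their own conversion.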

The only thing that works (but is not what we want)

Wrapping tibble::as_tibble() in purrr::map() seems to work, but the result is clearly not what we want, as we duplicate everything below levelOne (compare with the desired result above):

# Works, but not what I want:
z_new <- z %>% 
  dplyr::mutate(levelOne = levelOne %>% 
    purrr::map(tibble::as_tibble)
  )

z_new %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:List of 2
#   ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#   ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':  2 obs. of  1 variable:
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"
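A likely reason for the duplication: purrr::map() treats a data frame as a list of columns and returns a plain list. levelOne has a single column, so the result is a list of length one, which mutate() then appears to recycle across both rows. The sketch below only demonstrates the first part on made-up data (one_col is an illustrative name):

```r
library(purrr)

one_col <- data.frame(levelTwo = c("p", "q"))  # 2 rows, 1 column

mapped <- map(one_col, toupper)

class(mapped)    # a plain list, no longer a data frame
length(mapped)   # one element per column, not one per row
names(mapped)    # the column name is kept as the element name
```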

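EDIT (follow-up investigation)

Got it working with the help of Hendrik!

Still, IMO the topic raises some interesting follow-up questions as to whether one should, or even could, do it this way if the main goal is to end up with tidy nested tibbles that work with tidyr::unnest() and tidyr::nest() (see the comments on Hendrik's answer below).

Regarding the approach proposed in https://hendrikvanb.gitlab.io/2018/07/nested_data-json_to_tibble/: I might be overlooking something obvious, but I think it only works for JSON docs containing a single document (object).

First, let's modify df_to_tibble() (see Hendrik's answer below) so that it only turns "leaf" data frames into tibbles, while turning "branch" data frames into lists: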

leaf_df_to_tibble <- function(x) {
  if (is.data.frame(x)) {
    if (!any(purrr::map_lgl(x, is.list))) { 
      # Only captures "leaf" DFs:
      tibble::as_tibble(x) 
    } else {
      as.list(x)
    }
  } else {
    x
  }
}
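A quick check of the helper on two hand-built inputs. The function is copied from above so the snippet runs on its own; leaf and branch are made-up objects, not the parsed JSON.

```r
library(purrr)
library(tibble)

# Copied from the definition above
leaf_df_to_tibble <- function(x) {
  if (is.data.frame(x)) {
    if (!any(purrr::map_lgl(x, is.list))) {
      tibble::as_tibble(x)
    } else {
      as.list(x)
    }
  } else {
    x
  }
}

leaf <- data.frame(x = c("A", "B"), y = 1:2)  # atomic columns only
branch <- data.frame(id = 1:2)
branch$items <- list(1:3, 4:6)                # a list column makes it a "branch"

inherits(leaf_df_to_tibble(leaf), "tbl_df")   # leaf: converted to a tibble
is.data.frame(leaf_df_to_tibble(branch))      # branch: demoted to a plain list
identical(leaf_df_to_tibble("text"), "text")  # anything else: returned unchanged
```

Note that is.list() is also TRUE for a nested data.frame column, so a data frame containing other data frames counts as a "branch" as well.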

This gives us results in line with the approach suggested in the blog post, but only for "single object" JSON documents, as illustrated below.

df <- json %>% jsonlite::fromJSON()

# Only take the first object from the parsed JSON:
df_subset <- df[1, ]

Converting df_subset:

df_subset_tibble <- purrr::reduce(
  0:purrr::vec_depth(df_subset),
  function(x, depth) {
    purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
  }, 
  .init = df_subset
) %>% 
  tibble::as_tibble()

df_subset_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of  3 variables:
#  $ labels  :List of 1
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:List of 1
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 1
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#  $ schema  : chr "0.0.1"

Converting df:

df_tibble <- purrr::reduce(
  0:purrr::vec_depth(df),
  function(x, depth) {
    purrr::modify_depth(x, depth, leaf_df_to_tibble, .ragged = TRUE)
  }, 
  .init = df
) %>% 
  tibble::as_tibble()

df_tibble %>% str()
# Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 2 obs. of  3 variables:
#  $ labels  :List of 2
#   ..$ : chr  "label-a" "label-b"
#   ..$ : chr  "label-a" "label-b"
#  $ levelOne:List of 2
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#   ..$ levelTwo:List of 1
#   .. ..$ levelThree:List of 2
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  1 2
#   .. .. .. ..$ z: logi  TRUE FALSE
#   .. .. ..$ :Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    2 obs. of  3 variables:
#   .. .. .. ..$ x: chr  "A" "B"
#   .. .. .. ..$ y: int  10 20
#   .. .. .. ..$ z: logi  FALSE TRUE
#  $ schema  : chr  "0.0.1" "0.0.1"

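As we can see, "listifying" nested JSON structures can actually lead to duplication of the "leafs". It just doesn't jump out at you as long as n = 1 (the number of JSON documents), but it strikes you as soon as n > 1.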

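Background

The comments above raise some valid points. Nevertheless, I do believe there is a way to achieve what you are after (whether it is a particularly good idea is less clear) by combining three functions from the purrr package: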

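  1. purrr::vec_depth allows us to get the (nesting) depth of a given list,
  2. purrr::modify_depth allows us to apply a function to a list at a specified depth level, and
  3. purrr::reduce allows us to apply a function iteratively, passing the result of each iteration as input to the subsequent iteration.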
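A small sketch of the three functions in isolation, on a made-up nested list (nested and summed are illustrative names). In recent purrr releases vec_depth() is, as far as I know, deprecated in favour of pluck_depth(), so a deprecation warning may appear.

```r
library(purrr)

nested <- list(a = list(b = 1:3), c = list(d = 4:6))

# 1. Nesting depth: list -> list -> atomic vector
vec_depth(nested)

# 2. Apply sum() to the elements two levels down
summed <- modify_depth(nested, 2, sum)
summed$a$b
summed$c$d

# 3. Fold a vector step by step: ((1 + 2) + 3) + 4
reduce(1:4, `+`)
```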

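General approach

Essentially, we want to convert any data.frame found at any level of the list into a tibble. This is easy to do with several rounds of purrr::modify_depth, changing only the depth to match the list level we want to target. Crucially, though, we want to do it in such a way that the changes made at level 1 are preserved when we move on to target level 2, the changes made at levels 1 and 2 are preserved when we move on to level 3, and so on. This is where purrr::reduce comes in: each time we apply purrr::modify_depth to convert data.frames to tibbles, we make sure the resulting output is passed as the input to the next iteration. This is illustrated in the MWE below.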
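The state-passing part can be seen in isolation with a toy reducer. This is only a sketch of how .init and the accumulated value travel through purrr::reduce; the strings and the name visited are made up.

```r
library(purrr)

# `state` starts as .init and is afterwards whatever the previous step returned
visited <- reduce(
  0:2,
  function(state, depth) paste0(state, " -> depth ", depth),
  .init = "start"
)

visited
# "start -> depth 0 -> depth 1 -> depth 2"
```

In the actual routine, state is the partially converted data structure and each step converts one more level.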

MWE

Start with the basic setup of libraries and data structure

#> Load libraries ----
library(tidyverse)

json <- '[
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 1,
            "z": true
          },
          {
            "x": "B",
            "y": 2,
            "z": false
          }
          ]
      }
    },
    "schema": "0.0.1"
  },
  {
    "labels": ["label-a", "label-b"],
    "levelOne": {
      "levelTwo": {
        "levelThree": [
          {
            "x": "A",
            "y": 10,
            "z": false
          },
          {
            "x": "B",
            "y": 20,
            "z": true
          }
          ]
      }
    },
    "schema": "0.0.1"
  }
]'  

# convert json to a nested data.frame
df <- jsonlite::fromJSON(json)

Now we will create a simple helper function that conditionally converts a data.frame to a tibble

# define a simple function to convert data.frame to tibble
df_to_tibble <- function(x) {
  if (is.data.frame(x)) as_tibble(x) else x
}
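A short usage sketch for the helper, with the function repeated so the snippet runs on its own. It also shows why applying it repeatedly at overlapping levels should be harmless: an input that is already a tibble simply stays a tibble.

```r
library(tibble)

# Same helper as above
df_to_tibble <- function(x) {
  if (is.data.frame(x)) as_tibble(x) else x
}

once <- df_to_tibble(data.frame(a = 1:2))
twice <- df_to_tibble(once)

inherits(once, "tbl_df")             # data.frame input becomes a tibble
inherits(twice, "tbl_df")            # tibble input stays a tibble
identical(df_to_tibble(1:3), 1:3)    # non-data-frame input is returned as-is
```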

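Now for the key routine: taking df as the initial starting point (.init = df), apply the df_to_tibble function at each level of df (0:purrr::vec_depth(df)) using purrr::modify_depth. Use purrr::reduce to ensure that the result of each iteration is passed as input to the subsequent iteration.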

# create df_tibble by reducing the result of applying df_to_tibble to each level
# of df via purrr's modify_depth function; lastly, ensure that the top-level
# data.frame is also converted to a tibble
df_tibble <- purrr::reduce(
  0:purrr::vec_depth(df),
  function(x, depth) {
    purrr::modify_depth(x, depth, df_to_tibble, .ragged = TRUE)
  }, 
  .init = df
) %>% 
  as_tibble()
# show the structure of df_tibble
str(df_tibble)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  3 variables:
#>  $ labels  :List of 2
#>   ..$ : chr  "label-a" "label-b"
#>   ..$ : chr  "label-a" "label-b"
#>  $ levelOne:Classes 'tbl_df', 'tbl' and 'data.frame':    2 obs. of  1 variable:
#>   ..$ levelTwo:Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of  1 variable:
#>   .. ..$ levelThree:List of 2
#>   .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   .. .. .. ..$ x: chr  "A" "B"
#>   .. .. .. ..$ y: int  1 2
#>   .. .. .. ..$ z: logi  TRUE FALSE
#>   .. .. ..$ :Classes 'tbl_df', 'tbl' and 'data.frame':   2 obs. of  3 variables:
#>   .. .. .. ..$ x: chr  "A" "B"
#>   .. .. .. ..$ y: int  10 20
#>   .. .. .. ..$ z: logi  FALSE TRUE
#>  $ schema  : chr  "0.0.1" "0.0.1"