R: Parsing and storing space-separated hierarchical data with quick search and fast access
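I have a large dataset containing multiple fields whose values are separated by a space.
These fields are grouped into records, and each record can have a variable number of child records, indented by tabs.
The file content looks like this: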
company Samsung
type private
based South Korea
    company Harman International
    type private
    based United States
    industry Electronics
        company JBL
        type subsidiary
        based United States
        industry Audio
company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
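I want to store these records while preserving the hierarchy, and also have a quick way to search and to access each record.
So far I have come up with this approach: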
# reading file from the source
path <- "/path/to/file.txt"
content <- readLines(path, warn = F)
# replaces , with ; so it does not translate it as a separator in next step
content <- gsub(",", ";", content)
# creating list of fields and value
contentList <- read.csv(text=sub(" ", ",", content), header=FALSE)
# replacing ; with , to revert data in right format
contentList$V2 <- gsub(";", ",", contentList$V2)
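After the steps above, contentList looks like this:
For the next step, I thought of using a function that builds a list following these rules:
- If the field has no \t, add it to the list (as a named vector)
- If the field has one or more \t, make it a sub-list of the previous record (as a named vector)
But I don't know how to implement this in R.
How can I implement this?
Or is there a better approach to this problem that allows quick searching and fast access to values?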
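The two rules above can be sketched in base R roughly as follows. This is only an illustrative sketch, not taken from either answer: the function name parse_records, the fields/children layout of the result, and the shortened sample input are assumptions, and it assumes one tab per nesting level and that every record starts with a company line.

```r
# Shortened sample input; each nesting level is assumed to be indented by one tab.
lines <- c(
  "company Samsung", "type private", "based South Korea",
  "\tcompany Harman International", "\ttype private",
  "\t\tcompany JBL", "\t\ttype subsidiary",
  "company Amazaon", "type public"
)

# parse_records() is a hypothetical helper: it returns a nested list in which every
# company has a named vector `fields` and a named list `children`.
parse_records <- function(lines) {
  lines <- lines[nzchar(trimws(lines))]
  depth <- nchar(sub("^(\t*).*$", "\\1", lines))  # number of leading tabs
  body  <- sub("^\t*", "", lines)
  key   <- sub(" .*$", "", body)                  # text before the first space
  value <- sub("^[^ ]+ ?", "", body)              # text after the first space
  is_co <- key == "company"
  rec   <- cumsum(is_co)                          # record number of every line
  rec_depth <- depth[is_co]
  rec_name  <- value[is_co]

  # Parent of a record = nearest earlier record with a smaller depth (0 = top level).
  parent <- vapply(seq_along(rec_depth), function(i) {
    p <- which(rec_depth[seq_len(i)] < rec_depth[i])
    if (length(p)) p[length(p)] else 0L
  }, integer(1))

  build <- function(i) {
    idx  <- which(rec == i & !is_co)
    kids <- which(parent == i)
    children <- lapply(kids, build)
    names(children) <- rec_name[kids]
    list(fields = setNames(value[idx], key[idx]), children = children)
  }

  top <- which(parent == 0L)
  out <- lapply(top, build)
  names(out) <- rec_name[top]
  out
}

tree <- parse_records(lines)
tree$Samsung$children[["Harman International"]]$fields[["type"]]
```

Access by name then follows the hierarchy, for example tree$Samsung$children[["Harman International"]]$children$JBL. For very large files a flat table with a path or parent column, as in the answers, is likely easier to search than a deeply nested list.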
Raw data input
library(readr)  # provides read_lines()
raw <- read_lines("company Samsung
type private
based South Korea
    company Harman International
    type private
    based United States
    industry Electronics
        company JBL
        type subsidiary
        based United States
        industry Audio
company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics")
Build a tibble and get the indenture levels
library(tidyverse)
rawDf <- tibble(RAW = raw)
companyIndenture <- rawDf %>%
filter(str_detect(RAW, "^\\s*company")) %>%
mutate(LVL = case_when(
str_detect(RAW, "^\\s{8}") ~ 3,
str_detect(RAW, "^\\s{4}") ~ 2,
TRUE ~ 1),
COMPANY = str_replace(RAW, "^\\s*company\\s", "")) %>%
select(-RAW)
# Gives us
# A tibble: 4 x 2
# LVL COMPANY
# <dbl> <chr>
# 1 1 Samsung
# 2 2 Harman International
# 3 3 JBL
# 4 1 Amazaon
Clean up the whitespace
Now that we know the LVL of each company, let's strip the leading whitespace.
nextly <- rawDf %>%
mutate(RAW = str_replace(RAW, "^\\s*", "")) %>%
filter(RAW != "") %>%
separate(RAW, c("ATTR", "VALUE"), sep = " ", extra = "merge") %>%
# And bring the LVL back in
left_join(companyIndenture, by = c("VALUE" = "COMPANY")) %>%
select(LVL, ATTR, VALUE)
# A tibble: 15 x 3
# LVL ATTR VALUE
# <dbl> <chr> <chr>
# 1 1 company Samsung
# 2 NA type private
# 3 NA based South Korea
# 4 2 company Harman International
# 5 NA type private
# 6 NA based United States
# 7 NA industry Electronics
# 8 3 company JBL
# 9 NA type subsidiary
# 10 NA based United States
# 11 NA industry Audio
# 12 1 company Amazaon
# 13 NA type public
# 14 NA based United States
# 15 NA industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
Assign the hierarchy
Every company gets an LVL.1, LVL.2, LVL.3 structure. The "" (empty string) values do the trick when we fill().
further <- nextly %>%
mutate(LVL.1 = ifelse(LVL == 1, VALUE, NA_character_),
LVL.2 = case_when(LVL == 1 ~ "",
LVL == 2 ~ VALUE,
TRUE ~ NA_character_),
LVL.3 = ifelse(LVL == 3, VALUE, "")) %>%
fill(starts_with("LVL.")) %>%
filter(ATTR != "company") %>%
select(LVL.1, LVL.2, LVL.3, ATTR, VALUE)
# A tibble: 11 x 5
# LVL.1 LVL.2 LVL.3 ATTR VALUE
# <chr> <chr> <chr> <chr> <chr>
# 1 Samsung "" "" type private
# 2 Samsung "" "" based South Korea
# 3 Samsung Harman International "" type private
# 4 Samsung Harman International "" based United States
# 5 Samsung Harman International "" industry Electronics
# 6 Samsung Harman International JBL type subsidiary
# 7 Samsung Harman International JBL based United States
# 8 Samsung Harman International JBL industry Audio
# 9 Amazaon "" "" type public
# 10 Amazaon "" "" based United States
# 11 Amazaon "" "" industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
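To see why the empty strings matter in the fill() step above, here is a small sketch on toy data. The toy data frame and the names with_reset and without_reset are made up for illustration; only tidyr::fill() itself comes from the answer.

```r
library(tidyr)

# Toy data: one row per input line; attribute rows carry NA until they are filled.
toy <- data.frame(
  ATTR  = c("company", "type", "company", "type", "company", "type"),
  VALUE = c("Samsung", "private", "Harman International", "private",
            "Amazaon", "public"),
  stringsAsFactors = FALSE
)

# With "" on the level-1 company rows, fill() restarts LVL.2 at every new
# top-level company.
with_reset <- toy
with_reset$LVL.2 <- c("", NA, "Harman International", NA, "", NA)
with_reset <- fill(with_reset, LVL.2)

# With NA there instead, the previous subsidiary leaks into the next
# top-level company.
without_reset <- toy
without_reset$LVL.2 <- c(NA, NA, "Harman International", NA, NA, NA)
without_reset <- fill(without_reset, LVL.2)

with_reset$LVL.2
without_reset$LVL.2
```

In the second variant the Amazaon rows end up with "Harman International" in LVL.2, which is the situation the "" placeholders prevent.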
Handling multiple industries for Amazon
Finally, str_split and unnest the comma-separated 'industry' values.
finally <- further %>%
mutate(VALUE = str_split(VALUE, ",\\s*")) %>%
unnest(VALUE)
# A tibble: 14 x 5
# LVL.1 LVL.2 LVL.3 ATTR VALUE
# <chr> <chr> <chr> <chr> <chr>
# 1 Samsung "" "" type private
# 2 Samsung "" "" based South Korea
# 3 Samsung Harman International "" type private
# 4 Samsung Harman International "" based United States
# 5 Samsung Harman International "" industry Electronics
# 6 Samsung Harman International JBL type subsidiary
# 7 Samsung Harman International JBL based United States
# 8 Samsung Harman International JBL industry Audio
# 9 Amazaon "" "" type public
# 10 Amazaon "" "" based United States
# 11 Amazaon "" "" industry Cloud computing
# 12 Amazaon "" "" industry e-commerce
# 13 Amazaon "" "" industry artificial intelligence
# 14 Amazaon "" "" industry consumer electronics
Q.E.D.
LAGNAPPE
further %>%
spread(key = "ATTR", value = "VALUE") %>%
mutate(industry = str_split(industry, ",\\s*")) %>%
unnest(industry)
# A tibble: 7 x 6
LVL.1 LVL.2 LVL.3 based type industry
<chr> <chr> <chr> <chr> <chr> <chr>
1 Amazaon "" "" United States public Cloud computing
2 Amazaon "" "" United States public e-commerce
3 Amazaon "" "" United States public artificial intelligence
4 Amazaon "" "" United States public consumer electronics
5 Samsung "" "" South Korea private NA
6 Samsung Harman International "" United States private Electronics
7 Samsung Harman International JBL United States subsidiary Audio
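Using content from the Note at the end, count the spaces at the start of each company line and use gsubfn to replace them with a level number, giving L2. Then, after removing the leading spaces, replace the first space on each line with a colon, giving L3. The data is now in DCF format, so read it with read.dcf, giving L4.
Now generate an lv variable that gives the level numbers as numeric values, and a sequential numeric id for each row. Compute the id of each row's parent and then build a data frame DF from what we have computed so far. The overall root of the tree is denoted by 0. From DF generate an edge list e for the graph and convert it to an igraph object. From that, generate the simple paths and create a data frame DF2 with the columns paths, company, type, based and industry, so that each row represents one node other than the root.
We computed lv and parent but did not add them to DF2, since you may not need them; add them if you wish.
The code below assumes 4 spaces per indentation level.
There is no limit on the depth of the levels.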
We can search DF2 with ordinary data frame operations for all kinds of text-based queries, e.g.
subset(DF2, grepl("Samsung", paths)) # Samsung and its descendants
or we can run graph queries on g using igraph functions, e.g.
max(length(get.diameter(g))) - 1 # max depth not counting root
or we can query using data.tree functions
dt$height - 1 # max depth not counting root
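As a further illustration of the text-based queries above, the following base R sketch works directly on the paths column. The data frame is retyped from the DF2 output shown in this answer (only two columns are kept); the helpers descendants and lookup and the environment index are hypothetical names introduced here, not part of the answer.

```r
# Paths and company names copied from the DF2 output shown in this answer.
DF2 <- data.frame(
  paths = c("0/Samsung", "0/Samsung/Harman International",
            "0/Samsung/Harman International/JBL", "0/Amazaon"),
  company = c("Samsung", "Harman International", "JBL", "Amazaon"),
  stringsAsFactors = FALSE
)

# Depth of a node = number of "/" separators in its path (the root "0" is not counted).
DF2$depth <- lengths(regmatches(DF2$paths, gregexpr("/", DF2$paths, fixed = TRUE)))

# Strict descendants of a node: rows whose path starts with "<node path>/".
descendants <- function(df, path) df[startsWith(df$paths, paste0(path, "/")), ]

# Lookup of a row by company name through a hashed environment,
# which avoids scanning the whole column on every access.
index <- list2env(setNames(as.list(seq_len(nrow(DF2))), DF2$company), hash = TRUE)
lookup <- function(name) DF2[index[[name]], ]

descendants(DF2, "0/Samsung")$company
lookup("JBL")$paths
max(DF2$depth)
```

Unlike grepl("Samsung", paths), the prefix test only matches nodes below the given path, so a company whose name merely contains the search text elsewhere in the tree is not returned. The environment index assumes company names are unique.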
Code
The code follows.
library(gsubfn)
content <- readLines(textConnection(Lines))
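# The leading "\n" puts an empty line before each record; read.dcf() separates records by empty lines.
L2 <- gsubfn("( *)company", ~ paste0("\nlevel ", nchar(x) / 4L + 1L, "\ncompany"), content)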
L3 <- sub(" ", ":", trimws(readLines(textConnection(L2))))
L4 <- read.dcf(textConnection(L3))
lv <- as.numeric(L4[, 1])
id <- seq_along(lv)
company <- L4[, "company"]
parent <- sapply(id, function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])
DF <- data.frame(id = company[id], parent = c("0", company)[parent+1],
level = lv, L4[, -1], stringsAsFactors = FALSE)
e <- with(DF, cbind(parent, id))
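For a large file, the parent computation above rescans all earlier rows for every row. A linear-time alternative keeps a stack of the currently open ancestors. The sketch below is an assumption-laden illustration rather than part of the answer: parent_by_stack is a hypothetical name, and lv is typed in by hand to match the four companies of the sample data.

```r
# Levels of the sample data: Samsung, Harman International, JBL, Amazaon.
lv <- c(1, 2, 3, 1)

parent_by_stack <- function(lv) {
  parent <- integer(length(lv))
  stack <- integer(0)                    # ids of the currently open ancestors
  for (i in seq_along(lv)) {
    # Close every open node that is not shallower than the current one.
    while (length(stack) > 0 && lv[stack[length(stack)]] >= lv[i]) {
      stack <- stack[-length(stack)]
    }
    # The nearest remaining ancestor is the parent; 0 denotes the root.
    parent[i] <- if (length(stack) > 0) stack[length(stack)] else 0L
    stack <- c(stack, i)
  }
  parent
}

# Reference: the scan-based computation used in the code above.
parent_by_scan <- sapply(seq_along(lv),
                         function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])

parent_by_stack(lv)
all(parent_by_stack(lv) == parent_by_scan)
```

Both versions pick the nearest earlier row with a smaller level, so either result can be used to build DF.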
igraph
Now that we have an edge list, we can create an igraph object and work with it using that package.
library(igraph)
g <- graph_from_edgelist(e)
p <- all_simple_paths(g, "0")
paths <- sapply(p, function(x) paste(names(x), collapse = "/"))
DF2 <- data.frame(paths, L4[, -1], stringsAsFactors = FALSE)
DF2
giving a paths column followed by each node's attributes:
  paths                              company              type       based         industry
1 0/Samsung                          Samsung              private    South Korea   <NA>
2 0/Samsung/Harman International     Harman International private    United States Electronics
3 0/Samsung/Harman International/JBL JBL                  subsidiary United States Audio
4 0/Amazaon                          Amazaon              public     United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
We can plot the graph like this:
plot(g, layout = layout_as_tree(g))
(screenshot of the plotted tree)
data.tree
We can also use data.tree and its many functions to work with this:
library(data.tree)
library(DiagrammeR)
dt <- FromDataFrameNetwork(DF)
print(dt, "type", "based", "industry")
giving:
  levelName                       type       based         industry
1 0
2  ¦--Samsung                     private    South Korea
3  ¦   °--Harman International    private    United States Electronics
4  ¦       °--JBL                 subsidiary United States Audio
5  °--Amazaon                     public     United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
We can plot the data.tree or convert it as follows:
plot(dt) # plot in browser
ToListSimple(dt) # convert to nested list
ToListExplicit(dt) # similar but children in children component
Note
We can create the content reproducibly like this:
Lines <- "
company Samsung
type private
based South Korea
    company Harman International
    type private
    based United States
    industry Electronics
        company JBL
        type subsidiary
        based United States
        industry Audio
company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"
content <- readLines(textConnection(Lines))