R: Parsing and storing space-separated hierarchical data with fast search and fast access

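I have a large data set with multiple fields whose values are separated by a space. These fields are combined into a record, and each record can have a variable number of child records, which are indented with tabs.

The file content looks like this: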

company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics

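I want to store these records while preserving the hierarchy, and have a way to search them quickly and access each record.

So far I have come up with this approach: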

# reading file from the source
path <- "/path/to/file.txt"
content <- readLines(path, warn = FALSE)


# replaces , with ; so it does not translate it as a separator in next step
content <- gsub(",", ";", content)

# creating list of fields and value
contentList <- read.csv(text=sub(" ", ",", content), header=FALSE)

# replacing ; with , to revert data in right format
contentList$V2 <- gsub(";", ",", contentList$V2)

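After the steps above, contentList is a data frame with the field name (including any leading tabs) in V1 and the value in V2.

As the next step I thought of using a function that builds a list following these rules:

  1. If the field has no \t, add it to the list (as a named vector)
  2. If the field has one or more \t, make it a sub-list of the previous record (as a named vector)

But I don't know how to implement this in R.

How can I implement this?

Or is there a better approach to this problem that allows fast search and fast access to the values?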

Raw data input

library(readr)  # read_lines() comes from readr (also attached by tidyverse below)

raw <- read_lines("company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics")

Make a tibble and get the indenture (indentation level)

library(tidyverse)
rawDf <- tibble(RAW = raw)

companyIndenture <- rawDf %>% 
    # keep only the company lines; backslashes must be doubled inside R strings
    filter(str_detect(RAW, "^\\s*company")) %>% 
    mutate(LVL = case_when(
      str_detect(RAW, "^\\s{8}") ~ 3,
      str_detect(RAW, "^\\s{4}") ~ 2,
      TRUE ~ 1),
      COMPANY = str_replace(RAW, "^\\s*company\\s", "")) %>% 
    select(-RAW)
# Gives us    
# A tibble: 4 x 2
# LVL COMPANY             
# <dbl> <chr>               
# 1     1 Samsung             
# 2     2 Harman International
# 3     3 JBL                 
# 4     1 Amazaon             
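
The case_when() above hard-codes three levels. If the data can nest deeper, the level can be computed from the number of leading spaces instead. The following is only a sketch in base R: the helper name indent_level, the width argument and the extra "Deeper" line are made up for illustration, and it assumes the indentation is made of spaces, 4 per level.

```r
# Hypothetical helper: indentation depth of each line,
# assuming `width` spaces per level (level 1 = no indentation).
indent_level <- function(x, width = 4L) {
  leading <- nchar(x) - nchar(sub("^ +", "", x))  # count of leading spaces
  leading %/% width + 1L
}

lines <- c("company Samsung",
           "    company Harman International",
           "        company JBL",
           "company Amazaon",
           "            company Deeper")  # made-up 4th-level line

indent_level(lines)
# [1] 1 2 3 1 4
```

The result could replace the LVL column computed with case_when() without changing the rest of the pipeline.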

Clean up whitespace

Now that we know the LVL of each company, let's strip the leading whitespace.

nextly <- rawDf %>% 
  mutate(RAW = str_replace(RAW, "^\\s*", "")) %>% 
  filter(RAW != "") %>% 
  separate(RAW, c("ATTR", "VALUE"), sep = " ", extra = "merge") %>% 
    # And bring the LVL back in
  left_join(companyIndenture, by = c("VALUE" = "COMPANY")) %>% 
  select(LVL, ATTR, VALUE)

# A tibble: 15 x 3
# LVL ATTR     VALUE                                                                     
# <dbl> <chr>    <chr>                                                                     
# 1     1 company  Samsung                                                                   
# 2    NA type     private                                                                   
# 3    NA based    South Korea                                                               
# 4     2 company  Harman International                                                      
# 5    NA type     private                                                                   
# 6    NA based    United States                                                             
# 7    NA industry Electronics                                                               
# 8     3 company  JBL                                                                       
# 9    NA type     subsidiary                                                                
# 10    NA based    United States                                                             
# 11    NA industry Audio                                                                     
# 12     1 company  Amazaon                                                                   
# 13    NA type     public                                                                    
# 14    NA based    United States                                                             
# 15    NA industry Cloud computing, e-commerce, artificial intelligence, consumer electronics

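Assign the hierarchy

Give every company a LVL.1, LVL.2, LVL.3 structure. The "" does the trick when we fill(): fill() only overwrites NA, so an empty string marks a level that does not apply and stops the parent's name from being carried further down.
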
further <- nextly %>% 
  mutate(LVL.1 = ifelse(LVL == 1, VALUE, NA_character_),
         LVL.2 = case_when(LVL == 1 ~ "",
                           LVL == 2 ~ VALUE,
                           TRUE ~ NA_character_),
         LVL.3 = ifelse(LVL == 3, VALUE, "")) %>% 
  fill(starts_with("LVL.")) %>% 
  filter(ATTR != "company") %>% 
  select(LVL.1, LVL.2, LVL.3, ATTR, VALUE)


# A tibble: 11 x 5
# LVL.1   LVL.2                LVL.3 ATTR     VALUE                                                                     
# <chr>   <chr>                <chr> <chr>    <chr>                                                                     
# 1 Samsung ""                   ""    type     private                                                                   
# 2 Samsung ""                   ""    based    South Korea                                                               
# 3 Samsung Harman International ""    type     private                                                                   
# 4 Samsung Harman International ""    based    United States                                                             
# 5 Samsung Harman International ""    industry Electronics                                                               
# 6 Samsung Harman International JBL   type     subsidiary                                                                
# 7 Samsung Harman International JBL   based    United States                                                             
# 8 Samsung Harman International JBL   industry Audio                                                                     
# 9 Amazaon ""                   ""    type     public                                                                    
# 10 Amazaon ""                   ""    based    United States                                                             
# 11 Amazaon ""                   ""    industry Cloud computing, e-commerce, artificial intelligence, consumer electronics
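
To see why the "" placeholder matters, here is a minimal base-R sketch of a fill-down (last observation carried forward). The helper name locf is hypothetical and this is not how tidyr implements fill(); it only mimics the behaviour relied on above, namely that NA is overwritten while "" is kept.

```r
# Hypothetical helper: carry the last non-NA value forward.
locf <- function(x) {
  # index of the most recent non-NA element (0 if none seen yet)
  idx <- cummax(ifelse(is.na(x), 0L, seq_along(x)))
  c(NA, x)[idx + 1L]
}

# LVL.2 as built above: "" on level-1 company rows, NA on attribute rows
with_blank <- c("", NA, "Harman International", NA, "", NA)
locf(with_blank)
# [1] "" "" "Harman International" "Harman International" "" ""

# With NA instead of "", the name would leak into the next top-level company
with_na <- c(NA, NA, "Harman International", NA, NA, NA)
locf(with_na)
# [1] NA NA "Harman International" "Harman International"
# [5] "Harman International" "Harman International"
```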

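Handle Amazon's multiple industries

Finally, str_split and unnest those 'industry' values for Amazon.

finally <- further %>% 
  mutate(VALUE = str_split(VALUE, ",\\s*")) %>% 
  unnest(VALUE)  # name the list column explicitly; a bare unnest() is deprecated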


# A tibble: 14 x 5
# LVL.1   LVL.2                LVL.3 ATTR     VALUE                  
# <chr>   <chr>                <chr> <chr>    <chr>                  
# 1 Samsung ""                   ""    type     private                
# 2 Samsung ""                   ""    based    South Korea            
# 3 Samsung Harman International ""    type     private                
# 4 Samsung Harman International ""    based    United States          
# 5 Samsung Harman International ""    industry Electronics            
# 6 Samsung Harman International JBL   type     subsidiary             
# 7 Samsung Harman International JBL   based    United States          
# 8 Samsung Harman International JBL   industry Audio                  
# 9 Amazaon ""                   ""    type     public                 
# 10 Amazaon ""                   ""    based    United States          
# 11 Amazaon ""                   ""    industry Cloud computing        
# 12 Amazaon ""                   ""    industry e-commerce             
# 13 Amazaon ""                   ""    industry artificial intelligence
# 14 Amazaon ""                   ""    industry consumer electronics   

Q.E.D.

LAGNAPPE

further %>% 
  spread(key = "ATTR", value = "VALUE") %>% 
  mutate(industry = str_split(industry, ",\\s*")) %>% 
  unnest(industry)
# A tibble: 7 x 6
#   LVL.1   LVL.2                LVL.3 based         type       industry               
#   <chr>   <chr>                <chr> <chr>         <chr>      <chr>                  
# 1 Amazaon ""                   ""    United States public     Cloud computing        
# 2 Amazaon ""                   ""    United States public     e-commerce             
# 3 Amazaon ""                   ""    United States public     artificial intelligence
# 4 Amazaon ""                   ""    United States public     consumer electronics   
# 5 Samsung ""                   ""    South Korea   private    NA                     
# 6 Samsung Harman International ""    United States private    Electronics            
# 7 Samsung Harman International JBL   United States subsidiary Audio

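Using content from the Note at the end, count the spaces at the start of each company line and use gsubfn to replace them with a level number, giving L2. Then, after removing the leading spaces, replace the first space on each line with a colon, giving L3. The text is now in dcf format, so read it with read.dcf, giving L4.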
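
For readers who have not met dcf before, here is a small hand-written illustration of what read.dcf expects and returns. The lines below are typed in by hand rather than produced by the code in this answer, so treat them as an assumed intermediate form: one field:value pair per line and a blank line between records.

```r
# Hand-written dcf text: two records separated by a blank line
dcf_lines <- c("level:1",
               "company:Samsung",
               "type:private",
               "",
               "level:2",
               "company:Harman International",
               "type:private",
               "industry:Electronics")

m <- read.dcf(textConnection(dcf_lines))

dim(m)                  # 2 records, 4 distinct fields
m[, "company"]          # one entry per record
m[1, "industry"]        # NA: the first record has no industry field
```

read.dcf returns a character matrix with one row per record and one column per field, which is why the level column has to be converted with as.numeric afterwards.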

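Now generate a variable lv that gives the level number as a number, and a sequential numeric id for each row. Compute the id of each row's parent, then build a data frame DF from what we have computed so far. The overall root of the tree is represented by 0. From DF generate the edge list e of a graph and convert it to an igraph. From that, generate the simple paths and create a data frame DF2 with the columns paths, company, type, based and industry, such that each row represents one node other than the root.

You could add lv and parent to DF2 if you like. We computed them but did not add them, since you may not need them.

The code below assumes 4 spaces per indentation level.

There is no limit on the depth of the levels.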
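
As a worked example of the parent rule, the parent of a row is the nearest preceding row with a smaller level, or 0 (the root) if there is none. The sketch below spells that rule out in base R. The function name parent_of and the second level vector are made up for illustration.

```r
# Hypothetical helper: for each row, the index of the nearest earlier row
# with a strictly smaller level, or 0 when no such row exists (root).
parent_of <- function(lv) {
  vapply(seq_along(lv), function(i) {
    shallower <- which(lv[seq_len(i)] < lv[i])
    if (length(shallower)) max(shallower) else 0L
  }, integer(1))
}

# Levels of Samsung, Harman International, JBL, Amazaon
parent_of(c(1, 2, 3, 1))
# [1] 0 1 2 0

# A made-up deeper layout with two siblings under the first record
parent_of(c(1, 2, 3, 2, 1, 2))
# [1] 0 1 2 1 0 5
```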

We can search DF2 using data frame operations for various text-based queries, e.g.

subset(DF2, grepl("Samsung", paths))  # Samsung and its descendants

or we can use igraph functions for graph queries on g, e.g.

max(length(get.diameter(g))) - 1   # max depth not counting root

or we can use data.tree functions for queries

dt$height -  1  # max depth not counting root
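
One caveat with the grepl query is that it matches substrings anywhere in the path, so a name that is part of a longer name would also match. A possible refinement is to compare whole path components. This is only a sketch: the helper name has_component is hypothetical, and the paths vector is typed in by hand to mirror the paths column of DF2.

```r
# Paths as they appear in the paths column of DF2 (typed in by hand here)
paths <- c("0/Samsung",
           "0/Samsung/Harman International",
           "0/Samsung/Harman International/JBL",
           "0/Amazaon")

# Hypothetical helper: TRUE where `name` is exactly one of the path components
has_component <- function(paths, name) {
  parts <- strsplit(paths, "/", fixed = TRUE)
  vapply(parts, function(p) name %in% p, logical(1))
}

paths[has_component(paths, "Harman International")]
# [1] "0/Samsung/Harman International"
# [2] "0/Samsung/Harman International/JBL"

paths[has_component(paths, "Harman")]   # partial names do not match
# character(0)
```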

Code

The code follows.

library(gsubfn)

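# Lines is defined in the Note at the end
content <- readLines(textConnection(Lines))
# L2: put a "level n" line before each company line (n = leading spaces / 4 + 1)
L2 <- gsubfn("( *)company", ~ paste0("level ", nchar(x) / 4L + 1L, "\ncompany"), content)
# L3: re-split on the embedded newlines, trim, turn the first space into a colon (dcf)
L3 <- sub(" ", ":", trimws(readLines(textConnection(L2))))
# L4: character matrix with one row per record
L4 <- read.dcf(textConnection(L3))
lv <- as.numeric(L4[, 1])  # numeric level of each record
id <- seq_along(lv)        # sequential record id
company <- L4[, "company"]
# parent: id of the nearest preceding record with a smaller level, or 0 (the root)
parent <- sapply(id, function(i) c(tail(which(lv[1:i] < lv[i]), 1), 0)[1])  

DF <- data.frame(id = company[id], parent = c("0", company)[parent+1], 
  level = lv, L4[, -1], stringsAsFactors = FALSE)
e <- with(DF, cbind(parent, id))  # two-column edge list: parent, child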

igraph

Now that we have an edge list, we can create an igraph object and process it using that package.

library(igraph)

g <- graph_from_edgelist(e)

p <- all_simple_paths(g, "0")
paths <- sapply(p, function(x) paste(names(x), collapse = "/"))
DF2 <- data.frame(paths, L4[, -1], stringsAsFactors = FALSE)
DF2

giving a paths column followed by the attributes of each node:

                               paths              company       type         based                                                                   industry
1                          0/Samsung              Samsung    private   South Korea                                                                       <NA>
2     0/Samsung/Harman International Harman International    private United States                                                                Electronics
3 0/Samsung/Harman International/JBL                  JBL subsidiary United States                                                                      Audio
4                          0/Amazaon              Amazaon     public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics
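
If igraph is not available, the same path strings can be obtained by walking the parent pointers. This is a sketch under the assumption that company names are unique. The id and parent vectors are typed in by hand to mirror the id and parent columns of DF, and path_to is a hypothetical helper name.

```r
# id / parent pairs as in DF (typed in by hand here); "0" is the root
id     <- c("Samsung", "Harman International", "JBL", "Amazaon")
parent <- c("0", "Samsung", "Harman International", "0")

# Hypothetical helper: follow parent pointers from a node up to the root
path_to <- function(node) {
  p <- node
  while (node != "0") {
    node <- parent[match(node, id)]  # step up one level
    p <- c(node, p)                  # prepend the ancestor
  }
  paste(p, collapse = "/")
}

vapply(id, path_to, character(1), USE.NAMES = FALSE)
# [1] "0/Samsung"
# [2] "0/Samsung/Harman International"
# [3] "0/Samsung/Harman International/JBL"
# [4] "0/Amazaon"
```

Because the paths are computed per row, they come out in the same order as the rows of DF.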

We can plot the graph like this:

plot(g, layout = layout_as_tree(g))

(continued after graph)

data.tree

We can also use data.tree and its many functions to process this:

library(data.tree)
library(DiagrammeR)

dt <- FromDataFrameNetwork(DF)
print(dt, "type", "based", "industry")

giving:

                     levelName       type         based                                                                   industry
1 0                                                                                                                               
2  ¦--Samsung                     private   South Korea                                                                           
3  ¦   °--Harman International    private United States                                                                Electronics
4  ¦       °--JBL              subsidiary United States                                                                      Audio
5  °--Amazaon                      public United States Cloud computing, e-commerce, artificial intelligence, consumer electronics

We can plot or convert the data.tree object as follows:

plot(dt)  # plot in browser
ToListSimple(dt) # convert to nested list
ToListExplicit(dt) # similar but children in children component
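
If a plain nested list is all that is needed, it can also be built without data.tree by recursing over the parent column. This is a minimal sketch, not what ToListSimple does internally. The nodes data frame is typed in by hand with only the type attribute, build is a hypothetical helper name, and it assumes company names are unique and never collide with an attribute name such as type.

```r
# A cut-down copy of DF with a single attribute (typed in by hand here)
nodes <- data.frame(
  id     = c("Samsung", "Harman International", "JBL", "Amazaon"),
  parent = c("0", "Samsung", "Harman International", "0"),
  type   = c("private", "private", "subsidiary", "public"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: nested list of all descendants of parent id `p`
build <- function(p) {
  kids <- nodes[nodes$parent == p, ]
  if (nrow(kids) == 0) return(list())
  out <- lapply(seq_len(nrow(kids)), function(i) {
    # own attributes first, then the children as named sub-lists
    c(list(type = kids$type[i]), build(kids$id[i]))
  })
  names(out) <- kids$id
  out
}

tree <- build("0")
tree$Samsung$`Harman International`$JBL$type
# [1] "subsidiary"
names(tree)
# [1] "Samsung" "Amazaon"
```

Access by name with $ or [[ then follows the hierarchy directly, which is close to the named-vector sub-list layout described in the question.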

Note

We can create content reproducibly like this:

Lines <- "
company Samsung
type private
based South Korea

    company Harman International
    type private
    based United States
    industry Electronics

        company JBL
        type subsidiary
        based United States
        industry Audio

company Amazaon
type public
based United States
industry Cloud computing, e-commerce, artificial intelligence, consumer electronics"

content <- readLines(textConnection(Lines))