How to clean data where the variable name and property are in the same cell?

Question

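I need to clean data in which the variable attribute and the answer for each location are stored together in a single cell. The only consistent thing in my dataset is that the two are separated by a colon (:). I need to remap the data so that the variable attributes become column headers and the values are mapped to each location.

I have attached an example:

There can also be a bunch of other, unrelated symbols. I only need to extract the string before the colon and the string or integer after the colon, so that it maps correctly to each location.

How can I do this in R? Which functions should I use?

Sample data: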

Example1    Sunny:"TRUE"    NearCoast:False Schools:{"13"} 2
Example2    NearCoast:False Schools:{"6"}   Sunny:"FALSE" 3
Example3    Schools:{"2"}   Sunny:"TRUE"    NearCoast:TRUE Transport:5

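Also, can I add exceptions during this process? For example, a cell that contains only a number would be ignored. Or, if the attribute name is a specific one, such as "transport", that cell would be ignored as well.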

Answer 1

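For lack of a reproducible example, I can only offer guidance. Assuming you can read the data in tabular form as shown in the second image, this can be done with the dplyr and tidyr packages in a few "simple" steps: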

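library(dplyr)
library(tidyr)

df <- read.table(...)

# gather() takes the name of the new key column first, then the new value column,
# so the "key:value" cells end up in `keypair`.
df %>% gather(column, keypair, 2:4) %>%
  separate(keypair, into=c('key','value'), sep=':') %>%
  mutate(value=gsub('["{}]', '', value)) %>%   # strip quotes and braces
  select(-column) %>%                           # drop the old column names before spreading
  spread(key, value)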

Go through it line by line, and try to understand what is happening before you run the next line.
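To make the line-by-line advice concrete, here is a minimal sketch that stores each intermediate result under its own name so it can be inspected. It assumes the sample rows from the question are read with `read.table()`, which produces the default column names `V1` to `V5`. The names `step1` to `step4` are made up for illustration, and `quote = ""` is an added precaution so the literal double quotes are kept as plain characters.

```r
library(dplyr)
library(tidyr)

# Sample rows from the question; quote = "" disables quote handling.
df <- read.table(text = '
Example1 Sunny:"TRUE" NearCoast:False Schools:{"13"} 2
Example2 NearCoast:False Schools:{"6"} Sunny:"FALSE" 3
Example3 Schools:{"2"} Sunny:"TRUE" NearCoast:TRUE Transport:5',
  header = FALSE, stringsAsFactors = FALSE, quote = "")

# Step 1: wide to long, one "key:value" cell per row (columns V2 to V4 only).
step1 <- gather(df, column, keypair, 2:4)

# Step 2: split each cell at the colon into a key and a value.
step2 <- separate(step1, keypair, into = c("key", "value"), sep = ":")

# Step 3: strip double quotes and braces from the values.
step3 <- mutate(step2, value = gsub('["{}]', "", value))

# Step 4: keep only the id, key and value, then go long to wide.
step4 <- step3 %>% select(V1, key, value) %>% spread(key, value)

print(step4)
```

Printing `step1`, `step2` and `step3` in turn shows what each verb contributes. Restricting `gather()` to columns 2 to 4 is also what keeps the trailing column (the bare numbers and `Transport:5`) out of the result in this sketch.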

Answer 2

Try this example. As mentioned in the comments, we can reshape from wide to long, split the strings on :, and then reshape from long back to wide.

df1 <- read.table(text = '
Example1    Sunny:"TRUE"    NearCoast:False Schools:{"13"} 2
Example2    NearCoast:False Schools:{"6"}   Sunny:"FALSE" 3
Example3    Schools:{"2"}   Sunny:"TRUE"    NearCoast:TRUE Transport:5',
                  header = FALSE, stringsAsFactors = FALSE)


library(tidyverse)

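# wide to long: every column except V1 becomes a (k, v) pair
gather(df1, key = "k", value = "v", -V1) %>%
  # split on ":"; cells without a colon (bare numbers) get value = NA, with a warning
  separate(v, into = c("type", "value"), sep = ":") %>%
  filter(!is.na(value)) %>%   # drop the bare-number cells
  select(-k) %>%              # the original column name is no longer needed
  spread(key = type, value = value)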

#         V1 NearCoast Schools   Sunny Transport
# 1 Example1     False  {"13"}  "TRUE"      <NA>
# 2 Example2     False   {"6"} "FALSE"      <NA>
# 3 Example3      TRUE   {"2"}  "TRUE"         5
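The question also asks about exceptions (ignoring cells that are just a number, and ignoring a specific attribute such as "transport") and about the leftover quotes and braces. The following is only a sketch of one way to handle those, not part of the answer above. It uses `pivot_longer()` and `pivot_wider()`, which supersede `gather()` and `spread()` in current tidyr. The `ignore_keys` vector and the `result` name are hypothetical, and matching the attribute name case-insensitively is an assumption.

```r
library(dplyr)
library(tidyr)

df1 <- read.table(text = '
Example1 Sunny:"TRUE" NearCoast:False Schools:{"13"} 2
Example2 NearCoast:False Schools:{"6"} Sunny:"FALSE" 3
Example3 Schools:{"2"} Sunny:"TRUE" NearCoast:TRUE Transport:5',
  header = FALSE, stringsAsFactors = FALSE, quote = "")

# Hypothetical list of attribute names to skip (compared in lower case).
ignore_keys <- c("transport")

result <- df1 %>%
  # pivot_longer() needs a common type, so make every column character first.
  mutate(across(everything(), as.character)) %>%
  pivot_longer(-V1, names_to = "k", values_to = "v") %>%
  # Exception 1: a cell without a colon (e.g. a bare number) is ignored.
  filter(grepl(":", v, fixed = TRUE)) %>%
  # Split at the first colon only; any further colons stay in the value.
  separate(v, into = c("type", "value"), sep = ":", extra = "merge") %>%
  # Exception 2: ignore attributes on the skip list.
  filter(!tolower(type) %in% ignore_keys) %>%
  # Strip double quotes and braces from the values.
  mutate(value = gsub('["{}]', "", value)) %>%
  select(-k) %>%
  pivot_wider(names_from = type, values_from = value)

print(result)
```

Filtering on the colon before calling `separate()` avoids the "missing pieces" warning that the bare-number cells would otherwise trigger. All values remain character strings here; converting them to logical or integer would be a separate step.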

Tags: r, data-cleaning