Unsorted Json file to data frame Rfv

I am trying to import a compressed JSON file in bz2 format and convert it into a data frame (links to the file and a dput example below). I have had some success with these lines of code:

library(jsonlite)
out <- lapply(readLines("RC_2005-12.bz2"), fromJSON)
df <- data.frame(matrix(unlist(out), nrow = length(out), byrow = TRUE))

out is a nested list of named entries. However, the named entries do not appear in the same order in every sublist, so the columns of df end up mixing different entries. In the dput example below, controversiality is the first entry of the first sublist, while created_utc is the first entry of the second sublist. As a result, the first column of df looks like:

X1
0
1134365725

This should of course be a column of two zeros, corresponding to the controversiality of each sublist. How can I order/sort/regularize the sublists so that the columns match up? Alternatively, how can I match on names when converting the list to a df?
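One way (a sketch, using a toy two-record list in place of the real out) is to index every sublist by a canonical name order before flattening, so fields line up positionally:

```r
# Toy stand-in for out: two records with the same names in different orders
out <- list(
  list(controversiality = 0, created_utc = 1134365188),
  list(created_utc = 1134365725, controversiality = 0)
)

# Take the first record's names as the canonical column order, then
# index every sublist by name so every record has the same layout.
cols <- names(out[[1]])
out_sorted <- lapply(out, function(x) x[cols])

df <- data.frame(matrix(unlist(out_sorted),
                        nrow = length(out_sorted), byrow = TRUE))
names(df) <- cols
df$controversiality  # a column of two zeros, as expected
```

Note that unlist silently drops NULL entries (e.g. author_flair_text), which would shift columns in the real data; the jsonlite and corpus approaches in the answers do not have that problem.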

The full data file RC_2005-12.bz2 is available at http://files.pushshift.io/reddit/comments/

The first two sublists of out:

 list(structure(list(controversiality = 0, body = "A look at Vietnam and Mexico exposes the myth of market liberalisation.", 
subreddit_id = "t5_6", link_id = "t3_17863", stickied = FALSE, 
subreddit = "reddit.com", score = 2, ups = 2, author_flair_css_class = NULL, 
created_utc = 1134365188, author_flair_text = NULL, author = "frjo", 
id = "c13", edited = FALSE, parent_id = "t3_17863", gilded = 0, 
distinguished = NULL, retrieved_on = 1473738411), .Names = c("controversiality", "body", "subreddit_id", "link_id", "stickied", "subreddit", "score", "ups", "author_flair_css_class", "created_utc", "author_flair_text", "author", "id", "edited", "parent_id", "gilded", "distinguished", "retrieved_on")), structure(list(created_utc = 1134365725, author_flair_css_class = NULL, score = 1, ups = 1, subreddit = "reddit.com", stickied = FALSE, link_id = "t3_17866", subreddit_id = "t5_6", controversiality = 0, body = "The site states \"What can I use it for? Meeting notes, Reports, technical specs Sign-up sheets, proposals and much more...\", just like any other new breeed of sites that want us to store everything we have on the web. And they even guarantee multiple levels of security and encryption etc. But what prevents these web site operators fom accessing and/or stealing Meeting notes, Reports, technical specs Sign-up sheets, proposals and much more, for competitive or personal gains...? I am pretty sure that most of them are honest, but what's there to prevent me from setting up a good useful site and stealing all your data? Call me paranoid - I am.", 
retrieved_on = 1473738411, distinguished = NULL, gilded = 0, 
id = "c14", edited = FALSE, parent_id = "t3_17866", author = "zse7zse", 
author_flair_text = NULL), .Names = c("created_utc", "author_flair_css_class", "score", "ups", "subreddit", "stickied", "link_id", "subreddit_id", "controversiality", "body", "retrieved_on", "distinguished", "gilded", "id", "edited", "parent_id", "author", "author_flair_text")))

Your file appears to contain one JSON object per line. We can modify your JSON slightly to create a single JSON array and let jsonlite::fromJSON do the dirty work. Something like:

require(jsonlite)
lines <- paste0("[", paste(readLines("RC_2005-12.bz2"), collapse = ","), "]")
str(fromJSON(lines))
#'data.frame':  1075 obs. of  18 variables:
# $ controversiality      : int  0 0 0 0 0 0 0 0 0 0 ...
#...

The read_ndjson function from the corpus package doesn't care about the order in which the fields appear:

data <- corpus::read_ndjson(bzfile("RC_2005-12.bz2"))

An unrelated issue that needs fixing:

It looks like whoever produced this file made a mistake: the text is encoded in UTF-8, but they treated it as if it were Latin-1. See, for example, record 8:

data$body[8]
#> [1] "I donâ\u0080\u0099t know where they came up with this stuff, but Qube Web Search Client has taken the market by surprise. This is a cool concept thatâ\u0080\u0099s just beginning to blossom. You can save time by copying and pasting words and phrases."

Fix it by first undoing the Latin-1-to-UTF-8 conversion they thought they were performing:

body <- iconv(data$body, "UTF-8", "Latin1")

Then set the correct encoding:

Encoding(body) <- "UTF-8"

Check the result:

body[8]
#> [1] "I don’t know where they came up with this stuff, but Qube Web Search Client has taken the market by surprise. This is a cool concept that’s just beginning to blossom. You can save time by copying and pasting words and phrases."

Make sure the result is valid UTF-8:

all(utf8::utf8_valid(body))
#> [1] TRUE

Put the fixed text back into the data:

data$body <- body

Other character fields in your data may need the same fix.
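To apply the same repair to every character column at once, a sketch (fix_encoding is a hypothetical helper name, and the demo string stands in for the double-encoded text; this assumes the whole file shares the same mojibake):

```r
# Undo the bogus Latin-1 -> UTF-8 re-encoding, then relabel the bytes
fix_encoding <- function(x) {
  y <- iconv(x, "UTF-8", "Latin1")
  Encoding(y) <- "UTF-8"
  y
}

# Demo: "don't" with a curly apostrophe, double-encoded as in record 8
data <- data.frame(body = "don\u00e2\u0080\u0099t", id = "c21",
                   stringsAsFactors = FALSE)

char_cols <- vapply(data, is.character, logical(1))
data[char_cols] <- lapply(data[char_cols], fix_encoding)
data$body  # the apostrophe is restored
```

Pure-ASCII fields pass through fix_encoding unchanged, so it is safe to run over every character column.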