R: Web scraping JSON, extracting information from nested objects
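I'm trying to extract information from JSON using tidyJSON, but I'm open to any R package that gets the job done. I've gone through the documentation and the vignettes, and found the complex example helpful. However, the information I want is nested inside something that isn't a regular key-value pair, and I'm not sure how to access it. I'm interested in getting appid, name, developer, etc., but this information sits inside 570 and 730: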
{"570":{"appid":570,"name":"Dota 2","developer":"Valve","publisher":"Valve","score_rank":71,"owners":102151578,"owners_variance":259003,"players_forever":102151578,"players_forever_variance":259003,"players_2weeks":9436299,"players_2weeks_variance":89979,"average_forever":11727,"average_2weeks":1229,"median_forever":277,"median_2weeks":662,"ccu":811259,"price":"0","tags":{"Free to Play":22678,"MOBA":7808,"Strategy":7415,"Multiplayer":6757,"Team-Based":4848,"Action":4602,"e-sports":4089,"Online Co-Op":3669,"Competitive":3553,"PvP":2655,"RTS":2267,"Difficult":2129,"RPG":2114,"Fantasy":2044,"Tower Defense":2024,"Co-op":1898,"Character Customization":1514,"Replay Value":1487,"Action RPG":1397,"Simulation":1024}},
"730":{"appid":730,"name":"Counter-Strike: Global Offensive","developer":"Valve","publisher":"Valve","score_rank":78,"owners":29225079,"owners_variance":154335,"players_forever":28552354,"players_forever_variance":152685,"players_2weeks":9102348,"players_2weeks_variance":88410,"average_forever":17648,"average_2weeks":791,"median_forever":5030,"median_2weeks":358,"ccu":543626,"price":"1499","tags":{"FPS":17082,"Multiplayer":13744,"Shooter":12833,"Action":10881,"Team-Based":10369,"Competitive":9664,"Tactical":8529,"First-Person":7329,"e-sports":6716,"PvP":6383,"Online Co-Op":5714,"Military":4621,"Co-op":4435,"Strategy":4424,"War":4361,"Realistic":3196,"Trading":3191,"Difficult":3158,"Fast-Paced":3100,"Moddable":2496}}
There are thousands of entries like this. Is there a way to skip the "top-level" and look inside the nest?
The JSON comes from http://steamspy.com/api.php?request=top100in2weeks
This may be what you need:
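library(jsonlite)
# Parse the response into a named list: one element per app id ("570", "730", ...)
data <- fromJSON("http://steamspy.com/api.php?request=top100in2weeks")
# Pull the same field out of every nested entry
appid <- lapply(data, function(x) x$appid)
name <- lapply(data, function(x) x$name)
# unlist() keeps the app ids as names, which become the row names
df <- data.frame(appid = unlist(appid),
                 name = unlist(name),
                 stringsAsFactors = FALSE)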
Result:
> head(df)
        appid                             name
570       570                           Dota 2
730       730 Counter-Strike: Global Offensive
578080 578080    PLAYERUNKNOWN'S BATTLEGROUNDS
440       440                  Team Fortress 2
271590 271590               Grand Theft Auto V
433850 433850           H1Z1: King of the Kill
I'll leave adding the remaining fields to you.
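For the remaining fields, one option is to convert every scalar field at once instead of writing one lapply per column. The following is only a sketch: the small hand-built `data` list stands in for what fromJSON returns (a named list of per-game lists), the name `games` is made up for illustration, and it assumes every entry carries the same non-null scalar fields.

```r
# Hand-built stand-in for the parsed response: a named list of per-game lists.
data <- list(
  "570" = list(appid = 570L, name = "Dota 2", developer = "Valve",
               price = "0", tags = list("Free to Play" = 22678L, MOBA = 7808L)),
  "730" = list(appid = 730L, name = "Counter-Strike: Global Offensive",
               developer = "Valve", price = "1499",
               tags = list(FPS = 17082L, Multiplayer = 13744L))
)

# Drop the nested `tags` element and turn the remaining scalars into one row.
rows <- lapply(data, function(x) {
  as.data.frame(x[names(x) != "tags"], stringsAsFactors = FALSE)
})

# Stack the one-row data frames and label the rows with the app ids.
games <- do.call(rbind, rows)
rownames(games) <- names(data)
```

With the real response you would replace the hand-built list with the result of fromJSON; the rest should carry over as long as the assumption about uniform fields holds.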
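Edit: adding arrays to the data frame
You can also add each game's tag information to the data frame, including how many times each tag was applied. For each game, you have to store the array of tag names in one column and the array of tag counts in another.
Add the following lines after the definition of df: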
# Store each game's tag names and tag counts as list columns
for (k in seq_len(nrow(df))) {
  df$tags[k] <- list(names(data[[k]]$tags))
  df$tagsQ[k] <- list(unlist(data[[k]]$tags))
}
This gives you:
> df["570",]
appid name
570 570 Dota 2
tags
570 Free to Play, MOBA, Strategy, Multiplayer, Team-Based, Action, e-sports, Online Co-Op, Competitive, PvP, RTS, Difficult, RPG, Fantasy, Tower Defense, Co-op, Character Customization, Replay Value, Action RPG, Simulation
tagsQ
570 22686, 7810, 7420, 6759, 4850, 4603, 4092, 3672, 3555, 2657, 2267, 2130, 2116, 2045, 2024, 1898, 1514, 1487, 1397, 1023
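The same two list columns can also be built without an explicit loop by assigning the result of lapply directly. This is a sketch on a hand-built `data` list that mimics the parsed response (an assumption, trimmed to two games and two tags each), not a replacement for the code above.

```r
# Hand-built stand-in for the parsed response.
data <- list(
  "570" = list(appid = 570L, name = "Dota 2",
               tags = list("Free to Play" = 22678L, MOBA = 7808L)),
  "730" = list(appid = 730L, name = "Counter-Strike: Global Offensive",
               tags = list(FPS = 17082L, Multiplayer = 13744L))
)

# vapply() keeps the app ids as names, which data.frame() uses as row names.
df <- data.frame(appid = vapply(data, function(x) x$appid, integer(1)),
                 name  = vapply(data, function(x) x$name, character(1)),
                 stringsAsFactors = FALSE)

# A list with one element per row becomes a list column.
df$tags  <- lapply(data, function(x) names(x$tags))   # character vectors
df$tagsQ <- lapply(data, function(x) unlist(x$tags))  # named integer vectors
```

Because tagsQ holds named vectors, a count can be looked up by tag name as well as by position, for example df["570", "tagsQ"][[1]][["MOBA"]].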
Here the tags and tagsQ columns contain lists. To get the second tag and its count for appid 570, do the following:
> df["570","tags"][[1]][2]
[1] "MOBA"
> df["570","tagsQ"][[1]][2]
MOBA
7810
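If indexing into list columns gets awkward, an alternative layout is a separate "long" table with one row per game and tag. The sketch below again uses a hand-built `data` list in place of the parsed response; the name `tags_long` is made up for illustration, and it assumes every game has at least one tag.

```r
# Hand-built stand-in for the parsed response.
data <- list(
  "570" = list(appid = 570L, name = "Dota 2",
               tags = list("Free to Play" = 22678L, MOBA = 7808L)),
  "730" = list(appid = 730L, name = "Counter-Strike: Global Offensive",
               tags = list(FPS = 17082L, Multiplayer = 13744L))
)

# Build one small data frame per game: appid repeated, one row per tag.
tag_rows <- lapply(names(data), function(id) {
  counts <- unlist(data[[id]]$tags)
  data.frame(appid = data[[id]]$appid,
             tag   = names(counts),
             count = unname(counts),
             stringsAsFactors = FALSE)
})

# Stack them into a single table.
tags_long <- do.call(rbind, tag_rows)
```

Filtering then becomes ordinary subsetting, such as tags_long[tags_long$tag == "MOBA", ], and the table can be joined back to df on appid with merge().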