网络抓取口袋妖怪数据
Webscraping Pokemon Data
我想知道每个口袋妖怪(第一代)可以学习的招数。
我发现以下网站包含此信息:https://pokemondb.net/pokedex/game/red-blue-yellow
这里列出了 151 个口袋妖怪 - 对于每个口袋妖怪,他们的移动集都列在模板页面上,如下所示:https://pokemondb.net/pokedex/bulbasaur/moves/1
由于我使用的是 R,因此我尝试获取这 150 个口袋妖怪 (https://docs.google.com/document/d/1fH_n_BPbIk1bZCrK1hLAJrYPH2d5RTy9IgdR5Ck_lNw/edit#) 中每一个的网站地址:
names = c("Bulbasaur","Ivysaur","Venusaur","Charmander","Charmeleon","Charizard","Squirtle","Wartortle","Blastoise","Caterpie","Metapod","Butterfree","Weedle","Kakuna","Beedrill",
"Pidgey","Pidgeotto","Pidgeot","Rattata","Raticate","Spearow","Fearow","Ekans","Arbok","Pikachu","Raichu","Sandshrew","Sandslash","Nidoran","Nidorina","Nidoqueen","Nidorino","Nidoking",
"Clefairy","Clefable","Vulpix","Ninetales","Jigglypuff","Wigglytuff","Zubat","Golbat","Oddish","Gloom","Vileplume","Paras","Parasect","Venonat","Venomoth","Diglett","Dugtrio","Meowth","Persian",
"Psyduck","Golduck","Mankey","Primeape","Growlithe","Arcanine","Poliwag","Poliwhirl","Poliwrath","Abra","Kadabra","Alakazam","Machop","Machoke","Machamp","Bellsprout","Weepinbell","Victreebel","Tentacool",
"Tentacruel","Geodude","Graveler","Golem","Ponyta","Rapidash","Slowpoke","Slowbro","Magnemite","Magneton","Farfetch’d","Doduo","Dodrio","Seel","Dewgong","Grimer","Muk","Shellder","Cloyster","Gastly","Haunter",
"Gengar","Onix","Drowzee","Hypno","Krabby","Kingler","Voltorb","Electrode","Exeggcute","Exeggutor","Cubone","Marowak","Hitmonlee","Hitmonchan","Lickitung","Koffing","Weezing","Rhyhorn","Rhydon","Chansey","Tangela",
"Kangaskhan","Horsea","Seadra","Goldeen","Seaking","Staryu","Starmie","Mr.Mime","Scyther","Jynx","Electabuzz","Magmar","Pinsir","Tauros","Magikarp","Gyarados","Lapras","Ditto"
,"Eevee","Vaporeon","Jolteon","Flareon","Porygon","Omanyte","Omastar","Kabuto","Kabutops","Aerodactyl","Snorlax","Articuno","Zapdos","Moltres","Dratini","Dragonair","Dragonite","Mewtwo","Mew")
template_1 = rep("https://pokemondb.net/pokedex/",150)
template_2 = rep("/moves/1",150)
pokemon_websites = data.frame(template_1, names, template_2)
pokemon_websites$full_website = paste(pokemon_websites$template_1, pokemon_websites$names, pokemon_websites$template_2)
接下来,我删除所有空格:
library(stringr)
pokemon_websites$full_website = str_remove_all( pokemon_websites$full_website," ")
现在,我有一个包含所有网站名称的列:
head(pokemon_websites)
template_1 names template_2 full_website
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1
我想数一数这150只神奇宝贝每只能学会多少招式。比如第一只神奇宝贝“妙蛙种子”可以学会24招:
最后,我想在前面的数据框中添加一列,其中包含每个口袋妖怪可以学习的动作数。例如,看起来像这样的东西:
> head(pokemon_websites)
template_1 names template_2 full_website number_of_moves
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1 24
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1 ???
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1 ???
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1 ???
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1 ???
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1 ???
- 有没有办法在 R 中对这些数据进行网络抓取,计算 150 只神奇宝贝中每只的移动次数,然后将此移动计数放入一列中?
现在我正在手工做这个,需要很长时间!另外,我听说有些网站不允许自动网页抓取 - 如果这个网站 (https://pokemondb.net/pokedex/game/red-blue-yellow) 不允许网页抓取,我可以尝试找到另一个可能允许的网站。
谢谢!
您可以使用如下方式为每个口袋妖怪抓取所有表格:
tables =lapply(pokemon_websites$full_website,function(link) {
tryCatch(
read_html(link) %>% html_nodes("table") %>% html_table(),
error = function(e) {}, warning=function(w) {}
)
})
但是,请注意返回的表格数量因每个口袋妖怪而异。例如,第一个有 6 个表 - 前三个用于 Red/Blue,后三个用于黄色。
lengths(tables)
[1] 6 6 6 6 6 6 6 6 6 2 4 7 2 4 8 6 6 6 4 4 6 6 6 6 6 8 6 6 0 4 8 4 8 6 8 4 6 6 8 4 4 6 6 8 6 6 5 5 5 5 4 4 6 6 6
[56] 6 4 6 6 6 8 6 6 6 6 6 6 6 6 8 6 6 6 6 6 4 4 6 6 6 6 0 6 6 6 6 4 4 6 8 4 4 6 6 6 6 6 6 6 6 4 8 6 7 6 6 6 4 4 6
[111] 6 6 6 6 6 6 6 6 6 8 0 6 4 6 6 6 6 2 8 6 2 4 8 8 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
由于 OP 只想计算 Red/Blue
选项卡中的移动,我们可以执行以下操作,(如果您需要两个选项卡中的移动,请遵循@langtang 的回答)
tables1 =lapply(pokemon_websites$full_website, function(x){
tryCatch( x %>% read_html() %>% html_nodes('.active') %>% html_nodes('.resp-scroll') %>% html_table(),
error = function(e) NULL
)
})
moves= lapply(tables1, function(x) lapply(x, function(x) dim(x)[1]))
moves = lapply(moves, unlist, use.names=FALSE)
moves = lapply(moves, sum) %>% unlist()
[1] 24 25 27 32 33 37 32 33 37 2 3 30 2 3 26 22 23 25 24 27 21 23 24 26 29 30 27 29 0 28 43 28 44 41 42 22 23 40 41 19 22 21 23 25 23 26 22 29 20 23 24
[52] 27 31 34 31 34 23 25 25 36 37 25 34 35 29 31 32 23 24 26 28 31 30 31 33 22 26 37 46 23 26 0 23 26 25 28 22 24 27 29 20 20 32 25 33 36 25 27 24 27 24 29
[103] 32 35 26 26 37 19 21 27 42 44 24 36 22 24 25 27 33 34 0 21 34 35 30 23 27 2 34 34 1 19 32 32 28 30 22 28 22 30 24 43 25 25 22 30 32 36 45 60
我想知道每个口袋妖怪(第一代)可以学习的招数。
我发现以下网站包含此信息:https://pokemondb.net/pokedex/game/red-blue-yellow
这里列出了 151 个口袋妖怪 - 对于每个口袋妖怪,他们的移动集都列在模板页面上,如下所示:https://pokemondb.net/pokedex/bulbasaur/moves/1
由于我使用的是 R,因此我尝试获取这 150 个口袋妖怪 (https://docs.google.com/document/d/1fH_n_BPbIk1bZCrK1hLAJrYPH2d5RTy9IgdR5Ck_lNw/edit#) 中每一个的网站地址:
names = c("Bulbasaur","Ivysaur","Venusaur","Charmander","Charmeleon","Charizard","Squirtle","Wartortle","Blastoise","Caterpie","Metapod","Butterfree","Weedle","Kakuna","Beedrill",
"Pidgey","Pidgeotto","Pidgeot","Rattata","Raticate","Spearow","Fearow","Ekans","Arbok","Pikachu","Raichu","Sandshrew","Sandslash","Nidoran","Nidorina","Nidoqueen","Nidorino","Nidoking",
"Clefairy","Clefable","Vulpix","Ninetales","Jigglypuff","Wigglytuff","Zubat","Golbat","Oddish","Gloom","Vileplume","Paras","Parasect","Venonat","Venomoth","Diglett","Dugtrio","Meowth","Persian",
"Psyduck","Golduck","Mankey","Primeape","Growlithe","Arcanine","Poliwag","Poliwhirl","Poliwrath","Abra","Kadabra","Alakazam","Machop","Machoke","Machamp","Bellsprout","Weepinbell","Victreebel","Tentacool",
"Tentacruel","Geodude","Graveler","Golem","Ponyta","Rapidash","Slowpoke","Slowbro","Magnemite","Magneton","Farfetch’d","Doduo","Dodrio","Seel","Dewgong","Grimer","Muk","Shellder","Cloyster","Gastly","Haunter",
"Gengar","Onix","Drowzee","Hypno","Krabby","Kingler","Voltorb","Electrode","Exeggcute","Exeggutor","Cubone","Marowak","Hitmonlee","Hitmonchan","Lickitung","Koffing","Weezing","Rhyhorn","Rhydon","Chansey","Tangela",
"Kangaskhan","Horsea","Seadra","Goldeen","Seaking","Staryu","Starmie","Mr.Mime","Scyther","Jynx","Electabuzz","Magmar","Pinsir","Tauros","Magikarp","Gyarados","Lapras","Ditto"
,"Eevee","Vaporeon","Jolteon","Flareon","Porygon","Omanyte","Omastar","Kabuto","Kabutops","Aerodactyl","Snorlax","Articuno","Zapdos","Moltres","Dratini","Dragonair","Dragonite","Mewtwo","Mew")
template_1 = rep("https://pokemondb.net/pokedex/",150)
template_2 = rep("/moves/1",150)
pokemon_websites = data.frame(template_1, names, template_2)
pokemon_websites$full_website = paste(pokemon_websites$template_1, pokemon_websites$names, pokemon_websites$template_2)
接下来,我删除所有空格:
library(stringr)
pokemon_websites$full_website = str_remove_all( pokemon_websites$full_website," ")
现在,我有一个包含所有网站名称的列:
head(pokemon_websites)
template_1 names template_2 full_website
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1
我想数一数这150只神奇宝贝每只能学会多少招式。比如第一只神奇宝贝“妙蛙种子”可以学会24招:
最后,我想在前面的数据框中添加一列,其中包含每个口袋妖怪可以学习的动作数。例如,看起来像这样的东西:
> head(pokemon_websites)
template_1 names template_2 full_website number_of_moves
1 https://pokemondb.net/pokedex/ Bulbasaur /moves/1 https://pokemondb.net/pokedex/Bulbasaur/moves/1 24
2 https://pokemondb.net/pokedex/ Ivysaur /moves/1 https://pokemondb.net/pokedex/Ivysaur/moves/1 ???
3 https://pokemondb.net/pokedex/ Venusaur /moves/1 https://pokemondb.net/pokedex/Venusaur/moves/1 ???
4 https://pokemondb.net/pokedex/ Charmander /moves/1 https://pokemondb.net/pokedex/Charmander/moves/1 ???
5 https://pokemondb.net/pokedex/ Charmeleon /moves/1 https://pokemondb.net/pokedex/Charmeleon/moves/1 ???
6 https://pokemondb.net/pokedex/ Charizard /moves/1 https://pokemondb.net/pokedex/Charizard/moves/1 ???
- 有没有办法在 R 中对这些数据进行网络抓取,计算 150 只神奇宝贝中每只的移动次数,然后将此移动计数放入一列中?
现在我正在手工做这个,需要很长时间!另外,我听说有些网站不允许自动网页抓取 - 如果这个网站 (https://pokemondb.net/pokedex/game/red-blue-yellow) 不允许网页抓取,我可以尝试找到另一个可能允许的网站。
谢谢!
您可以使用如下方式为每个口袋妖怪抓取所有表格:
tables =lapply(pokemon_websites$full_website,function(link) {
tryCatch(
read_html(link) %>% html_nodes("table") %>% html_table(),
error = function(e) {}, warning=function(w) {}
)
})
但是,请注意返回的表格数量因每个口袋妖怪而异。例如,第一个有 6 个表 - 前三个用于 Red/Blue,后三个用于黄色。
lengths(tables)
[1] 6 6 6 6 6 6 6 6 6 2 4 7 2 4 8 6 6 6 4 4 6 6 6 6 6 8 6 6 0 4 8 4 8 6 8 4 6 6 8 4 4 6 6 8 6 6 5 5 5 5 4 4 6 6 6
[56] 6 4 6 6 6 8 6 6 6 6 6 6 6 6 8 6 6 6 6 6 4 4 6 6 6 6 0 6 6 6 6 4 4 6 8 4 4 6 6 6 6 6 6 6 6 4 8 6 7 6 6 6 4 4 6
[111] 6 6 6 6 6 6 6 6 6 8 0 6 4 6 6 6 6 2 8 6 2 4 8 8 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6
由于 OP 只想计算 Red/Blue
选项卡中的移动,我们可以执行以下操作,(如果您需要两个选项卡中的移动,请遵循@langtang 的回答)
tables1 =lapply(pokemon_websites$full_website, function(x){
tryCatch( x %>% read_html() %>% html_nodes('.active') %>% html_nodes('.resp-scroll') %>% html_table(),
error = function(e) NULL
)
})
moves= lapply(tables1, function(x) lapply(x, function(x) dim(x)[1]))
moves = lapply(moves, unlist, use.names=FALSE)
moves = lapply(moves, sum) %>% unlist()
[1] 24 25 27 32 33 37 32 33 37 2 3 30 2 3 26 22 23 25 24 27 21 23 24 26 29 30 27 29 0 28 43 28 44 41 42 22 23 40 41 19 22 21 23 25 23 26 22 29 20 23 24
[52] 27 31 34 31 34 23 25 25 36 37 25 34 35 29 31 32 23 24 26 28 31 30 31 33 22 26 37 46 23 26 0 23 26 25 28 22 24 27 29 20 20 32 25 33 36 25 27 24 27 24 29
[103] 32 35 26 26 37 19 21 27 42 44 24 36 22 24 25 27 33 34 0 21 34 35 30 23 27 2 34 34 1 19 32 32 28 30 22 28 22 30 24 43 25 25 22 30 32 36 45 60