路透社使用 rvest 在 R 中抓取数据,找到 CSS 选择器
reuters data scraping in R with rvest, find CSS selector
是的,我知道有类似的问题,我已经阅读了答案并尝试了我可以实现的答案。所以,如果这个问题很愚蠢,请提前道歉:)
我正在从路透社抓取公司董事会成员的年龄以获取公司名单。
这是 link:http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT
我正在使用 rvest 库和 selectorgadget 来查找合适的 CSS selector。
这是代码:
library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")
d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()
结果是
character(0)
我觉得我说错了CSSselect或者。你能告诉我如何 select table 吗?
您需要使用 html_session
才能正确加载数据:
library(rvest)
url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()
site %>% html_node('#companyNews:first-child table') %>% html_table()
## Name Age Since Current Position
## 1 John Thompson 66 2014 Independent Chairman of the Board
## 2 Bradford Smith 57 2015 President, Chief Legal Officer
## 3 Satya Nadella 48 2014 Chief Executive Officer, Director
## 4 William Gates 60 2014 Founder and Technology Advisor, Director
## 5 Amy Hood 43 2013 Chief Financial Officer, Executive Vice President
## 6 Christopher Capossela 45 2014 Executive Vice President, Chief Marketing Officer
## 7 Kathleen Hogan 49 2014 Executive Vice President - Human Resources
## 8 Margaret Johnson 54 2014 Executive Vice President - Business Development
## 9 Ifeanyi Amah NA 2016 Chief Technology Officer
## 10 Keith Lorizio NA 2016 Vice President - North America Sales
## 11 Teri List-Stoll 53 2014 Independent Director
## 12 G. Mason Morfit 40 2014 Independent Director
## 13 Charles Noski 63 2003 Independent Director
## 14 Helmut Panke 69 2003 Independent Director
## 15 Charles Scharf 50 2014 Independent Director
## 16 John Stanton 60 2014 Independent Director
## 17 Chris Suh NA NA General Manager - Investor Relations
是的,我知道有类似的问题,我已经阅读了答案并尝试了我可以实现的答案。所以,如果这个问题很愚蠢,请提前道歉:)
我正在从路透社抓取公司董事会成员的年龄以获取公司名单。 这是 link:http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT
我正在使用 rvest 库和 selectorgadget 来查找合适的 CSS selector。 这是代码:
library(rvest)
d = read_html("http://www.reuters.com/finance/stocks/companyOfficers?symbol=GAZP.RTS")
d %>% html_nodes("#companyNews:nth-child(1) td:nth-child(2)") %>% html_text()
结果是
character(0)
我觉得我说错了CSSselect或者。你能告诉我如何 select table 吗?
您需要使用 html_session
才能正确加载数据:
library(rvest)
url <- 'http://www.reuters.com/finance/stocks/companyOfficers?symbol=MSFT.O'
site <- html_session(url) %>% read_html()
site %>% html_node('#companyNews:first-child table') %>% html_table()
## Name Age Since Current Position
## 1 John Thompson 66 2014 Independent Chairman of the Board
## 2 Bradford Smith 57 2015 President, Chief Legal Officer
## 3 Satya Nadella 48 2014 Chief Executive Officer, Director
## 4 William Gates 60 2014 Founder and Technology Advisor, Director
## 5 Amy Hood 43 2013 Chief Financial Officer, Executive Vice President
## 6 Christopher Capossela 45 2014 Executive Vice President, Chief Marketing Officer
## 7 Kathleen Hogan 49 2014 Executive Vice President - Human Resources
## 8 Margaret Johnson 54 2014 Executive Vice President - Business Development
## 9 Ifeanyi Amah NA 2016 Chief Technology Officer
## 10 Keith Lorizio NA 2016 Vice President - North America Sales
## 11 Teri List-Stoll 53 2014 Independent Director
## 12 G. Mason Morfit 40 2014 Independent Director
## 13 Charles Noski 63 2003 Independent Director
## 14 Helmut Panke 69 2003 Independent Director
## 15 Charles Scharf 50 2014 Independent Director
## 16 John Stanton 60 2014 Independent Director
## 17 Chris Suh NA NA General Manager - Investor Relations