Scraping: cannot access information from the web
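I am scraping some information from this URL: https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon#description-tab
Everything worked fine until I tried to scrape the description.
I have tried again and again, but so far every attempt has failed.
It seems I cannot reach that information. Here is my code: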
html = urllib.urlopen("https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon")
tree=BeautifulSoup(html, "lxml")
description=tree.find('div',{'id':'description_section','class':'description-section'})
Does anyone have any suggestions?
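I use the XML package (R) for web scraping, and I could not get the description section either, just as you describe with BeautifulSoup.
However, if you only want to scrape this one page, you can download the page and then run:
page = htmlTreeParse("Lunar Lion - the first ever university-led mission to the Moon _ RocketHub.html",
                     useInternal = TRUE, encoding = "utf8")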
unlist(xpathApply(page, '//div[@id="description_section"]', xmlValue))
I also tried downloading the page from R, and description_section was not found that way either:
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
download.file(url, "page.html", mode = "w")
Maybe some options need to be added to the download.file call. Hopefully an HTML expert can help.
You need to make an additional request to get the description. Here is a complete working example using requests + BeautifulSoup:
import requests
from bs4 import BeautifulSoup
url = "https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/"
with requests.Session() as session:
    session.headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    }

    # get the token
    response = session.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    token = soup.find("meta", {"name": "csrf-token"})["content"]

    # get the description
    description_url = url + "description"
    response = session.get(description_url, headers={"X-CSRF-Token": token, "X-Requested-With": "XMLHttpRequest"})
    soup = BeautifulSoup(response.content, "html.parser")
    description = soup.find('div', {'id': 'description_section', 'class': 'description-section'})
    print(description.get_text(strip=True))
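One fragile spot in the example above is the token lookup: if the page has no csrf-token meta tag, soup.find returns None and indexing it with ["content"] raises a TypeError. Below is a minimal sketch of a more defensive lookup. It uses the standard-library html.parser rather than BeautifulSoup, only so that it runs without extra packages; the helper name find_csrf_token and the sample markup are made up for illustration and are not taken from the real page.

```python
from html.parser import HTMLParser


class _MetaTokenParser(HTMLParser):
    """Collects the content of <meta name="csrf-token" content="...">."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Only the first matching meta tag is kept.
        if tag != "meta" or self.token is not None:
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "csrf-token":
            self.token = attr_map.get("content")


def find_csrf_token(html_text):
    """Return the csrf-token value, or None if the page has no such tag."""
    parser = _MetaTokenParser()
    parser.feed(html_text)
    return parser.token


# Hypothetical sample markup, not copied from the real page.
sample = '<html><head><meta name="csrf-token" content="abc123"></head></html>'
print(find_csrf_token(sample))                        # abc123
print(find_csrf_token("<html><head></head></html>"))  # None
```

With a helper like this, the caller can check for None and report a clear error before making the second request.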
I found out how to scrape it with R:
library("rvest")
url="https://www.rockethub.com/projects/34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon/description"
url %>%
  read_html() %>%
  html_nodes(xpath = '//div[@id="description_section"]') %>%
  html_text()
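Both answers request the .../description URL, while the link in the question ends in #description-tab. A URL fragment (the part after #) is never sent to the server, so the question's link fetches only the main project page. The sketch below shows one way to derive the description URL from either form of the link in Python; the function name description_url is hypothetical, and the /description suffix is taken from the answers above.

```python
from urllib.parse import urlsplit, urlunsplit


def description_url(project_url):
    """Drop any #fragment and append /description to the project path."""
    parts = urlsplit(project_url)
    # rstrip avoids a double slash when the link already ends in "/".
    path = parts.path.rstrip("/") + "/description"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


link = ("https://www.rockethub.com/projects/"
        "34210-lunar-lion-the-first-ever-university-led-mission-to-the-moon"
        "#description-tab")
print(description_url(link))
```

This also avoids relying on a trailing slash the way url + "description" does.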