Beautiful Soup parser can't find links
I am trying to parse an HTML document with Beautiful Soup to find its links, but I'm seeing some strange behavior. The page is http://people.csail.mit.edu/gjtucker/ . Here is my code:
from bs4 import BeautifulSoup
import requests

url = 'http://people.csail.mit.edu/gjtucker/'
user_agent = {'User-agent': 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.52 Safari/537.17'}
t = requests.get(url, headers=user_agent).text
soup = BeautifulSoup(t, 'html.parser')
for link in soup.find_all('a'):
    print(link['href'])
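This prints only two links, http://www.amazon.jobs/team/speech-amazon and https://scholar.google.com/citations?user=-gJkPHIAAAAJ&hl=en, even though the page clearly contains many more.
Can anyone reproduce this? Is there a specific reason it happens with this URL? Some other URLs work fine.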
The page's HTML is not well-formed. Use a more lenient parser such as html5lib (a separate package, installed with pip install html5lib):
soup = BeautifulSoup(t, 'html5lib')
for link in soup.find_all('a'):
    print(link['href'])
This prints:
http://www.amazon.jobs/team/speech-amazon
https://scholar.google.com/citations?user=-gJkPHIAAAAJ&hl=en
http://www.linkedin.com/pub/george-tucker/6/608/3ba
...
http://www.hsph.harvard.edu/alkes-price/
...
http://www.nature.com/ng/journal/v47/n3/full/ng.3190.html
http://www.biomedcentral.com/1471-2105/14/299
pdfs/journal.pone.0029095.pdf
pdfs/es201187u.pdf
pdfs/sigtrans.pdf
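As a follow-up to the fix above, here is a minimal sketch of a more defensive link extractor. The function name extract_links and its parameters are hypothetical, not part of Beautiful Soup. It assumes you want to prefer html5lib but fall back to another parser when it is not installed, skip <a> tags that have no href (where link['href'] would raise KeyError), and optionally resolve relative links such as pdfs/sigtrans.pdf against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup, FeatureNotFound


def extract_links(html, base_url=None,
                  parsers=("html5lib", "lxml", "html.parser")):
    """Return the href of every <a> tag in html.

    extract_links is a hypothetical helper for illustration only.
    Parsers are tried in order; the first one that is installed is used.
    """
    soup = None
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
            break
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    if soup is None:
        raise RuntimeError("none of the requested parsers is installed")

    # href=True matches only anchors that actually carry an href attribute.
    hrefs = [a["href"] for a in soup.find_all("a", href=True)]

    if base_url is not None:
        # Resolve relative links against the page URL; absolute links
        # are returned unchanged by urljoin.
        hrefs = [urljoin(base_url, h) for h in hrefs]
    return hrefs
```

With the variables from the question, this could be called as extract_links(t, base_url=url). Note that when the fallback reaches html.parser, malformed pages may again yield fewer links, so installing html5lib is still the actual fix.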