Max of 10 items due to "View The Full list" button
link = http://fortune.com/worlds-most-admired-companies/2016/
So, I want every 'href' inside a div with a known class name.
I can't get past this:
import bs4 as bs
import urllib.request
raw = urllib.request.urlopen('http://fortune.com/worlds-most-admired-companies/2016/')
soup = bs.BeautifulSoup(raw, 'lxml')
listdiv = soup.find('div', class_="company-franchise-result-content current")
for url in listdiv.find_all('a'):
    print(url.get('href'))
Previously I used:
for a in soup.find_all('a'):
    print(a.get('href'))
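It works, but it only returns 10 items, from Apple to General Electric. I get the same 10 items even when I use the link I land on after clicking the "View the Full list" button.
I don't know anything about how JSON works, but it looks like this is heading in that direction.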
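The complete data is actually inside the HTML. It sits in a JavaScript object within a script tag. You can locate this script tag, get its text, extract the JSON string, load it into a Python data structure with json.loads(), and pull out the data you need: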
In [1]: from bs4 import BeautifulSoup
In [2]: import json
In [3]: import re
In [4]: import requests
In [5]: url = "http://fortune.com/worlds-most-admired-companies/2016/"
In [6]: response = requests.get(url)
In [7]: soup = BeautifulSoup(response.content, "lxml")
In [8]: pattern = re.compile(r"var fortune_wp_vars = ({.*?});", re.DOTALL | re.MULTILINE)
In [9]: script = soup.find("script", text=pattern)
In [10]: data = json.loads(pattern.search(script.get_text()).group(1))
In [11]: companies = data["bootstrap"]["franchise"]["filtered_sorted_data"]
In [12]: for company in companies:
    ...:     print(company["title"])
    ...:
Apple
Alphabet
...
Yum Brands
ZF Friedrichshafen
Zurich Insurance Group
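The extraction step can also be tried offline. The following is a minimal sketch, not the page's actual markup: SAMPLE_HTML is a made-up miniature page that only imitates the structure described above, "Example}; Corp" is an invented entry, and extract_js_object is a hypothetical helper name. It uses only the standard library and swaps the non-greedy regex for json.JSONDecoder().raw_decode(), which parses one JSON value from a given position and ignores what follows, so a "};" inside a string does not cut the match short.

```python
import json
import re

# Made-up miniature page imitating the structure described above.
# The real page, its variable contents, and its field names may differ.
SAMPLE_HTML = """
<html><body>
<script>var unrelated = 1;</script>
<script>
var fortune_wp_vars = {"bootstrap": {"franchise": {"filtered_sorted_data": [
    {"title": "Apple"},
    {"title": "Alphabet"},
    {"title": "Example}; Corp"}
]}}};
</script>
</body></html>
"""


def extract_js_object(html, var_name):
    """Return the JSON value assigned to `var_name` in `html` as Python data."""
    # Find "var <name> =" and consume the whitespace after "=", so the match
    # ends exactly where the JSON value begins.
    match = re.search(r"var\s+" + re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"variable {var_name!r} not found")
    # raw_decode parses a single JSON value starting at the given index and
    # returns it together with the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(html, match.end())
    return obj


data = extract_js_object(SAMPLE_HTML, "fortune_wp_vars")
companies = data["bootstrap"]["franchise"]["filtered_sorted_data"]
titles = [company["title"] for company in companies]
print(titles)
```

With a live page, the same helper could be called on response.text instead of SAMPLE_HTML, assuming the variable is still present and its value is valid JSON.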