How to scrape links from Wikipedia with Python
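I'm trying to scrape all the battle links from Wikipedia's "List of Naval Battles" page using Python. The problem is that I can't figure out how to export every link containing "/wiki/Battle" to my CSV file. I'm used to C++, so Python is a bit foreign to me. Any ideas?
Here's what I have so far...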
from bs4 import BeautifulSoup
import urllib2
rootUrl = "https://en.wikipedia.org/wiki/List_of_naval_battles"
def get_soup(url,header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url,headers=header)),'html.parser')
# soup settings
url = rootUrl + item
header={'User-Agent':"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url,header)
battle = soup.findAll("/wiki/Battle")
Try this:
from bs4 import BeautifulSoup as bs
import requests
res = requests.get("https://en.wikipedia.org/wiki/List_of_naval_battles")
soup = bs(res.text, "html.parser")
naval_battles = {}
for link in soup.find_all("a"):
    url = link.get("href", "")
    if "/wiki/Battle" in url:
        naval_battles[link.text.strip()] = url
print(naval_battles)
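The snippet above only prints the dictionary, while the question asks for a CSV file. Here is a minimal sketch of that last step using the standard csv module. It assumes a dict shaped like naval_battles (link text mapped to href); the function name save_links_to_csv, the BASE_URL constant, the column names, and the output filename are all made up for illustration. Since the hrefs on the page are relative, it also resolves them against the site root with urljoin.

```python
import csv
from urllib.parse import urljoin

# Assumed base for resolving relative hrefs such as /wiki/Battle_of_...
BASE_URL = "https://en.wikipedia.org"


def save_links_to_csv(links, path):
    """Write a {link text: href} dict to a two-column CSV file."""
    # newline="" lets the csv module handle line endings itself
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "url"])  # header row (hypothetical column names)
        for title, href in links.items():
            # Turn the relative href into an absolute URL before writing it
            writer.writerow([title, urljoin(BASE_URL, href)])


if __name__ == "__main__":
    # Placeholder data standing in for the naval_battles dict built earlier
    sample = {"Battle of Salamis": "/wiki/Battle_of_Salamis"}
    save_links_to_csv(sample, "naval_battles.csv")
```

In the real script you would pass naval_battles instead of the sample dict. Keep in mind that a dict keyed by link text keeps only one URL per distinct text, so if that matters, collecting (text, url) tuples in a list may be the safer choice.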