Python urllib: grabbing site content in a specific language

Question

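I want to get some content from the website www.gyte.edu.tr. The site is in Turkish, but when you click the language selection button, whose address is www.gyte.edu.tr?cl=2, it switches to English. I want my code to go to http://www.gyte.edu.tr/kategori/54/9/laboratories.aspx, collect all the laboratory links, then visit every laboratory page and get information from those pages. With the code below it gets the information in Turkish, but not in English.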

import urllib
from bs4 import BeautifulSoup

urllib.urlopen("http://www.gyte.edu.tr?cl=2")
linkler = urllib.urlopen("http://www.gyte.edu.tr?cl=2/kategori/54/9/laboratories.aspx")
site = linkler.read()
linkler.close()
link_list = []

soup1 = BeautifulSoup(site)
a_text = soup1.find("div","block news-area")

for link in a_text.find_all('a'):
    link_list.append(link.get('href'))
for l in link_list:
    s = urllib.urlopen(l)
    s1 = s.read()   
    s.close()
    soup3 = BeautifulSoup(s1)
    soup3 = soup3.table
    soup3 = str(soup3)
    # append this page's table to the output file
    with open("table.html", 'a') as f:
        f.write(soup3)

So how can I grab the English content?

Answer 1

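They are setting a cookie, so the language choice persists for the whole session. A plain urllib.urlopen call does not keep cookies between requests, so each request is served in the default language (Turkish).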

import requests
s = requests.Session()
#Sets language to english and saves cookie in Session s
s.get('http://www.gyte.edu.tr/?cl=2')
#Page in english
r = s.get("http://www.gyte.edu.tr/kategori/54/9/laboratories.aspx")
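
If installing requests is not an option, the same cookie handling can probably be reproduced with the standard library. The following is a minimal Python 3 sketch, assuming the site really does store the language choice in a cookie set by the ?cl=2 request; make_opener and fetch_english are hypothetical helper names, not part of any library.

```python
import http.cookiejar
import urllib.request

BASE = "http://www.gyte.edu.tr"  # site from the question


def make_opener():
    """Return (opener, jar); the opener stores cookies and resends them."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def fetch_english(path, opener=None):
    """Select English first, then fetch `path` with the same cookies."""
    if opener is None:
        opener, _ = make_opener()
    # First request: the server is assumed to reply with a language cookie.
    opener.open(BASE + "/?cl=2").close()
    # Second request: the cookie jar sends that cookie back automatically.
    with opener.open(BASE + path) as resp:
        return resp.read()
```

Calling fetch_english("/kategori/54/9/laboratories.aspx") should then return the English page, provided the cookie assumption holds. Reusing one opener for all later requests keeps the language setting, much like the Session object above.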

More about requests.Session():

http://docs.python-requests.org/en/latest/user/advanced/
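
Once the English page has been fetched, the laboratory links still have to be collected and, if they are relative, resolved against the page URL before they can be opened. Here is a rough Python 3 sketch using only the standard library; LinkCollector is a hypothetical helper, and the div class "block news-area" is taken from the question's code rather than verified against the live site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect <a href> values found inside <div class="block news-area">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.depth = 0   # <div> nesting depth inside the target block
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.depth:
                self.depth += 1
            elif attrs.get("class") == "block news-area":
                self.depth = 1
        elif tag == "a" and self.depth and attrs.get("href"):
            # Turn relative hrefs into absolute URLs.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
```

To use it, create LinkCollector with the URL of the laboratories page, pass the decoded HTML to its feed method, and read the links attribute. Each link can then be requested with the same cookie-aware session so the laboratory pages also come back in English.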


python

urllib