How to scrape websites based on the site's title using Python?
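I am scraping websites to find the ones that have a specific title.
How would I do this? For example, how do I check whether "example.com/xxxxxxxxxx", where each "x" is a random digit, has the title 404?
To find the page title: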
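import requests
from lxml.html import fromstring


def Get_PageTitle(url):
    """Download the page at `url` and return its <title> text (None if there is no title)."""
    req = requests.get(url)
    tree = fromstring(req.content)
    title = tree.findtext('.//title')
    return title


url = "http://www.google.com"
title = Get_PageTitle(url)
# findtext returns None when the page has no <title>, so guard before using `in`.
if title is not None and "404" in title:
    # title has 404
    print("Title has 404 in it")
else:
    # no 404 in title
    pass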
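The title check can also be tried without any network access by parsing an HTML string directly. The sketch below is not the code from this answer: it swaps lxml for the standard library's html.parser so it runs with no extra packages, and the names TitleParser and title_from_html are hypothetical. It returns None when the page has no title element, which is the case that a plain `"404" in title` test would trip over.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; later ones are ignored.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # None means "no complete <title> element was seen".
        return "".join(self._parts).strip() if self._done else None


def title_from_html(html):
    """Return the stripped title text of an HTML string, or None if it has no title."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


page = "<html><head><title>404 Not Found</title></head><body></body></html>"
title = title_from_html(page)
if title is not None and "404" in title:
    print("Title has 404 in it")
```

Because the parsing is separate from the download, the same function can be applied to `req.text` from requests once the logic behaves as expected on sample strings.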
Edit:
The code above checks whether the title has 404 in it. If you want to know whether the title is exactly 404, use this code:
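import requests
from lxml.html import fromstring


def Get_PageTitle(url):
    """Download the page at `url` and return its <title> text (None if there is no title)."""
    req = requests.get(url)
    tree = fromstring(req.content)
    title = tree.findtext('.//title')
    return title


url = "http://www.google.com"
title = Get_PageTitle(url)
# Compare with ==, not `is`: `is` tests object identity, not string equality.
# strip() ignores whitespace around the title text.
if title is not None and title.strip() == "404":
    # title is 404
    print("Title is 404")
    print(title)
else:
    # title is not 404
    pass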
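To apply the check to many addresses of the form "example.com/xxxxxxxxxx", one option is to generate the candidate URLs and filter them by title. This is only a sketch under assumptions: random_urls and urls_with_title are hypothetical names, the ten-digit suffix is read off the example in the question, and the title lookup is passed in as a function (for instance Get_PageTitle) so the filtering logic stays independent of how pages are fetched.

```python
import random


def random_urls(base, digits=10, count=5, rng=None):
    """Yield `count` URLs made of `base` plus a zero-padded random number."""
    rng = rng or random.Random()
    for _ in range(count):
        number = rng.randrange(10 ** digits)
        yield f"{base}{number:0{digits}d}"


def urls_with_title(urls, get_title, wanted="404", exact=False):
    """Return the URLs whose title contains `wanted` (or equals it if `exact`)."""
    matches = []
    for url in urls:
        try:
            title = get_title(url)
        except Exception:
            # The lookup is supplied by the caller, so its error types are
            # unknown here; skip URLs that fail and keep going.
            continue
        if title is None:
            # Page has no <title> element.
            continue
        title = title.strip()
        found = (title == wanted) if exact else (wanted in title)
        if found:
            matches.append(url)
    return matches
```

With the function from this answer, a call could look like `urls_with_title(random_urls("http://example.com/"), Get_PageTitle, exact=True)`. Each candidate costs one HTTP request, so keeping `count` small and pausing between requests is kinder to the target site.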