How to scrape websites based on the site's title using Python?

Question

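I am scraping websites to find ones that have a specific title. How would I go about doing this? For example, how would I check whether "example.com/xxxxxxxxxx", where each "x" is a random digit, has the title 404?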

Answer 1

To get the page title:

import requests
from lxml.html import fromstring

def Get_PageTitle(url):
    req = requests.get(url)
    tree = fromstring(req.content)
    title = tree.findtext('.//title')
    return title


url = "http://www.google.com"
title = Get_PageTitle(url)

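# findtext returns None when the page has no <title>, so guard against that
if title is not None and "404" in title:
    # title contains 404
    print("Title has 404 in it")

else:
    # no 404 in title (or the page has no title)
    pass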

Edit:

The code above checks whether the title has 404 in it. If you want to know whether the title is exactly 404, use this code:

import requests
from lxml.html import fromstring

def Get_PageTitle(url):
    req = requests.get(url)
    tree = fromstring(req.content)
    title = tree.findtext('.//title')
    return title


url = "http://www.google.com"
title = Get_PageTitle(url)

# compare with ==, not "is": "is" tests object identity, not string equality
if title == "404":
    # title is exactly 404
    print("Title is 404")
    print(title)

else:
    # title is not 404
    pass
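
The question asks about many URLs of the form "example.com/xxxxxxxxxx" rather than a single page. Below is a minimal sketch of how the two checks above could be run over a batch of such URLs. The names title_matches, random_urls and urls_with_title are hypothetical helpers, not part of requests or lxml, and get_title stands in for any function that returns a page title, such as Get_PageTitle above. Stripping whitespace before the exact comparison is an assumption about how the titles are formatted.

```python
import random


def title_matches(title, wanted="404", exact=False):
    """Return True if the title contains (or, with exact=True, equals) `wanted`."""
    if title is None:  # the page had no <title> element
        return False
    title = title.strip()  # assumption: surrounding whitespace is not significant
    return title == wanted if exact else wanted in title


def random_urls(base, digits=10, count=5, rng=random):
    """Yield `count` URLs made of `base` plus a random `digits`-digit suffix."""
    for _ in range(count):
        number = "".join(rng.choice("0123456789") for _ in range(digits))
        yield f"{base}/{number}"


def urls_with_title(urls, get_title, wanted="404", exact=False):
    """Return the URLs whose title matches; `get_title` fetches each title."""
    return [url for url in urls if title_matches(get_title(url), wanted, exact)]
```

With the function from the answer, a call such as urls_with_title(random_urls("http://example.com"), Get_PageTitle, exact=True) would return the generated URLs whose title is exactly 404. Passing the title function in as an argument keeps the matching logic independent of the network code, so it can be tried out with a plain dictionary lookup first.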

How to get page title in requests

python

screen-scraping

http