Recursively parse all category links and get all products
I've been playing around with web scraping (using Python 3.6.2 for this exercise) and I feel like I'm losing it a bit. Given this example link, here's what I want to do:
First, as you can see, there are multiple categories on the page. Clicking any of these categories leads to further categories, then more, and so on, until I reach a products page. So I have to go x levels deep. I thought recursion would help me achieve this, but somewhere I did something wrong.
Code:
Here, I'll explain how I approached the problem. First, I created a session and a simple generic function that returns an lxml.html.HtmlElement object:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()


def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)
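For example, just to show what the helper gives back (this little snippet is only an illustration, not part of the scraper):

page = get_page(TEST_LINK)
# the returned HtmlElement supports .xpath() directly,
# e.g. grabbing the page title as a quick sanity check
print(page.xpath('//title/text()'))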
Then, I figured I'd need two more functions:
- one to get the category links
- and another one to get the product links
To distinguish between the two, I noticed that only the category pages have, every time, a heading containing CATEGORIES, so I used that:
def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None
def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]
Now, the only thing left was the recursive part, where I'm sure I did something wrong:
def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)
def main():
    main_page = get_page(TEST_LINK)

    for links in read_all_categories(main_page):
        print(links)
Here is all of the code put together:
from lxml import html
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/62.0.3202.94 Safari/537.36"
}

TEST_LINK = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'

session_ = Session()


def get_page(url):
    page = session_.get(url, headers=HEADERS).text
    return html.fromstring(page)


def read_categories(page):
    categs = []
    try:
        if 'CATEGORIES' in page.xpath('//div[@class="boxData"][2]/h2')[0].text.strip():
            for a in page.xpath('//*[@id="carouselSegment2b"]//li//a'):
                categs.append(a.attrib["href"])
            return categs
        else:
            return None
    except Exception:
        return None


def read_products(page):
    return [
        a_tag.attrib["href"]
        for a_tag in page.xpath("//ul[@id='prodResult']/li//div[@class='imgWrapper']/a")
    ]


def read_all_categories(page):
    cat = read_categories(page)
    if not cat:
        yield read_products(page)
    else:
        yield from read_all_categories(page)


def main():
    main_page = get_page(TEST_LINK)

    for links in read_all_categories(main_page):
        print(links)


if __name__ == '__main__':
    main()
Could someone point me in the right direction regarding the recursive function?
Here is how I would solve this problem. Note what goes wrong in your version: read_all_categories() recurses on the same page object instead of fetching the category links it just found, so it can never make progress, and yield read_products(page) yields the whole list at once rather than the individual links.
from lxml import html as html_parser
from requests import Session

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36"
}


def dig_up_products(url, session=Session()):
    html = session.get(url, headers=HEADERS).text
    page = html_parser.fromstring(html)

    # if it appears to be a categories page, recurse into each category link
    for link in page.xpath('//h2[contains(., "CATEGORIES")]/'
                           'following-sibling::div[@id="carouselSegment1b"]//li//a'):
        yield from dig_up_products(link.attrib["href"], session)

    # if it appears to be a products page, yield the product links
    for link in page.xpath('//ul[@id="prodResult"]/li//div[@class="imgWrapper"]/a'):
        yield link.attrib["href"]


def main():
    start = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    for link in dig_up_products(start):
        print(link)


if __name__ == '__main__':
    main()
There is nothing wrong with iterating over the empty result of an XPath expression, so you can simply put both cases (category page / products page) into the same function, as long as the XPath expressions are specific enough to identify each case.
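To illustrate that point (a self-contained snippet, separate from the scraper itself):

from lxml import html as html_parser

page = html_parser.fromstring("<p>no products or categories here</p>")

# an XPath query that matches nothing returns an empty list,
# so the loop body is simply skipped: no exception, no None check needed
for link in page.xpath('//ul[@id="prodResult"]//a'):
    print(link.attrib["href"])  # never reached on this page

Note also that the session=Session() default argument is evaluated only once, so every recursive call reuses the same session, which is exactly what we want here.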
You can also do it like this to make your script a little more concise. I used the lxml library along with CSS selectors to do the job. The script parses all the links under the category and digs until it hits a dead end; when that happens, it parses the title from there and repeats the whole process over and over until all the links are exhausted.
from lxml.html import fromstring
import requests


def products_links(link):
    res = requests.get(link, headers={"User-Agent": "Mozilla/5.0"})
    page = fromstring(res.text)

    try:
        # check for the product headings available on the target page
        for item in page.cssselect(".contentHeading h1"):
            print(item.text)
    except Exception:
        pass

    for link in page.cssselect("h2:contains('CATEGORIES')+[id^='carouselSegment'] .touchcarousel-item a"):
        products_links(link.attrib["href"])


if __name__ == '__main__':
    main_page = 'https://www.richelieu.com/us/en/category/custom-made-cabinet-doors-and-drawers/1000128'
    products_links(main_page)
Partial results:
BRILLANTÉ DOORS
BRILLANTÉ DRAWER FRONTS
BRILLANTÉ CUT TO SIZE PANELS
BRILLANTÉ EDGEBANDING
LACQUERED ZENIT DOORS
ZENIT CUT-TO-SIZE PANELS
EDGEBANDING
ZENIT CUT-TO-SIZE PANELS
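By the way, if you want the product links themselves at the dead ends instead of the headings, you could reuse the XPath from the question translated into a CSS selector (my untested variation; I'm assuming the selector matches the same markup the question targets):

# inside products_links(), at a dead end, instead of printing the headings:
for a in page.cssselect("ul#prodResult li div.imgWrapper a"):
    print(a.attrib["href"])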