Want to scrape Flipkart
In short, I have to scrape Flipkart and store the data in MongoDB.
First, get a free hosted MongoDB server with MongoDB Atlas. Use Python's pymongo library to test that you are able to connect to it.
Second, install Scrapy and use its documentation to familiarize yourself with scraping using the Scrapy framework.
Then, go to the 2 URLs below:
Women's footwear https://www.flipkart.com/womens-footwear/pr?sid=osp,iko&otracker=nmenu_sub_Women_0_Footwear
Each page has 40 products, and you must scrape up to 25 pages of URLs from each start page (about 2000 products) and store the data in MongoDB (database: , collection: cart). The data should be inserted into MongoDB directly from the Scrapy framework using a Scrapy MongoDB pipeline.
Each product you scrape should have the following data:
name [stored as a string]
brand [stored as a string]
original_price [stored as a float]
sale_price [stored as a float]
image_url [stored as a string]
product_page_url [stored as a string]
product_category [stored as a string] [it can contain 2 values, "women's footwear" or "men's topwear"]
But I am only able to scrape the brand, title, sale price and product URL. The original price comes back as two strings and keeps getting mismatched, and I can't save the data in MongoDB. Can anyone help me?
from ..items import FlipkartItem
import scrapy


class FlipkartscrapySpider(scrapy.Spider):
    name = 'flipkartscrapy'

    def start_requests(self):
        urls = [
            'https://www.flipkart.com/clothing-and-accessories/topwear/pr?sid=clo%2Cash&otracker=categorytree&p%5B%5D=facets.ideal_for%255B%255D%3DMen&page={}',
            'https://www.flipkart.com/womens-footwear/pr?sid=osp%2Ciko&otracker=nmenu_sub_Women_0_Footwear&page={}',
        ]
        for url in urls:
            for i in range(1, 25):
                x = url.format(i)
                yield scrapy.Request(url=x, callback=self.parse)

    def parse(self, response):
        items = FlipkartItem()
        name = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "IRpwTa", " " ))]').xpath('text()').getall()
        brand = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_2WkVRV", " " ))]').xpath('text()').getall()
        original_price = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_3I9_wc", " " ))]').xpath('text()').getall()
        sale_price = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_30jeq3", " " ))]').xpath('text()').getall()
        image_url = response.css('._1a8UBa').css('::attr(src)').getall()
        product_page_url = response.css('._13oc-S > div').css('::attr(href)').getall()
        items['name'] = name
        items['brand'] = brand
        items['original_price'] = original_price
        items['sale_price'] = sale_price
        items['image_url'] = image_url
        items['product_page_url'] = 'https://www.flipkart.com' + str(product_page_url)
        yield items
The original_price output looks like this:
['₹', '999', '₹', '1,499', '₹', '1,888', '₹', '2,199', '₹', '1,499', '₹', '1,069', '₹', '1,099', '₹', '1,999', '₹', '2,598', '₹', '1,299', '₹', '1,999', '₹', '899', '₹', '1,099', '₹', '1,699', '₹', '1,399', '₹', '999', '₹', '999', '₹', '1,999', '₹', '1,099', '₹', '1,199', '₹', '999', '₹', '999', '₹', '1,999', '₹', '1,287', '₹', '999', '₹', '1,199', '₹', '899', '₹', '999', '₹', '1,849', '₹', '1,499', '₹', '999', '₹', '999', '₹', '899', '₹', '1,999', '₹', '1,849', '₹', '3,499', '₹', '2,397', '₹', '899', '₹', '1,999']
The original_price HTML is: <div class="_3I9_wc">₹<!-- -->2,199</div>. Because the text is interrupted by an HTML comment (<!-- -->), XPath returns 2 text nodes instead of one.
Workaround:
original_price = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_3I9_wc", " " ))]').xpath('text()').getall()
original_price = [price for price in original_price if price != '₹']
Side note: you can also simplify the original_price XPath:
original_price = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "_3I9_wc", " " ))]/text()').getall()
Alternatively, re-join the two text nodes of each pair so the currency symbol is kept:
items['original_price'] = [original_price[i] + original_price[i + 1] for i in range(0, len(original_price), 2)]
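Either way, the task requires original_price and sale_price to be stored as floats, so the scraped strings still need the currency symbol and thousands separators stripped. A minimal sketch (the helper name to_float is mine):

```python
def to_float(price: str) -> float:
    """Convert a scraped price string such as '₹2,199' to a float."""
    return float(price.replace('₹', '').replace(',', '').strip())


# With the pair-joined output from the list-comprehension approach above:
original_price = ['₹', '2,199', '₹', '999']
joined = [original_price[i] + original_price[i + 1]
          for i in range(0, len(original_price), 2)]
prices = [to_float(p) for p in joined]
print(prices)  # [2199.0, 999.0]
```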