How to export scraped data using FEEDS/FEED EXPORTS in scrapy
I am new to web scraping / Scrapy and Python.
Scrapy version: Scrapy 2.5.1
OS: Windows
IDE: PyCharm
I am trying to use the FEEDS option in Scrapy to automatically export the scraped data from the website to a CSV file I can open in Excel.
I tried the solution below, but it did not work. I am not sure what I am doing wrong, or whether I am missing something.
I also tried adding the same setting to my settings.py file and commenting out custom_settings in my spider class, following the example given in the docs: https://docs.scrapy.org/en/latest/topics/feed-exports.html?highlight=feed#feeds
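In settings.py that looked roughly like this (the same FEEDS dict as in custom_settings below, just moved into the settings module; shown here as a sketch):

    # settings.py
    FEEDS = {
        r"C:\Users\rreddy\PycharmProjects\fcdc\webscrp\outputfinal.csv": {"format": "csv", "overwrite": True},
    }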
Right now I meet my requirement by writing the data to CSV from the spider_closed signal, storing all the scraped item data in an array named result.
class SpiderFC(scrapy.Spider):
    name = "FC"
    start_urls = [
        url,
    ]
    custom_setting = {"FEEDS": {r"C:\Users\rreddy\PycharmProjects\fcdc\webscrp\outputfinal.csv": {"format": "csv", "overwrite": True}}}

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SpiderFC, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def __init__(self, name=None):
        super().__init__(name)
        self.count = None

    def parse(self, response, **kwargs):
        # each item scraped from the parent page has a link where the actual data needs to be
        # scraped, so I follow each link and scrape the data there
        yield response.follow(notice_href_follow, callback=self.parse_item,
                              meta={'item': item, 'index': index, 'next_page': next_page})

    def parse_item(self, response):
        # logic for the items to scrape goes here
        # they are saved to a temp list, appended to the result array, and the temp list is cleared
        result.append(it)  # result data is used at the end to write to csv
        item.clear()
        if next_page:
            yield next(self.follow_next(response, next_page))

    def follow_next(self, response, next_page):
        next_page_url = urljoin(url, next_page[0])
        yield response.follow(next_page_url, callback=self.parse)
Spider closed signal:
def spider_closed(self, spider):
    with open(output_path, mode="a", newline='') as f:
        writer = csv.writer(f)
        for v in result:
            writer.writerow([v["city"]])
When all the data has been scraped and all requests have finished, the spider_closed signal writes the data to the CSV. But I am trying to avoid that logic and use Scrapy's built-in exporters instead, and that is where I am having trouble exporting the data.
Check your path. If you are on Windows, provide the full path in custom_settings, for example as below:
custom_settings = {
    "FEEDS": {r"C:\Users\Name\Path\To\outputfinal.csv": {"format": "csv", "overwrite": True}}
}
If you are on Linux or macOS, provide the path like this:
custom_settings = {
    "FEEDS": {r"/Path/to/folder/fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}}
}
Or provide the relative path below, which will create the fcdc >> webscrp >> outputfinal.csv folder structure in the directory the spider is run from:
custom_settings = {
    "FEEDS": {r"./fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}}
}
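For reference, a minimal self-contained sketch of a spider that relies only on the built-in feed exporter; the URL, selector and field name below are placeholders, not the original site:

    import scrapy

    class MinimalFeedSpider(scrapy.Spider):
        name = "minimal_feed"
        start_urls = ["https://example.com"]  # placeholder URL
        custom_settings = {
            "FEEDS": {
                # relative path: the folder structure is created under the directory scrapy is run from
                r"./fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True},
            },
        }

        def parse(self, response, **kwargs):
            # yield items instead of collecting them in an external list;
            # the feed exporter serializes every yielded item into the configured CSV
            yield {"city": response.css("title::text").get()}  # placeholder selector

Running it (for example with scrapy crawl minimal_feed from the project root, or scrapy runspider on the file) creates ./fcdc/webscrp/outputfinal.csv under that directory.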