Scrapy -- Cannot import the items into my spider (No module named behance.items)
I am new to Scrapy. This is my spider for crawling behance:
import scrapy
from scrapy.selector import Selector
from behance.items import BehanceItem
from selenium import webdriver
from scrapy.http import TextResponse
from scrapy.crawler import CrawlerProcess

class DmozSpider(scrapy.Spider):
    name = "behance"
    #allowed_domains = ["behance.com"]
    start_urls = [
        "https://www.behance.net/gallery/29535305/Mind-Your-Monsters",
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        item = BehanceItem()
        hxs = Selector(response)
        item['link'] = response.xpath("//div[@class='js-project-module-image-hd project-module module image project-module-image']/@data-hd-src").extract()
        yield item

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(DmozSpider)
process.start()
When I run my crawler, the command line shows the following error:

Traceback (most recent call last):
  File "/home/davy/behance/behance/spiders/behance_spider.py", line 3, in <module>
    from behance.items import BehanceItem
ImportError: No module named behance.items
My directory structure:
behance/
├── behance
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── behance_spider.py
└── scrapy.cfg
Try controlling your spider with this command, run from the project root (the directory containing scrapy.cfg):

scrapy crawl behance

Or change your spider file:
import scrapy
from scrapy.selector import Selector
from behance.items import BehanceItem
from selenium import webdriver
from scrapy.http import TextResponse
from scrapy.crawler import CrawlerProcess

class BehanceSpider(scrapy.Spider):
    name = "behance"
    allowed_domains = ["behance.com"]
    start_urls = [
        "https://www.behance.net/gallery/29535305/Mind-Your-Monsters",
    ]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        response = TextResponse(url=response.url, body=self.driver.page_source, encoding='utf-8')
        item = BehanceItem()
        hxs = Selector(response)
        item['link'] = response.xpath("//div[@class='js-project-module-image-hd project-module module image project-module-image']/@data-hd-src").extract()
        yield item
And create another Python file, run.py, in the directory where your settings.py file lives:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
process.crawl("behance")
process.start()
Now run this file the way you would run a normal Python script: python run.py
Alternatively, you can add the project directory to your Python path:
export PYTHONPATH=$PYTHONPATH:/home/davy/behance/
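That export works because Python can only resolve from behance.items import ... when the outer behance/ directory is on sys.path. The mechanism can be sketched with a throwaway package built in a temporary directory (not the real project):

```python
import os
import sys
import tempfile

# Build a throwaway package shaped like the project: <root>/behance/items.py
root = tempfile.mkdtemp()
pkg = os.path.join(root, "behance")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "items.py"), "w") as f:
    f.write("class BehanceItem:\n    pass\n")

# Equivalent of `export PYTHONPATH=...`: put the project root on sys.path.
# Without this line, the import below raises ImportError.
sys.path.insert(0, root)

from behance.items import BehanceItem  # now resolves
print(BehanceItem.__name__)  # BehanceItem
```

Running the spider file directly (python behance_spider.py) fails for exactly this reason: the interpreter puts the spiders/ directory on sys.path, not the project root, so the behance package is never found.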