Simple Python web crawler

Question

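I have been looking for a simple, working web crawler (also called a spider), but I have not been able to find one. Can anyone help me with this? I want it to simply collect all the links from a specified URL, and then all the links from every URL it scans.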

Answer 1

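You can use the scrapy module.

Alternatively, you can write your own crawler by combining a module for fetching data (e.g. requests, urllib2, or selenium) with an HTML parser (BeautifulSoup, or selenium's built-in parser).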

Answer 2

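I have not built a web crawler before, but I imagine it should not be too complicated. Here are some resources you can use.

I do not know of a module that simply fetches all the links for you, so you will probably have to do that part yourself.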

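First, get the HTML from your link with urllib2 (urllib.request in Python 3). Then, parse the HTML and find the links with BeautifulSoup. The BeautifulSoup documentation even has a section that describes how to get all the links from a web page.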
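The fetch-and-parse step could look roughly like the sketch below. To keep it free of third-party installs it swaps BeautifulSoup for the standard library's `html.parser`, and uses `urllib.request` (the Python 3 successor to urllib2). The names `LinkCollector`, `extract_links`, and `fetch_links` are made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns relative paths such as "/about" into full URLs
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def fetch_links(url):
    """Download a page (network call) and return the links it contains."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_links(html, url)
```

Resolving each href with `urljoin` rather than prepending the site name by hand means absolute links to other sites are left as they are.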

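That is really all the "difficult" code. You can then append all the links you found to a list, iterate over each one, repeat the same process, and add the resulting links to the list again, recursing for as long as you like. That should give you a basic web crawler.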
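As a rough sketch of that loop: the version below uses a queue instead of literal recursion (so deep sites do not hit Python's recursion limit) and a set of already-seen URLs so the same page is not visited twice. `crawl`, `get_links`, and `max_pages` are hypothetical names; `get_links` stands for any function that takes a URL and returns the links found on that page.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=50):
    """Breadth-first crawl; returns the URLs visited, in visiting order."""
    seen = {start_url}           # every URL ever queued, to avoid repeats
    queue = deque([start_url])   # URLs still waiting to be visited
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        try:
            links = get_links(url)
        except Exception:
            # a page that fails to load should not stop the whole crawl
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Usage with a small in-memory "site" instead of real HTTP requests
fake_site = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
}
print(crawl("a", lambda url: fake_site[url]))  # ['a', 'b', 'c', 'd']
```

The `max_pages` cap is what keeps "for as long as you like" from meaning "forever" on a real site.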

Answer 3

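You can use this as a web crawler, but I am not sure it works, because it gives me some errors. You may have a different Python installation where it runs fine.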

import requests
from bs4 import BeautifulSoup


def trade_spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://buckysroom.org/trade/search.php?page=" + str(page)
        source_code = requests.get(url)
        # just get the page body as text, without the response headers
        plain_text = source_code.text
        # a BeautifulSoup object makes the HTML easy to search
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.find_all('a', {'class': 'item-name'}):
            href = "https://buckysroom.org" + link.get('href')
            title = link.string  # just the text, not the HTML
            print(href)
            print(title)
            # get_single_item_data(href)
        page += 1


def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    # if you want to gather information from that page
    for item_name in soup.find_all('div', {'class': 'i-name'}):
        print(item_name.string)
    # if you want to gather links for a web crawler
    for link in soup.find_all('a'):
        href = link.get('href')
        if href:  # skip <a> tags that have no href attribute
            print("https://buckysroom.org" + href)


trade_spider(1)
