Unable to save data in datastore but no errors

Question

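I am building a web crawler. Some of the data I put into the datastore gets saved and some does not, and I have no idea what the problem is.

Here is my crawler class: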

class Crawler(object):

    def get_page(self, url):
        try:
            req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"}) #  yessss!!! with the header, I am able to download pages
            #response = urlfetch.fetch(url, method='GET')
            #return response.content
        #except urlfetch.InvalidURLError as iu:
         #   return iu.message
            response = urllib2.urlopen(req)
            return response.read()

        except urllib2.HTTPError as e:
            return e.reason


    def get_all_links(self, page):
         return re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',page)


    def union(self, lyst1, lyst2):
        try:
            for elmt in lyst2:
                if elmt not in lyst1:
                    lyst1.append(elmt)
            return lyst1
        except e:
            return e.reason

#function that  crawls the web for links starting from the seed
#returns a dictionary of index and graph
    def crawl_web(self, seed="http://tonaton.com/"):
        query = Listings.query() #create a listings object from storage
        if query.get():
            objListing = query.get()
        else:
            objListing = Listings()
            objListing.toCrawl = [seed]
            objListing.Crawled = []

        start_time = datetime.datetime.now()
        while datetime.datetime.now()-start_time < datetime.timedelta(0,5):#tocrawl (to crawl can take forever)
            try:
                #while True:
                page = objListing.toCrawl.pop()

                if page not in objListing.Crawled:
                    content = self.get_page(page)
                    add_page_to_index(page, content)
                    outlinks = self.get_all_links(content)
                    graph = Graph() #create a graph object with the url
                    graph.url = page
                    graph.links = outlinks #save all outlinks as the value part of the graph url
                    graph.put()

                    self.union(objListing.toCrawl, outlinks)
                    objListing.Crawled.append(page)
            except:
                return False

        objListing.put() #save to database
        return True #return true if it works

The classes that define the various ndb models are in this Python module:

import os
import urllib
from google.appengine.ext import ndb
import webapp2

class Listings(ndb.Model):
    toCrawl = ndb.StringProperty(repeated=True)
    Crawled = ndb.StringProperty(repeated=True)

#let's see how this works

class Index(ndb.Model):
    keyword = ndb.StringProperty() # keyword part of the index
    url = ndb.StringProperty(repeated=True) # value part of the index

#class Links(ndb.Model):
 #   links = ndb.JsonProperty(indexed=True)

class Graph(ndb.Model):
    url = ndb.StringProperty()
    links = ndb.StringProperty(repeated=True)

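My code worked fine when I used a JsonProperty instead of StringProperty(repeated=True). But JsonProperty is limited to 1500 bytes, so I ran into an error at one point.

Now, when I run the crawl_web member function, it actually crawls, but when I check the datastore, only the Index entity has been created. No Graph, no Listings. Please help. Thanks.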

Answer 1

Putting your code together, adding the missing imports, and logging the exception eventually shows the first killer problem:

Exception Indexed value links must be at most 500 characters
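The question's crawl_web hides every failure behind a bare `except:` that returns False, which is why nothing seemed to go wrong. A minimal sketch of the logging step, assuming the standard `logging` module; `run_step` and `failing_put` are hypothetical names used only for illustration, and `failing_put` stands in for the `graph.put()` call:

```python
import logging


def run_step(step):
    """Run one crawl step and log the full traceback if it fails."""
    try:
        step()
        return True
    except Exception:
        # logging.exception records the message plus the traceback,
        # so the real cause is visible instead of a silent False.
        logging.exception("crawl step failed")
        return False


def failing_put():
    # Hypothetical stand-in for graph.put(); it raises the message reported above.
    raise ValueError("Indexed value links must be at most 500 characters")


print(run_step(failing_put))
```

The function still returns False on failure, as the original loop did, but the log now shows which call raised and why.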

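Indeed, after adding logging of outlinks, it is easy to see that several of them are far longer than 500 characters, so they cannot be items in an indexed property such as a StringProperty. After changing each repeated StringProperty to a repeated TextProperty (which is not indexed and therefore has no 500-characters-per-item limit), the code runs for a while (creating a few instances of Graph) but eventually dies with: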

An error occured while connecting to the server: Unable to fetch URL: https://sb':'http://b')+'.scorecardresearch.com/beacon.js';document.getElementsByTagName('head')[0].appendChild(s); Error: [Errno 8] nodename nor servname provided, or not known

Indeed, it is obvious that the purported "link" is actually a chunk of JavaScript, and as such it cannot be fetched.
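Going back to the first failure, the oversized items can be spotted before anything is written to the datastore. The sketch below is plain Python with no App Engine dependency; `split_by_length` is a hypothetical helper, and the 500-character limit is taken from the exception message rather than from any SDK constant:

```python
import logging

# Limit quoted in the exception message above (an assumption, not an SDK constant).
MAX_INDEXED_CHARS = 500


def split_by_length(links, limit=MAX_INDEXED_CHARS):
    """Return (short, too_long): links that fit the limit and links that exceed it."""
    short = [link for link in links if len(link) <= limit]
    too_long = [link for link in links if len(link) > limit]
    for link in too_long:
        # Log only the start of each offender to keep the log readable.
        logging.warning("link of %d chars exceeds %d: %s...", len(link), limit, link[:60])
    return short, too_long


links = ["http://example.com/", "http://example.com/" + "x" * 600]
short, too_long = split_by_length(links)
print(len(short), len(too_long))
```

A check like this only confirms the diagnosis. Links that long are usually not links at all, so the extraction step still has to be fixed.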

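So, essentially, the core bug in your code has nothing to do with App Engine at all. The problem is that your regular expression:

'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

does not correctly extract outgoing links from a web page that contains JavaScript as well as HTML.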
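The over-matching can be reproduced offline. In the sketch below the pattern is copied from the question, while the page fragment is made up for illustration (loosely modelled on the failing URL above, with example.com as a placeholder host). The range `$-_` inside `[$-_@.&+]` covers quotes, colons, semicolons, parentheses and angle brackets, so the match runs on through the JavaScript until it meets a character outside the class:

```python
import re

# Pattern copied from the question, written as a raw string.
LINK_PATTERN = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical page fragment: an inline script that builds a URL from two pieces.
SNIPPET = (
    "<script>s.src=(secure?'https://sb':'http://b')"
    "+'.example.com/beacon.js';document.body.appendChild(s);</script>"
)

matches = re.findall(LINK_PATTERN, SNIPPET)
for match in matches:
    # Each "link" printed here is really a run of JavaScript source.
    print(len(match), match)
```

The whole tail of the script, including the closing tag, comes back as a single "URL", which is the kind of value that then fails to fetch.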

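Your code has many other problems, but so far they only slow the code down or make it harder to understand, rather than killing it. What actually kills it is using a regular expression pattern to try to extract links from the page.

See "retrieve links from web page using python and BeautifulSoup": most answers there suggest using BeautifulSoup to extract links from a page, which may be a problem on App Engine, but one answer shows how to do it with just Python and regular expressions.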
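As a third option, not taken from the linked answers, the standard library's HTML parser can collect links without BeautifulSoup and without a regex. This is a rough sketch under stated assumptions: `LinkCollector` and `extract_links` are hypothetical names, only absolute http(s) `href` values on `<a>` tags are kept, and the import fallback covers the Python 2 runtime used in the question:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2, as in the question


class LinkCollector(HTMLParser):
    """Collect absolute http(s) href values from <a> tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.append(value)


def extract_links(html_text):
    """Return the list of absolute links found in anchor tags of html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.links


page = (
    '<a href="http://example.com/a">A</a>'
    '<script>var u = "https://sb" + ".example.com/beacon.js";</script>'
    '<a href="/relative">R</a>'
)
print(extract_links(page))
```

Because the parser treats the contents of `<script>` as opaque text, URL fragments inside JavaScript never reach `handle_starttag`. Relative links are dropped here for brevity; a real crawler would probably resolve them against the page URL.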

python

google-app-engine

google-cloud-datastore