Simple spider with BS4 mysteriously doubles every page

Question

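So I wrote a very simple one-level-deep spider with BS4. The target is those annoying html-books-online formats (as in documentation), where there is a table-of-contents page and all the content lives on pages linked from that main TOC page. Assume all the content is vanilla HTML. The goal is to save that kind of thing for offline reading. So the technique is simply to build a list of unique links off the main page, scrape the content from each link, and concatenate the whole thing into one big HTML page that can be read offline at leisure.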
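The link-collection step described above could be sketched roughly as follows, assuming Python 3 and its standard urllib.parse module. The helper name unique_links and the sample hrefs are hypothetical and only illustrate the idea; this is not the code from this post.

```python
from urllib.parse import urljoin, urldefrag

def unique_links(base_url, hrefs):
    """Resolve hrefs against base_url, drop #fragments and mailto: links,
    and keep only the first occurrence of each resulting URL."""
    cleaned = []
    for href in hrefs:
        if href.startswith("mailto:"):
            continue
        absolute = urljoin(base_url, href)
        # urldefrag() splits off the "#..." part; .url is what remains.
        cleaned.append(urldefrag(absolute).url)
    # dict.fromkeys() keeps insertion order (Python 3.7+), so this
    # deduplicates while preserving the order of first appearance.
    return list(dict.fromkeys(cleaned))

# Made-up sample hrefs, as they might appear on a table-of-contents page.
base = "http://www.tldp.org/LDP/Bash-Beginners-Guide/html/"
hrefs = [
    "chap_01.html",
    "chap_01.html#sect_01",
    "mailto:someone@example.com",
    "chap_02.html",
]
links = unique_links(base, hrefs)
print(links)
```

Using urljoin() rather than plain string concatenation is a design choice here: it also copes with absolute or parent-relative hrefs, which the concatenation approach would mangle.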

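It works perfectly except for one maddening little bug: in the final HTML file, every subpage appears twice. Exactly twice. I've been testing it on this bash scripting tutorial, http://www.tldp.org/LDP/Bash-Beginners-Guide/html/ (I don't own the rights to it, by the way, though the copyright terms permit reproduction, so please don't test it on anything in a way that hammers the server or is otherwise impolite).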

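Things I have checked:

I have confirmed that the underlying pages themselves do not contain duplicated content hidden in them.
I have verified that uniques really does contain a list of unique links.
I have verified that len(texts) == len(uniques) + 1, as expected.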

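This is really starting to stump me now. It's probably some obvious dumb mistake, but I just can't see it and it's driving me crazy. Can anyone see what's going wrong here? Thanks!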

from bs4 import BeautifulSoup as BS
import urllib 

def clearJunk(BSobj):
    [s.extract() for s in BSobj(['style', 'script'])]

def makeSoup(url):
    r = urllib.urlopen(url)
    soup = BS(r)
    clearJunk(soup)
    return soup

def getBody(BSobj):
    return ' '.join([str(i) for i in BSobj.find('body').findChildren()])

def stripAnchor(url):
    badness = url.find('#')
    if badness != -1:
        return url[:badness]
    return url

url = raw_input('URL to crawl: ')
soup = makeSoup(url)

links = filter(lambda x: 'mailto:' not in x, [url + stripAnchor(alink['href']) for alink in soup.find_all('a', href=True)])
uniques = [s for (i,s) in enumerate(links) if s not in links[0:i]]

texts = [getBody(makeSoup(aurl)) for aurl in uniques]
texts.insert(0, getBody(soup))
from time import gmtime, strftime
filename = 'scrape' + str(strftime("%Y%m%d%H%M%S", gmtime())) + '.html'
with open(filename, 'w') as outfile:
    outfile.write('<br><br>'.join(texts))

print 'scraping complete!'

Answer 1

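The problem is in how you look up the children of the body tag in getBody(). By default, findChildren() works recursively, so it returns every descendant tag and not just the direct children. For example: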

from bs4 import BeautifulSoup

data = """
<body>
    <div>
        <b>test1</b>
    </div>
    <div>
        <b>test2</b>
    </div>
</body>
"""

soup = BeautifulSoup(data, "html.parser")
for item in soup.find('body').findChildren():
    print(item)

It prints:

<div>
<b>test1</b>
</div>
<b>test1</b>
<div>
<b>test2</b>
</div>
<b>test2</b>

Note that test1 and test2 each appear twice: once inside their parent div and once on their own.
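To make the effect on the joined string concrete, here is a small sketch that counts the occurrences. It assumes Python 3 and the html.parser backend; the markup and variable names are made up for illustration. It calls find_all() with no arguments, which is what the older findChildren() name is an alias for.

```python
from bs4 import BeautifulSoup

html = "<body><div><b>test1</b></div><div><b>test2</b></div></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.find("body")

# With no arguments, find_all() returns every descendant tag, so each
# <b> is listed on its own and is also rendered inside its parent <div>.
descendants = body.find_all()
joined = " ".join(str(tag) for tag in descendants)

print([tag.name for tag in descendants])  # ['div', 'b', 'div', 'b']
print(joined.count("test1"))              # 2
```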

I think you meant to use find_all() with recursive=False:

' '.join([str(i) for i in BSobj.find('body').find_all(recursive=False)])

Here is the output for the sample HTML above:

>>> for item in soup.find('body').find_all(recursive=False):
...     print(item)
... 
<div>
<b>test1</b>
</div>
<div>
<b>test2</b>
</div>
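One caveat: find_all(recursive=False) returns only tags, so any bare text sitting directly under body would be dropped. If that matters for your pages, a possible alternative is Tag.decode_contents(), which renders everything inside a tag as a single string. The sketch below assumes Python 3 and the html.parser backend; get_body is a hypothetical replacement for getBody(), and the markup is made up.

```python
from bs4 import BeautifulSoup

def get_body(soup):
    """Return the inner HTML of <body>, rendering each node exactly once."""
    body = soup.find("body")
    # Fall back to the whole document if there is no <body> tag.
    target = body if body is not None else soup
    return target.decode_contents()

html = "<body>loose text<div><b>test1</b></div></body>"
soup = BeautifulSoup(html, "html.parser")

inner = get_body(soup)
tags_only = " ".join(
    str(tag) for tag in soup.find("body").find_all(recursive=False)
)

print(inner)
print(tags_only)
```

Both variants contain test1 only once; the difference is that inner keeps the loose text node while tags_only does not.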

python

beautifulsoup

web-crawler