Web scraping every forum post (Python, BeautifulSoup)

Question

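Hello, fellow Stack Overflow users. Briefly: I am using Python to scrape data from an automotive forum and save it all to a CSV file. With help from other Stack Overflow members, I managed to crawl every page of a given topic and collect the date, title and link of each post.

I also have a separate script that I am now struggling to integrate: for every link found, Python should create a new soup for it, scrape all the posts, and then go back to the previous link.

Any other tips or suggestions on how to make this better are very welcome, since this is my first time using Python. I suspect my nested loop logic is what's broken, but I have checked it several times and it looks right to me.

Here is the code snippet: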

        link += (div.get('href'))
        savedData += "\n" + title + ", " + link
        tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
        while tempNumber < 3:
            for tempRow in tempSoup.find_all(id=re.compile("^td_post_")):
                for tempNext in tempSoup.find_all(title=re.compile("^Next Page -")):
                    tempNextPage = ""
                    tempNextPage += (tempNext.get('href'))
                post = ""
                post += tempRow.get_text(strip=True)
                postData += post + "\n"
            tempNumber += 1
            tempNewUrl = "http://www.automotiveforums.com/vbulletin/" + tempNextPage
            tempSoup = make_soup(tempNewUrl)
            print(tempNewUrl)
    tempNumber = 1
    number += 1
    print(number)
    newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
    soup = make_soup(newUrl)

So far, my main problem is that tempSoup = make_soup('http://www.automotiveforums.com/vbulletin/' + link) does not seem to create a new soup after all the posts of a forum thread have been scraped.

This is the output I get:

    http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=2
    http://www.automotiveforums.com/vbulletin/showthread.php?s=6a2caa2b46531be10e8b1c4acb848776&t=1139532&page=3
    1

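So it does seem to find the correct links for the new pages and scrape them, but on the next iteration it prints the new dates and exactly the same pages. There is also a really strange 10-12 second delay after the last link is printed; only then does it jump to printing the number 1 and spit out all the new dates.

But after moving on to the next forum thread link, it scrapes the same data every time.

Sorry if it looks really messy. It is something of a side project and my first attempt at doing something useful, so I am very new to this, and any advice or tips would be much appreciated. I am not asking you to solve the code for me; even some pointers on where my logic might be wrong would be greatly appreciated!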

Answer 1

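So after spending some more time on it, I have almost cracked it. Now Python finds every thread on the forum along with its link, then goes into each link, reads all the pages and moves on to the next link.

Here is the fixed code, in case anyone wants to use it.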

        link += (div.get('href'))
        savedData += "\n" + title + ", " + link
        soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + link)
        while tempNumber < 4:
            for postScrape in soup3.find_all(id=re.compile("^td_post_")):
                post = ""
                post += postScrape.get_text(strip=True)
                postData += post + "\n"
                print(post)
            for tempNext in soup3.find_all(title=re.compile("^Next Page -")):
                tempNextPage = ""
                tempNextPage += (tempNext.get('href'))
                print(tempNextPage)
            soup3 = ""
            soup3 = make_soup('http://www.automotiveforums.com/vbulletin/' + tempNextPage)
            tempNumber += 1
        tempNumber = 1
    number += 1
    print(number)
    newUrl = "http://www.automotiveforums.com/vbulletin/" + nextPage
    soup = make_soup(newUrl)

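All I had to do was split the 2 for loops that were nested inside each other into separate loops. It is still not a perfect solution, but hey, it almost works.

The part that doesn't work: the first 2 thread links have multiple pages of posts, but the following 10+ threads don't. I can't figure out how to check, outside the loop, whether for tempNext in soup3.find_all(title=re.compile("^Next Page -")): found anything. If it finds no next-page element/href, the code just reuses the last value of tempNextPage. But if I reset the value after each run, it no longer crawls every page =l One solution that just created another problem :D.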
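One possible way around the missing "Next Page" case, offered as a sketch rather than a fix tested against this forum: call find() instead of looping over find_all(). find() returns None when no matching element exists, so the check can happen right where the link is looked up and the loop can stop on the last page of a thread. The names scrape_thread and max_pages are made up for illustration, and make_soup is passed in as a parameter because the asker's own helper is not shown in the snippet.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.automotiveforums.com/vbulletin/"
POST_ID = re.compile(r"^td_post_")
NEXT_TITLE = re.compile(r"^Next Page -")


def scrape_thread(first_url, make_soup, max_pages=50):
    """Return the text of every post in one thread, following "Next Page" links.

    make_soup is assumed to be a callable that takes a URL and returns a
    BeautifulSoup object; max_pages is a hypothetical safety cap.
    """
    posts = []
    url = first_url
    for _ in range(max_pages):
        soup = make_soup(url)
        # Collect every post cell on the current page.
        for cell in soup.find_all(id=POST_ID):
            posts.append(cell.get_text(strip=True))
        # find() gives None when the page has no "Next Page" link,
        # so a stale link from an earlier page can never be reused.
        next_link = soup.find(title=NEXT_TITLE)
        if next_link is None or not next_link.get("href"):
            break
        url = urljoin(BASE_URL, next_link["href"])
    return posts
```

Called once per thread link, for example postData += "\n".join(scrape_thread(BASE_URL + link, make_soup)), this would replace the while tempNumber < 4 loop, and tempNextPage no longer needs to be reset by hand.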

Answer 2

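Many thanks, dear Norbis, for sharing your ideas, insights and concepts.

Since you only provided a snippet, I will just try to offer an approach that shows how to log in to phpBB using a payload: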

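    import requests

    # Assumption: forum holds the board's base URL, ending with a slash
    forum = "the forum name"

    headers = {'User-Agent': 'Mozilla/5.0'}
    # Form fields posted to the phpBB login form
    payload = {'username': 'username', 'password': 'password', 'redirect': 'index.php', 'sid': '', 'login': 'Login'}
    # The session keeps the login cookies for later requests
    session = requests.Session()

    r = session.post(forum + "ucp.php?mode=login", headers=headers, data=payload)
    print(r.text)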

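But wait: instead of driving the site with requests, we can also use a browser automation library such as mechanize. That way we don't have to manage the session ourselves, and each request takes only a few lines of code.

An interesting example is on GitHub: https://github.com/winny-/sirsi/blob/317928f23847f4fe85e2428598fbe44c4dae2352/sirsi/sirsi.py#L74-L211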

Tags: python, nested-loops, web-scraping, pycharm