Why is BeautifulSoup losing so much content from this webpage?

Question

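I have a web scraper that I built a year ago. I need to use it again, but bs4 seems to behave differently now. It used to return a soup object containing the whole webpage, but now it stops in the middle of a list. I need all of the list items, so this breaks my old code.

I have already looked for similar Beautiful Soup problems. One person here did have a similar issue, but the solution (selecting a specific div element) does not work for me, because I need the content of the whole page in order to scrape all of the URLs.

Here is the code I am using: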

import requests
from bs4 import BeautifulSoup


def siteopen(url):
    web_source = url
    source_code = requests.get(web_source)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    return soup


print(siteopen('http://celt.ucc.ie/irlpage.html'))

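plain_text contains all of the HTML I want to scrape, but the soup object does not contain all of it. It stops shortly after a list item, apparently at random.

I am running this code in PyCharm Community Edition. Perhaps some size limit is set there? Otherwise, how can I fix this and access the full soup object?

Edit:

Since others have run this successfully on Linux and in PyCharm Pro, I tried running it from a terminal on macOS, and the problem was reproduced there. As with the problem in PyCharm, the output was this: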

Answer 1

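I can see all of the data for that request without any problem. Perhaps PyCharm is limiting the amount of text it will display for a single print.

You can verify this by running: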

import requests
from bs4 import BeautifulSoup


def siteopen(url):
    web_source = url
    source_code = requests.get(web_source, verify=False)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "lxml")
    return soup


print("</html>" in str(siteopen('http://celt.ucc.ie/irlpage.html')))

If this prints True, you know the whole page was pulled.
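A serialized soup normally closes any tags it opened, so the "</html>" check may still pass when list items are missing. A stricter check, sketched below, compares how many opening tags appear in the raw text with how many elements the parser actually kept. The helper names raw_tag_count and parsed_tag_count and the sample string are hypothetical, not part of the original code. The default parser here is "html.parser" so the sketch runs without lxml; pass "lxml" to match the question.

```python
import re

from bs4 import BeautifulSoup


def raw_tag_count(html, tag):
    """Roughly count opening <tag> occurrences in the raw markup (regex, not a parser)."""
    pattern = r"<" + re.escape(tag) + r"\b"
    return len(re.findall(pattern, html, flags=re.IGNORECASE))


def parsed_tag_count(html, tag, parser="html.parser"):
    """Count <tag> elements that actually made it into the parsed tree."""
    return len(BeautifulSoup(html, parser).find_all(tag))


# Hypothetical stand-in for source_code.text from the question.
sample = "<html><body><ul><li>a</li><li>b</li><li>c</li></ul></body></html>"

print(raw_tag_count(sample, "li"))     # count in the raw text
print(parsed_tag_count(sample, "li"))  # count after parsing
```

If the parsed count is lower than the raw count with "lxml" but matches with "html.parser" or "html5lib" (when installed), that would point at how the parser handles the page's markup rather than at PyCharm. The regex count is only approximate, since it also counts tags inside comments or scripts.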

It is also worth checking whether PyCharm's output buffer limit can be increased.
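Another way to rule out a console display limit, as a rough sketch, is to write the text to a file instead of printing it and compare character counts. The helper save_text and the file name soup_dump.html are hypothetical names chosen for illustration.

```python
import tempfile
from pathlib import Path


def save_text(text, path):
    """Write text to path as UTF-8 and return the number of characters written."""
    Path(path).write_text(text, encoding="utf-8")
    return len(text)


if __name__ == "__main__":
    # Hypothetical stand-in for str(soup) or plain_text from the code above.
    sample = "<html><body><p>hello</p></body></html>"
    out_path = Path(tempfile.gettempdir()) / "soup_dump.html"
    print(save_text(sample, out_path), "characters written to", out_path)
```

Calling save_text once with plain_text and once with str(soup) gives two counts and two files to compare. If the soup file itself is cut short, the console is not the cause.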

Tags: python, lxml, beautifulsoup, pycharm