BeautifulSoup only outputs data sometimes?
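I'm scraping the links to all posts on this subreddit (specifically the top posts of the past 24 hours).
But when I run my program, it sometimes outputs all the data and other times outputs nothing at all. Exact same code. It works about 1 run in 5.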
import re

import requests
from bs4 import BeautifulSoup

# URL of subreddit
test = requests.get('https://www.reddit.com/r/TikTokCringe/top/')
# the html of the request
html = test.text
# making a soup of the html
soup = BeautifulSoup(html, 'html.parser')
# find_all finds the first 30 <a> elements whose href matches '/r/TikTokCringe/comments'
for href in soup.find_all('a', {"href": re.compile('/r/TikTokCringe/comments/*')})[:30]:
    # I'm looping through every element because I eventually want just the links
    # for now I'm just trying to print every element
    print(href)
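You are getting HTTP error 429 - Too Many Requests. The error page contains none of the post links, which is why nothing is printed on those runs. Try slowing down or setting the User-Agent HTTP header: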
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:99.0) Gecko/20100101 Firefox/99.0"
}
# URL of subreddit
test = requests.get("https://reddit.com/r/TikTokCringe/top/", headers=headers)
...
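If you also want to "slow down" automatically, here is a minimal sketch of retrying on 429 with an increasing wait. The helper name `fetch_with_backoff` and its parameters (`max_tries`, `base_delay`) are my own illustration, not part of requests; it only assumes the object returned by `fetch()` has a `status_code` attribute, as a `requests.Response` does.

```python
import time


def fetch_with_backoff(fetch, max_tries=5, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until the response is not HTTP 429 or tries run out.

    fetch is any zero-argument callable returning an object with a
    status_code attribute (hypothetical helper, names are assumptions).
    """
    response = None
    for attempt in range(max_tries):
        response = fetch()
        if response.status_code != 429:
            # Anything other than "Too Many Requests" is handed back as-is.
            return response
        if attempt < max_tries - 1:
            # Wait 2s, 4s, 8s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after max_tries attempts; return the last response.
    return response
```

With the code above it could be called as `test = fetch_with_backoff(lambda: requests.get("https://reddit.com/r/TikTokCringe/top/", headers=headers))`, and you would then check `test.status_code` before parsing.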
Also: consider using Reddit's JSON format (add .json to the end of the URL):
data = requests.get(
    "https://reddit.com/r/TikTokCringe/top/.json", headers=headers
).json()
print(data)
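Since the goal is just the links, here is a rough sketch of pulling them out of that JSON. It assumes the usual listing layout, where each post sits under `data["data"]["children"]` and carries a `permalink` field; the function name `extract_post_links` and the `base` parameter are made up for illustration, so check them against the `print(data)` output.

```python
def extract_post_links(listing, base="https://www.reddit.com"):
    """Return full post URLs from a Reddit listing dict.

    Assumes the listing looks like {"data": {"children": [{"data": {...}}]}}
    with a relative "permalink" in each child (an assumption, verify it).
    """
    links = []
    for child in listing.get("data", {}).get("children", []):
        permalink = child.get("data", {}).get("permalink")
        if permalink:
            # Permalinks are relative paths, so prepend the site root.
            links.append(base + permalink)
    return links
```

Calling `extract_post_links(data)[:30]` would then give the first 30 post links without any HTML parsing or regex.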