Why am I not getting any output, or an error, when web scraping?

Question

I'm working on a web scraping assignment in Google Colab using BeautifulSoup and requests. For now I'm only scraping the headlines from Google News. Here is the code:

import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
   print(headlines.text)

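The problem is that when I run the cell, it shows neither the output (the list of headlines) nor an error. Please help, this has been bugging me for 2 days.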

Answer 1

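Your function builds soup but never returns it, so return it and assign the result before looping. You probably also need to display the text of the next span element rather than the text of the matched link itself. This can be done as follows: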

import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
       INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    #print(soup.prettify())
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.find_next('span').text)

This would give you output starting:

I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv
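To see why the original loop can run without printing anything or raising an error, here is a minimal offline sketch. The markup in SAMPLE_HTML is a hypothetical stand-in for the downloaded page (the real Google News markup may differ and changes over time), and it assumes the links with class VDXfz carry no text of their own. It uses the built-in "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the downloaded page (an assumption,
# not the real Google News HTML): the classed links are empty and the
# headline text lives in a span that follows them.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/1"></a>
  <h3><a href="./articles/1"><span>First headline</span></a></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/2"></a>
  <h3><a href="./articles/2"><span>Second headline</span></a></h3>
</article>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
links = soup.find_all("a", {"class": "VDXfz"})

# The matched links contain no text, so printing .text shows only blank lines.
anchor_texts = [link.text for link in links]

# find_next('span') walks forward in document order to the span with the title.
span_texts = [link.find_next("span").text for link in links]

# A class that is not in the page gives an empty result, so a for loop over it
# never executes its body: no output and no exception.
missing = soup.find_all("a", {"class": "no-such-class"})

print(anchor_texts)
print(span_texts)
print(len(missing))
```

Either situation (empty link text, or a class name that no longer matches anything) fails silently, which is consistent with the symptom described in the question.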

You could write the headlines to a CSV file using:

import requests
from bs4 import BeautifulSoup
import csv

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT 
       INTO SOMETHING THAT IS EASY TO READ'''

    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Headline'])

    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        headline = headlines.find_next('span').text
        print(headline)
        csv_output.writerow([headline])

At the moment this only produces a single column named Headline.
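If more columns are wanted, the same loop can write extra fields per row. The sketch below adds a hypothetical Link column taken from each link's href attribute. The sample markup is an assumption made for illustration, and the rows go to an in-memory buffer instead of output.csv so the example runs without touching the disk.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup used only for illustration; the real page may differ.
SAMPLE_HTML = """
<a class="VDXfz" href="./articles/1"></a>
<h3><a href="./articles/1"><span>First headline</span></a></h3>
<a class="VDXfz" href="./articles/2"></a>
<h3><a href="./articles/2"><span>Second, with a comma</span></a></h3>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# newline="" leaves line endings to the csv module, as with open(..., newline="").
buffer = io.StringIO(newline="")
csv_output = csv.writer(buffer)
csv_output.writerow(["Headline", "Link"])

for link in soup.find_all("a", {"class": "VDXfz"}):
    headline = link.find_next("span").text
    href = link.get("href", "")  # empty string if the attribute is absent
    csv_output.writerow([headline, href])

# Read the buffer back to check what was written.
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
for row in rows:
    print(row)
```

The csv module quotes the headline containing a comma, so it still comes back as a single field when the data is read again.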

Answer 2

Give the following script a try; it should fetch the results you are after. The script looks cleaner if you use selectors.

However, using .find_all():

import requests
from bs4 import BeautifulSoup

def get_headlines(url):
    request = requests.get(url)
    soup = BeautifulSoup(request.text,"lxml")
    headlines = [item.find_next("span").text for item in soup.find_all("h3")]
    return headlines

if __name__ == '__main__':
    link = 'https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en'
    for titles in get_headlines(link):
        print(titles)

To do the same using .select(), make this change in the script:

headlines = [item.text for item in soup.select("h3 > a > span")]
return headlines
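The two approaches can be compared offline on a small snippet. The markup below is a hypothetical example (not the real Google News page), chosen so that every h3 wraps a link that wraps a span.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
SAMPLE_HTML = """
<h3><a href="./articles/1"><span>First headline</span></a></h3>
<h3><a href="./articles/2"><span>Second headline</span></a></h3>
<h4><a href="./articles/3"><span>Not under an h3</span></a></h4>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# .find_all() version: locate each h3, then step forward to the next span.
via_find_all = [item.find_next("span").text for item in soup.find_all("h3")]

# .select() version: the CSS selector states the h3 > a > span nesting directly.
via_select = [item.text for item in soup.select("h3 > a > span")]

print(via_find_all)
print(via_select)
```

On this snippet both lists are the same. One difference worth knowing: find_next("span") searches forward through the rest of the document, so for an h3 with no span inside it, it would return a later, unrelated span, whereas the selector only matches spans that really are nested as h3 > a > span.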

