Why am I not getting the output nor an error in web scraping?
I am working on a web scraping assignment on Google Colab using BeautifulSoup and requests. Here I am only scraping the headlines from Google News. The code is below:
import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    print(soup.prettify())

beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.text)
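The problem is that when I run the cell, it shows neither the output (the list of headlines) nor an error. Please help; this has been bothering me for 2 days.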
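You probably need to display the text of the next span element. Note also that the function has to return soup so that it can be used outside the function. This can be done as follows: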
import requests
from bs4 import BeautifulSoup

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    #print(soup.prettify())
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')
for headlines in soup.find_all('a', {'class': 'VDXfz'}):
    print(headlines.find_next('span').text)
This would give you output starting:
I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge
Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare
National Conference backed PDP to save J&K from uncertainty: Omar Abdullah
On Ram Madhav ‘instruction from Pak’ barb, Omar Abdullah’s stinging reply
Make public reports of horse-trading in govt formation in J-K: Omar Abdullah to Guv
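To see what find_next('span') does without depending on the live page, here is a minimal offline sketch. The HTML string is made up for illustration (the real Google News markup is more complex and changes over time), and it assumes the anchor carrying the class is empty while the headline sits in a later span. The helper name make_soup is hypothetical, and html.parser is used instead of lxml only so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a class="VDXfz"> link is empty and the
# headline text lives in the <span> that follows it.
SAMPLE_HTML = """
<article>
  <a class="VDXfz" href="./articles/1"></a>
  <h3><a href="./articles/1"><span>First headline</span></a></h3>
</article>
<article>
  <a class="VDXfz" href="./articles/2"></a>
  <h3><a href="./articles/2"><span>Second headline</span></a></h3>
</article>
"""


def make_soup(html):
    """Parse the HTML and return the soup so the caller can use it."""
    return BeautifulSoup(html, "html.parser")


soup = make_soup(SAMPLE_HTML)
links = soup.find_all("a", {"class": "VDXfz"})

# The anchors themselves contain no text, so .text is an empty string.
anchor_texts = [link.text for link in links]

# find_next('span') walks forward in the document to the first <span>
# that comes after each anchor.
headlines = [link.find_next("span").text for link in links]

print(anchor_texts)
print(headlines)
```

With markup shaped like this, printing headlines.text directly shows only blank lines, which can look like no output at all.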
You can write the headlines to a CSV file as follows:
import requests
from bs4 import BeautifulSoup
import csv

def beautiful_soup(url):
    '''DEFINING THE FUNCTION HERE THAT SENDS A REQUEST AND PRETTIFIES THE TEXT
    INTO SOMETHING THAT IS EASY TO READ'''
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    return soup

soup = beautiful_soup('https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en')

with open('output.csv', 'w', newline='', encoding='utf-8') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(['Headline'])

    for headlines in soup.find_all('a', {'class': 'VDXfz'}):
        headline = headlines.find_next('span').text
        print(headline)
        csv_output.writerow([headline])
This currently only generates a single column called Headline.
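Headlines often contain commas and quotation marks, as in the sample output earlier, and csv.writer quotes such fields automatically. The following is a minimal sketch that writes to an in-memory buffer rather than output.csv and reads the rows back; the headline strings are copied from the sample output, and the variable names are only illustrative.

```python
import csv
import io

headlines = [
    'Ram Madhav Backpedals On "Instruction From Pak" After Omar Abdullah Dare',
    "National Conference backed PDP to save J&K from uncertainty: Omar Abdullah",
    "I Take Back My Comment, Says Ram Madhav After Omar Abdullah’s Dare to Prove Pakistan Charge",
]

# An in-memory text buffer stands in for the file opened with newline=''.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Headline"])
for headline in headlines:
    writer.writerow([headline])

# Read the data back to confirm that commas and quotes survive the round trip.
buffer.seek(0)
rows = list(csv.reader(buffer))

print(buffer.getvalue())
print(rows)
```

Each headline stays in a single field even when it contains a comma, so the file still has exactly one column.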
Run the following script and you should get the desired result. The script would be cleaner if you used selectors. However, using .find_all():
import requests
from bs4 import BeautifulSoup

def get_headlines(url):
    request = requests.get(url)
    soup = BeautifulSoup(request.text, "lxml")
    headlines = [item.find_next("span").text for item in soup.find_all("h3")]
    return headlines

if __name__ == '__main__':
    link = 'https://news.google.com/?hl=en-IN&gl=IN&ceid=IN:en'
    for titles in get_headlines(link):
        print(titles)
To do the same using .select(), make this change in the script:
    headlines = [item.text for item in soup.select("h3 > a > span")]
    return headlines
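The two approaches can be compared offline on a small sample. This is only a sketch under assumptions: the HTML below is invented to mimic an h3 > a > span nesting, and html.parser replaces lxml so that no extra parser is required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two headlines under <h3> and one under <h4>.
SAMPLE_HTML = """
<div>
  <h3><a href="./articles/1"><span>First headline</span></a></h3>
  <h3><a href="./articles/2"><span>Second headline</span></a></h3>
  <h4><a href="./articles/3"><span>Not under an h3</span></a></h4>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all + find_next: start at each <h3>, then move forward to the next <span>.
via_find_all = [item.find_next("span").text for item in soup.find_all("h3")]

# select: the CSS child combinators spell out the exact nesting to match.
via_select = [item.text for item in soup.select("h3 > a > span")]

print(via_find_all)
print(via_select)
```

On this sample both lists are identical. One difference to keep in mind is that find_next() searches everything that follows the h3 in the document, so it can pick up a span outside the h3 when the h3 has none, whereas the selector only matches the exact h3 > a > span nesting.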