Scraping a website for data for an AI virtual assistant

Question

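I'm currently building an AI virtual assistant in Python 3.7. If you're not familiar with the term "virtual assistant", some examples are Siri, Google Home, Alexa, and Bixby. The AI I'm developing can open a website that fulfills your command and redirect you to it. For example, if I ask for news, it opens CNN. Here is the code that opens CNN: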

import webbrowser

webbrowser.open("https://www.cnn.com")

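However, I want the AI to print the breaking news from the CNN website instead. In case you're wondering, I've tried something similar with a lottery website: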

import json
import urllib.request
from pprint import pprint

# Download the JSON document and decode it into Python objects.
url = 'http://www.masslottery.com/data/json/games/lottery/recent.json'
with urllib.request.urlopen(url) as websource:
    data = json.loads(websource.read().decode())
pprint(data)

Thanks!

Answer 1

What you're looking for is called an RSS feed. Most news sites have one, so you can easily parse new stories.

For CNN, you can look here: http://www.cnn.com/services/rss/ and pick the RSS feed you want.

Say you want the top stories. You would take this feed, http://rss.cnn.com/rss/cnn_topstories.rss, from the link repository I initially posted, request the data from the page, and parse what you want from it, most likely with the Python BeautifulSoup library. A version 4 tutorial can be found here: https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
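
As a rough sketch of that request-then-parse flow: the example below uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, only so that it runs without third-party packages. The function names parse_headlines and fetch_headlines and the inline SAMPLE_FEED are made up for illustration, and the sketch assumes the feed is ordinary RSS 2.0 with <item> elements that each contain a <title>.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Feed URL taken from the answer above.
CNN_TOP_STORIES = "http://rss.cnn.com/rss/cnn_topstories.rss"


def parse_headlines(rss_xml, limit=5):
    """Return up to `limit` item titles from an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    titles = []
    # In RSS 2.0, each story is an <item> element under <channel>.
    for item in root.iter("item"):
        if len(titles) >= limit:
            break
        title = item.findtext("title")
        if title:
            titles.append(title.strip())
    return titles


def fetch_headlines(url=CNN_TOP_STORIES, limit=5):
    """Download the feed at `url` and return its first `limit` headlines."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_headlines(response.read(), limit)


# Small made-up feed so the parser can be tried without a network connection.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>First story</title></item>
  <item><title>Second story</title></item>
</channel></rss>"""

for headline in parse_headlines(SAMPLE_FEED):
    print(headline)
```

In the assistant, calling fetch_headlines() and printing each returned string would take the place of webbrowser.open. Whether the CNN feed is still served at that URL is not something this sketch checks.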

Answer 2

For Python, you should look at Beautiful Soup, and at Selenium for web automation.

Look into XPath and CSS selectors.

Learn how to use the debugger in your browser, e.g. Chrome DevTools or Firebug...
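
To give a feel for what an XPath expression looks like, here is a minimal sketch using the limited XPath subset in the standard library's xml.etree.ElementTree. The PAGE markup is made up and deliberately well-formed. Real pages usually are not, which is where Beautiful Soup or Selenium come in.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a news page.
PAGE = """<html><body>
  <div class="headline"><a href="/story-1">First story</a></div>
  <div class="sidebar"><a href="/about">About</a></div>
  <div class="headline"><a href="/story-2">Second story</a></div>
</body></html>"""

root = ET.fromstring(PAGE)

# XPath: every <a> directly inside a <div> whose class attribute is "headline".
# A roughly equivalent CSS selector would be: div.headline > a
links = root.findall(".//div[@class='headline']/a")

# Collect the link text and target of each match.
stories = [(a.text, a.get("href")) for a in links]
for text, href in stories:
    print(text, href)
```

The browser's developer tools help you find which element and class names to put in the expression for a real page.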

python

python-3.7