How to read an asp.net page with BeautifulSoup?

Question

I am trying to scrape some data from a web page using Beautiful Soup.

I run into a problem when I try to convert the HTML document into a BeautifulSoup object.

When I run the code

soup = BeautifulSoup(html_doc)

I get this error message:

SyntaxError: Non-ASCII character '\xa9' in file      C:/Users/mlee/PycharmProjects/BsTest/htmlparse.py on line 683, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

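I believe this is because the HTML contains some asp.net viewstate objects that are base64 encoded.

Is there a suggested workaround, or do I have to use another tool?

Also, I am primarily just interested in getting the javascript generated portions of text. Is there a better way of doing this?

Thanks!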

Answer 1

Put this header

#!/usr/bin/env python
# -*- coding: utf-8 -*-

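at the very top of the htmlparse.py file (the coding line must be on line 1 or 2), and make sure PyCharm saves the file with utf-8 encoding.

This has nothing to do with asp/viewstate. Your source file contains non-ASCII (utf) characters, and Python refuses to parse it without an encoding declaration.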
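If the HTML was pasted into htmlparse.py as a string literal (an assumption on my part, the question does not say so), another way to sidestep the error is to keep the markup in its own file and read it at run time. A minimal sketch, where load_html, sample_path and the sample page are hypothetical names made up for illustration:

```python
import io
import os
import tempfile


def load_html(path, encoding="utf-8"):
    """Read an HTML file from disk and return its contents as text."""
    with io.open(path, "r", encoding=encoding) as f:
        return f.read()


# Hypothetical demo page containing a non-ASCII character (the copyright
# sign). It is written as the escape \u00a9 so this source file stays ASCII.
sample = u"<html><body><p>\u00a9 Example</p></body></html>"

# Write the page to a temporary file, then read it back as text.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "page.html")
with io.open(sample_path, "w", encoding="utf-8") as f:
    f.write(sample)

html_doc = load_html(sample_path)
```

The resulting html_doc can then be passed to BeautifulSoup(html_doc, "html.parser") as in the question. The .py file itself no longer contains any non-ASCII bytes, so the encoding declaration stops being an issue for it.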

I am primarily just interested in getting the javascript generated portions of text. Is there a better way of doing this?

You might want to use Selenium WebDriver with the Python bindings for this task. Another option is PhantomJS.
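A rough sketch of the Selenium route, assuming the selenium package and a Firefox driver are installed. fetch_rendered_html and its driver_factory parameter are hypothetical names, not part of Selenium. The parameter only exists so a different browser (or a stub) can be swapped in.

```python
def fetch_rendered_html(url, driver_factory=None):
    """Load url in a browser and return the page source after it has loaded."""
    if driver_factory is None:
        # Imported lazily so this module still loads where Selenium is absent.
        from selenium import webdriver
        driver_factory = webdriver.Firefox

    driver = driver_factory()
    try:
        driver.get(url)            # navigate and wait for the page load event
        return driver.page_source  # serialized DOM as the browser sees it
    finally:
        driver.quit()              # always close the browser, even on error
```

Usage would look like html_doc = fetch_rendered_html(url), after which html_doc can go to BeautifulSoup as before. Content that scripts inject some time after the page load may not be there yet when page_source is read, so an explicit wait may be needed for such pages.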

python

asp.net

beautifulsoup

web-scraping

web