Scraping visible text
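I am an absolute newbie to web scraping, and right now I want to extract the visible text from a web page. I found this piece of code online: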
import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
web_page = urllib2.urlopen(url)
soup = BeautifulSoup(url , "lxml")
print (soup.prettify())
For the code above, I get the following result:
/usr/local/lib/python2.7/site-packages/bs4/__init__.py:282: UserWarning: "http://www.espncricinfo.com/" looks like a URL. Beautiful Soup is not an HTTP client. You should probably use an HTTP client like requests to get the document behind the URL, and feed that document to Beautiful Soup.
' that document to Beautiful Soup.' % decoded_markup
<html>
<body>
<p>
http://www.espncricinfo.com/
</p>
</body>
</html>
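Is there any way I can get a more concrete result, and what is wrong with the code? Sorry if I am clueless.
Try passing the HTML document, not the URL, to BeautifulSoup before calling prettify: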
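import urllib2
from bs4 import BeautifulSoup
url = "http://www.espncricinfo.com/"
# urlopen returns a file-like response object for the page
web_page = urllib2.urlopen(url)
# Pass the response, not the URL string, to BeautifulSoup
soup = BeautifulSoup(web_page, 'html.parser')
print (soup.prettify().encode('utf-8'))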
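The code above is Python 2; in Python 3, urllib2 was split into urllib.request and urllib.error. As a rough sketch of the same fix on Python 3, something like the following should work. The helper name fetch_soup and the 10-second timeout are my own assumptions, not part of the original code.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Open `url` and parse the response body (hypothetical helper)."""
    # urlopen returns a file-like response; BeautifulSoup reads from it directly.
    with urlopen(url, timeout=timeout) as response:
        return BeautifulSoup(response, "html.parser")
```

Calling fetch_soup("http://www.espncricinfo.com/").prettify() should then give the parsed page rather than the URL wrapped in a <p> tag, assuming the site is reachable.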
You should pass a file-like object to BeautifulSoup, not the URL string:
soup = BeautifulSoup(web_page, "lxml")
The URL is opened by urllib2.urlopen(url), and the response object it returns is stored in web_page.
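Since the original goal was the visible text rather than the prettified markup, here is a minimal sketch of one common approach: drop the elements a browser does not render as text, then call get_text(). The visible_text helper, the HIDDEN_TAGS list, and the sample HTML are illustrative assumptions. The list is not exhaustive, and this will not catch text hidden by CSS or text that JavaScript adds after the page loads. It is written for Python 3, and io.BytesIO stands in for the file-like response that urlopen returns.

```python
import io

from bs4 import BeautifulSoup

# Elements whose contents are not shown as page text (assumed, non-exhaustive).
HIDDEN_TAGS = ["head", "script", "style", "noscript", "template"]


def visible_text(markup):
    """Return the text of `markup` with non-rendered elements removed."""
    soup = BeautifulSoup(markup, "html.parser")
    # Detach every hidden element, together with everything inside it.
    for tag in soup.find_all(HIDDEN_TAGS):
        tag.extract()
    # Join the remaining text fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)


# Sample page used only for illustration.
sample = b"""
<html>
  <head><title>Demo</title><style>p { color: red; }</style></head>
  <body>
    <h1>Scores</h1>
    <script>var x = 1;</script>
    <p>India won by <b>5</b> wickets.</p>
  </body>
</html>
"""

response = io.BytesIO(sample)  # stands in for the object returned by urlopen
text = visible_text(response)
print(text)  # Scores India won by 5 wickets.
```

The same function accepts a real response object, so visible_text(web_page) should work once web_page holds the opened URL.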