Extract data from a static HTML file using Python 3.5
I have a static HTML page saved on my local computer. I have tried both a plain file read and BeautifulSoup. With a plain open, the read stops partway through with a Unicode error, so I never get the whole HTML file; and the BeautifulSoup examples I have found only work against live websites.
# With BeautifulSoup
from bs4 import BeautifulSoup
import urllib.request

url = "Stack Overflow.html"
# Note: this line fails, because urlopen() needs a URL scheme
# (http://, file://), not a bare local file name.
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read())
universities = soup.find_all('a', class_='institution')
for university in universities:
    print(university['href'] + "," + university.string)

# Simple file read
with open('Stack Overflow.html', encoding='utf-8') as f:
    for line in f:
        print(repr(line))
After reading the HTML, I want to extract the data from the ul and li elements that have no attributes. Any recommendations are welcome.
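If the Unicode error means the saved file is not actually UTF-8, one workaround is to open it in binary mode and let BeautifulSoup detect the encoding itself. A minimal sketch (the file name and contents here are made up; it writes a small non-UTF-8 page into a temp directory only so the example is self-contained):

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Stand-in for the locally saved page: a tiny document in a
# non-UTF-8 encoding (windows-1252), so a plain
# open(path, encoding='utf-8') read would raise UnicodeDecodeError.
html_bytes = "<html><body><ul><li>caf\u00e9</li></ul></body></html>".encode("windows-1252")
path = os.path.join(tempfile.mkdtemp(), "page.html")
with open(path, "wb") as f:
    f.write(html_bytes)

# Opening in binary mode sidesteps the decode error entirely;
# BeautifulSoup sniffs the encoding from the raw bytes.
with open(path, "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

items = [li.get_text() for li in soup.find_all("li")]
print(items)
```

Alternatively, open(path, encoding='utf-8', errors='replace') reads the whole file but substitutes undecodable bytes with U+FFFD, which may be acceptable if you only need the document structure.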
I am not sure I follow everything. What I do understand is that you want to read the whole HTML file from local storage and parse some DOM elements with bs4, right?
Here is some code I would recommend:
from bs4 import BeautifulSoup

with open("Stack Overflow.html", encoding="utf-8") as f:
    data = f.read()

soup = BeautifulSoup(data, 'html.parser')

# universities = soup.find_all('a', class_='institution')
# for university in universities:
#     print(university['href'] + "," + university.string)

ul_list = soup.select("ul")
for ul in ul_list:
    if not ul.attrs:
        for li in ul.select("li"):
            if not li.attrs:
                print(li.get_text().strip())
This question is about how to construct a BeautifulSoup object.
To parse a document, pass it into the BeautifulSoup constructor. You
can pass in a string or an open filehandle:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"))
soup = BeautifulSoup("<html>data</html>")
You only need to pass a file object to BeautifulSoup; there is no need to specify the encoding, BeautifulSoup will handle it.
First, the document is converted to Unicode, and HTML entities are
converted to Unicode characters:
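A quick illustration of that entity conversion (a minimal sketch using the built-in html.parser; the markup here is made up):

```python
from bs4 import BeautifulSoup

# HTML entities in the input markup...
soup = BeautifulSoup("<p>caf&eacute; &amp; cr&egrave;me</p>", "html.parser")

# ...come back as plain Unicode characters in the parsed tree.
print(soup.p.get_text())
```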
If you are having trouble extracting the data, you should post the HTML code. An excerpt:
import bs4
html = '''<ul class="indent"> <li><i>dependency-check version</i>: 1.4.3</li> <li><i>Report Generated On</i>: Dec 30, 2016 at 13:33:27 UTC</li> <li><i>Dependencies Scanned</i>: 0 (0 unique)</li> <li><i>Vulnerable Dependencies</i>: 0</li> <li><i>Vulnerabilities Found</i>: 0</li> <li><i>Vulnerabilities Suppressed</i>: 0</li> <li class="scaninfo">...</li>'''
soup = bs4.BeautifulSoup(html, 'lxml')
for i in soup.find_all('li', class_=False):
    print(i.text)
Output:
dependency-check version: 1.4.3
Report Generated On: Dec 30, 2016 at 13:33:27 UTC
Dependencies Scanned: 0 (0 unique)
Vulnerable Dependencies: 0
Vulnerabilities Found: 0
Vulnerabilities Suppressed: 0