Python Beautiful Soup: Most Efficient Way to Find Tags
I am using Python and BeautifulSoup to parse many large XML files. I frequently run into the following task:
<Section1>
<Report>
<Matrix>...</Matrix>
<Matrix>...</Matrix>
<Matrix>...</Matrix>
<Matrix>...</Matrix>
</Report>
</Section1>
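I am trying to collect all of the matrices and iterate over them. I use code like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

res = urlopen(url)  # url points at one of the XML files
html = res.read()
soup = BeautifulSoup(html, 'xml')
matrices = soup.find("Section1").find_all("Matrix")
# Then I handle each matrix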
Why can't I use a selector like this?
matrices = soup.find("Section1 Matrix")
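Is there a faster way? Sometimes I am accessing nodes nested much deeper in the XML, and I need to make sure they are descendants, though not necessarily direct children, of several other nodes. The example provided is a simplification. Any help would be appreciated.
BeautifulSoup supports CSS selectors, but you need to pass the selector to the .select method: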
In [1]: from bs4 import BeautifulSoup as BS
In [2]: soup = BS("""<Section1>
...: <Report>
...: <Matrix>...</Matrix>
...: <Matrix>...</Matrix>
...: <Matrix>...</Matrix>
...: <Matrix>...</Matrix>
...: </Report>
...: </Section1>""", "xml")
In [3]: soup.select("Section1 Matrix")
Out[3]:
[<Matrix>...</Matrix>,
<Matrix>...</Matrix>,
<Matrix>...</Matrix>,
<Matrix>...</Matrix>]
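To make the difference between the two calls concrete, here is a minimal sketch. It assumes bs4 and lxml are installed; the sample document, including the extra Matrix placed directly under Section1 and the a/b/c text, is made up for illustration. It shows that .find treats its argument as a single tag name, while .select understands the descendant (space) and child (>) combinators.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two Matrix nodes nested in Report, one directly under Section1.
xml_doc = """<Section1>
  <Report>
    <Matrix>a</Matrix>
    <Matrix>b</Matrix>
  </Report>
  <Matrix>c</Matrix>
</Section1>"""

soup = BeautifulSoup(xml_doc, "xml")

# find() looks for a tag literally named "Section1 Matrix", which does not exist.
not_found = soup.find("Section1 Matrix")

# A space in a CSS selector means "any descendant".
descendants = soup.select("Section1 Matrix")

# ">" restricts the match to direct children.
children = soup.select("Section1 > Matrix")

# The chained call from the question, for comparison with the descendant selector.
chained = soup.find("Section1").find_all("Matrix")

print(not_found)
print([m.get_text() for m in descendants])
print([m.get_text() for m in children])
```

The descendant selector and the chained find/find_all call should return the same nodes here, so the choice between them is mostly about readability; whether one is faster on large files is something to measure on your own data.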
If you want to get all of the Matrix nodes in the document, you can use the CSSSelector class from lxml.cssselect (see note 1).
In [1]: from lxml.cssselect import CSSSelector
In [2]: sel = CSSSelector("Matrix")
In [3]: from lxml.etree import fromstring
In [4]: xml_doc = '''<Section1>
...: <Report>
...: <Matrix>...</Matrix>
...: <Matrix>...</Matrix>
...: <Matrix>...</Matrix>
...: <Matrix>...</Matrix>
...: </Report>
...: </Section1>'''
In [5]: tree = fromstring(xml_doc)
In [6]: matrix = [el for el in sel(tree)]
In [7]: matrix
Out[7]:
[<Element Matrix at 0x7f84b5b8f388>,
<Element Matrix at 0x7f84b5b8fc48>,
<Element Matrix at 0x7f84b5b8fd88>,
<Element Matrix at 0x7f84b5b8fdc8>]
1. If cssselect is not already installed, install it with pip: pip install cssselect
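If installing cssselect is not an option, a similar descendant lookup can be sketched with path expressions instead of CSS selectors. Note that this swaps in a different technique from the one in the answer: it uses the standard library's xml.etree.ElementTree, whose findall method accepts a limited XPath subset. The sample document is again made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two Matrix nodes nested in Report, one directly under Section1.
xml_doc = """<Section1>
  <Report>
    <Matrix>a</Matrix>
    <Matrix>b</Matrix>
  </Report>
  <Matrix>c</Matrix>
</Section1>"""

root = ET.fromstring(xml_doc)  # root is the <Section1> element

# ".//Matrix" matches Matrix elements at any depth below the root.
all_matrices = root.findall(".//Matrix")

# A bare tag name matches direct children only.
direct_matrices = root.findall("Matrix")

print([m.text for m in all_matrices])
print([m.text for m in direct_matrices])
```

lxml's etree elements offer a findall method with the same path syntax, so the two lookups above should carry over to a tree built with lxml.etree.fromstring.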