BeautifulSoup: get only the "general" text in a td tag, and nothing in nested tags
Suppose my HTML looks like this:
<td>Potato1 <span somestuff...>Potato2</span></td>
...
<td>Potato9 <span somestuff...>Potato10</span></td>
I've used BeautifulSoup to do this:
for tag in soup.find_all("td"):
    print(tag.text)
Then I get:
Potato1 Potato2
....
Potato9 Potato10
Is it possible to get only the text directly inside the td tag, without any of the text nested inside the span tag?
You can use .contents, as in:
>>> for tag in soup.find_all("td"):
...     print(tag.contents[0])
...
Potato1
Potato9
How does it work?
.contents gives you the tag's children as a list:
>>> for tag in soup.find_all("td"):
...     print(tag.contents)
...
['Potato1 ', <span somestuff...="">Potato2</span>]
['Potato9 ', <span somestuff...="">Potato10</span>]
Since we are only interested in the first element, we use:
print(tag.contents[0])
Another approach, which unlike tag.contents[0] guarantees that the result is a NavigableString and not text from inside a child Tag, is:
[child for tag in soup.find_all("td")
 for child in tag if isinstance(child, bs.NavigableString)]
Here is an example that highlights the difference:
import bs4 as bs
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
soup = bs.BeautifulSoup(content, "html.parser")  # name the parser explicitly
print([tag.contents[0] for tag in soup.find_all("td")])
# ['Potato1 ', <span>FOO</span>, <span>Potato10</span>]
print([child for tag in soup.find_all("td")
       for child in tag if isinstance(child, bs.NavigableString)])
# ['Potato1 ', 'Potato9']
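If you would rather have one string per cell than a flat list, the same idea can be sketched as a small helper. This is only a sketch under a few assumptions: the name direct_text is made up for illustration, the "html.parser" backend is my choice, and the sample cells are wrapped in a table row. It relies on find_all(string=True, recursive=False), which only looks at a tag's direct text children.

```python
from bs4 import BeautifulSoup


def direct_text(tag):
    """Return the text directly inside `tag`, skipping text in nested tags.

    `direct_text` is a hypothetical helper name, not part of BeautifulSoup.
    """
    # recursive=False restricts the search to direct children;
    # string=True keeps only text nodes (NavigableString objects).
    pieces = tag.find_all(string=True, recursive=False)
    return "".join(pieces).strip()


html = (
    "<table><tr>"
    "<td>Potato1 <span>Potato2</span></td>"
    "<td><span>FOO</span></td>"
    "<td><span>Potato10</span>Potato9</td>"
    "</tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

# One entry per <td>; a cell with no direct text yields an empty string.
cells = [direct_text(td) for td in soup.find_all("td")]
print(cells)
```

Unlike the flat list comprehension, this keeps the result aligned with the cells, so a td that contains only a span shows up as an empty string instead of disappearing.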
Alternatively, with lxml, you can use the XPath td/text():
import lxml.html as LH
content = '''
<td>Potato1 <span>Potato2</span></td>
<td><span>FOO</span></td>
<td><span>Potato10</span>Potato9</td>
'''
root = LH.fromstring(content)
print(root.xpath('td/text()'))
which yields
['Potato1 ', 'Potato9']
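The XPath above returns one flat list for all cells. If you need the text grouped per cell, a rough sketch is to run text() relative to each td instead. The variable names and the table wrapper around the sample cells are assumptions made for this illustration.

```python
import lxml.html as LH

html = (
    "<table><tr>"
    "<td>Potato1 <span>Potato2</span></td>"
    "<td><span>FOO</span></td>"
    "<td><span>Potato10</span>Potato9</td>"
    "</tr></table>"
)
root = LH.fromstring(html)

# For each <td>, text() selects only its own text nodes, so text inside
# the nested <span> is skipped; a cell without direct text becomes "".
per_cell = ["".join(td.xpath("text()")).strip() for td in root.xpath(".//td")]
print(per_cell)
```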