Using BeautifulSoup, how do I get only the text of a specific selector, without the text of its children?
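I can't figure out how to write the BeautifulSoup code so that it gives me only the text of the selected tag. I also get the text of its child(ren)!
For example: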
from bs4 import BeautifulSoup
soup = BeautifulSoup('<div id="left"><ul><li>"I want this text"<a href="someurl.com"> I don\'t want this text</a><p>I don\'t want this either</li><li>"Good"<a href="someurl.com"> Not Good</a><p> Not Good either</li></ul></div>', "html5lib")
x = soup.select('ul > li')
for i in x:
    print(i.text)
Output:
"I want this text" I don't want this textI don't want this either
"Good" Not Good Not Good either
Desired output:
"I want this text"
"Good"
One option is to take the first element of the contents list:
for i in x:
    print(i.contents[0])
Another is to find the first text node:
for i in x:
    # string=True matches text nodes; older bs4 releases spell this text=True
    print(i.find(string=True))
Both print:
"I want this text"
"Good"
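Both options above return only the first node. If the tag can also have text after a child element, one way to collect every direct text node is a non-recursive `find_all`. This is a minimal sketch, assuming bs4 4.4.0 or newer (for the `string=` argument) and the built-in "html.parser" instead of html5lib. `own_text` is a hypothetical helper name and the sample HTML is made up for illustration:

```python
from bs4 import BeautifulSoup

# Made-up sample: the second <li> has extra text after its <a> child.
html = (
    '<ul>'
    '<li>"I want this text"<a href="someurl.com"> not this</a><p>nor this</p></li>'
    '<li>"Good"<a href="someurl.com"> Not Good</a> tail</li>'
    '</ul>'
)
soup = BeautifulSoup(html, "html.parser")


def own_text(tag):
    """Join only the strings that are direct children of `tag`."""
    # recursive=False restricts the search to direct children,
    # so text inside <a> and <p> is skipped.
    return "".join(tag.find_all(string=True, recursive=False)).strip()


results = [own_text(li) for li in soup.select("ul > li")]
for r in results:
    print(r)
# "I want this text"
# "Good" tail
```

Whether you want the trailing piece (" tail" here) depends on your data; if you only ever need the leading text, the first-node approaches are enough.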
from bs4 import BeautifulSoup
from bs4 import NavigableString
soup = BeautifulSoup('<div id="left"><ul><li>"I want this text"<a href="someurl.com"> I don\'t want this text</a><p>I don\'t want this either</li><li>"Good"<a href="someurl.com"> Not Good</a><p> Not Good either</li></ul></div>', "html5lib")
x = soup.select('ul > li')
for i in x:
    # print only when the element right after the <li> start tag is a string
    if isinstance(i.next_element, NavigableString):
        print(i.next_element)
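The isinstance check matters when an `<li>` starts with a tag rather than with text. The sketch below compares the three approaches on such a case. It assumes the built-in "html.parser" and bs4 4.4.0 or newer, and the sample HTML and variable names are made up for illustration:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Made-up sample: the <li> begins with an <a> tag, not with its own text.
soup = BeautifulSoup(
    '<ul><li><a href="someurl.com">link</a> trailing</li></ul>',
    "html.parser",
)
li = soup.select_one("ul > li")

first_child = li.contents[0]       # the <a> Tag, not a string
first_text = li.find(string=True)  # searches descendants, so it finds "link"
guarded = li.next_element if isinstance(li.next_element, NavigableString) else None

print(type(first_child).__name__)  # Tag
print(str(first_text))             # link
print(guarded)                     # None
```

So `contents[0]` and `find(string=True)` can both hand back something that belongs to a child here, while the guarded `next_element` version simply skips the item.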