Adding elements to BeautifulSoup's find_all list as a string
I am testing web scraping concepts with BeautifulSoup's find_all() function. I am trying to get the contents of the p tags with class='first' inside the div with class='dinner'.
from bs4 import BeautifulSoup
import urllib2
html_doc="""
<html>
<head>
<title>The practice html document</title>
</head>
<body>
<div class='dinner'>
<p class='first'>I like pizza</p>
<p class='second'>I really like pizza</p>
<p class='first'>pizza is good</p>
</div>
<div class='breakfast'>
<p class='first'>pancake</p>
</div>
<div class='lunch'>
<p> This is a paragraph</p>
</div>
</body>
</html>
"""
soup=BeautifulSoup(html_doc)
div_stuff=soup.find("div", attrs={'class':'dinner'})
print div_stuff
print '\n'
#This prints the paragraphs only in the div with the class dinner
div_paragraphs=unicode(div_stuff.find_all('p', attrs={'class':'first'}))
print div_paragraphs
The find_all function returns each paragraph it finds as an element in a list. This is the output of the code:
<div class="dinner">
<p class="first">I like pizza</p>
<p class="second">I really like pizza</p>
<p class="first">pizza is good</p>
</div>
[<p class="first">I like pizza</p>, <p class="first">pizza is good</p>]
The goal is to get only the contents of the paragraphs, as strings in a list. Like this:
[I like pizza,pizza is good]
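I could write some code that loops over each element and replaces it after all instances have been found, but I wanted to see whether there is a way to turn each element into a string before find_all stores it in the list.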
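.find_all() returns the matching elements; you are searching for elements, not for the text they contain (that would be a very different search).
You can easily extract the text in a list comprehension: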
[elem.get_text() for elem in soup.select('div.dinner p.first')]
I used a CSS selector here to match the p tags in the context of their div parents.
Demo:
>>> from bs4 import BeautifulSoup
>>> html_doc="""
... <html>
... <head>
... <title>The practice html document</title>
... </head>
... <body>
... <div class='dinner'>
... <p class='first'>I like pizza</p>
... <p class='second'>I really like pizza</p>
... <p class='first'>pizza is good</p>
... </div>
... <div class='breakfast'>
... <p class='first'>pancake</p>
... </div>
... <div class='lunch'>
... <p> This is a paragraph</p>
... </div>
... </body>
... </html>
... """
>>> soup = BeautifulSoup(html_doc)
>>> [elem.get_text() for elem in soup.select('div.dinner p.first')]
[u'I like pizza', u'pizza is good']
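If you would rather stay with find() and find_all() as in the question, the same list can be built in two steps: locate the div first, then collect the text of the matching paragraphs. The following is only a sketch under a few assumptions: it targets Python 3 (so the strings print without the u prefix), it names the built-in "html.parser" explicitly, and first_paragraph_texts is a hypothetical helper name, not something BeautifulSoup provides.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class='dinner'>
<p class='first'>I like pizza</p>
<p class='second'>I really like pizza</p>
<p class='first'>pizza is good</p>
</div>
<div class='breakfast'>
<p class='first'>pancake</p>
</div>
"""

# Assumption: use the parser bundled with Python so no extra install is needed.
soup = BeautifulSoup(html_doc, "html.parser")


def first_paragraph_texts(soup):
    """Hypothetical helper: text of every p.first inside div.dinner."""
    dinner = soup.find("div", class_="dinner")
    if dinner is None:
        # find() returns None when nothing matches, so guard before chaining.
        return []
    # find_all() yields Tag objects; get_text() turns each one into a string.
    return [p.get_text() for p in dinner.find_all("p", class_="first")]


texts = first_paragraph_texts(soup)
print(texts)  # ['I like pizza', 'pizza is good']
```

The None check matters because find() returns None when the div is missing, and calling find_all() on None would raise an AttributeError. The select() version above avoids that case, since it simply returns an empty list when nothing matches.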