Web scraping in Python 3.4: extract a paragraph
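I am using requests and bs4 to scrape data from a web page.
I have a string containing a few words from a paragraph on the page, and I would like to know how to extract the whole paragraph that contains it. If anyone knows how to do this, please tell me! Thanks :)
The obvious approach is to iterate over all the paragraphs and find the one that contains your words: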
for p in soup.find_all('p'):
    if few_words in p.text:
        # found it, do something (for example, print it)
        print(p.text)
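A slightly fuller sketch of that loop, assuming the page HTML is already in a string (for example `requests.get(url).text`). The helper name `find_paragraphs` and the sample HTML are made up for illustration and are not part of bs4. Whitespace is collapsed on both sides before comparing, because a paragraph's text often spans several lines in the page source.

```python
from bs4 import BeautifulSoup


def find_paragraphs(html, few_words):
    """Return the text of every <p> whose text contains few_words.

    Hypothetical helper for illustration; not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")
    needle = " ".join(few_words.split())
    matches = []
    for p in soup.find_all("p"):
        # Collapse runs of whitespace so that line breaks in the
        # page source do not prevent a match.
        text = " ".join(p.get_text().split())
        if needle in text:
            matches.append(text)
    return matches


# Made-up sample page; the second paragraph wraps across two lines.
sample_html = """
<div>
  <p>First paragraph, nothing to see here.</p>
  <p>The quick brown fox
     jumped over the <b>lazy</b> dog.</p>
</div>
"""

print(find_paragraphs(sample_html, "over the lazy dog"))
```

Returning a list rather than the first hit keeps the sketch usable when the same words appear in more than one paragraph.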
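Here are some very simple cases that come in handy when scraping. They answer your question only partly: since you did not provide more information, my data and approach are hypothetical at best.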
from bs4 import BeautifulSoup as bsoup
import re

html = """
<span>
  <div id="foo">
    The quick brown fox jumped
  </div>
  <p id="bar">
    over the lazy dog.
  </p>
</span>
"""

# Name the parser explicitly so the result does not depend on
# which parsers happen to be installed.
soup = bsoup(html, "html.parser")
# prettify() returns a formatted string; the return value is unused here.
soup.prettify()

# Find the div with id "foo" and get
# its inner text and print it.
foo = soup.find_all(id="foo")
f = foo[0].get_text()
print(f)
print("-" * 50)

# Find the p with id "bar", get its
# inner text, strip leading and trailing whitespace,
# and print it out.
bar = soup.find_all(id="bar")
b = bar[0].get_text().strip()
print(b)
print("-" * 50)

# Find the word "lazy". Get its parent
# tag. If it's a p tag, get that p tag's
# parent, then get all the text inside that
# parent, strip all extra spaces, and print.
# (Newer bs4 releases prefer string= over text=.)
lazy = soup.find_all(text=re.compile("lazy"))
lazy_tag = lazy[0].parent
if lazy_tag.name == "p":
    lazy_grandparent = lazy_tag.parent
    all_text = lazy_grandparent.get_text()
    all_text = " ".join(all_text.split())
    print(all_text)
Result:
The quick brown fox jumped
--------------------------------------------------
over the lazy dog.
--------------------------------------------------
The quick brown fox jumped over the lazy dog.
for para in request_soup.p.find_all(text=True, recursive=True):
    print(para)
You can use this to extract the paragraph's text even when other tags come before the <p> tag.
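To make the effect concrete, here is a small sketch with a <p> that follows another tag and contains nested tags. The sample HTML and the names `pieces` and `paragraph` are illustrative assumptions. Asking `find_all` for strings returns every text node inside the tag, nested ones included, and joining them gives the paragraph's text. Note that `request_soup.p` only ever refers to the first <p> in the document.

```python
from bs4 import BeautifulSoup

# Made-up sample: a heading comes before the <p>, and the <p>
# itself contains nested <b> and <a> tags.
sample_html = """
<div><h2>Heading</h2>
<p>The <b>quick</b> brown fox <a href="#">jumped</a> over.</p></div>
"""

request_soup = BeautifulSoup(sample_html, "html.parser")

# string=True matches every text node under the first <p>;
# older bs4 releases spell this argument text=True.
pieces = request_soup.p.find_all(string=True)

# Joining the text nodes reconstructs the paragraph's text.
paragraph = "".join(pieces)
print(paragraph)
```

For this simple case the joined string is the same as what `request_soup.p.get_text()` returns; iterating over the individual text nodes is mainly useful when each piece needs separate handling.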