Web scraping with Python 3.4: extracting a paragraph

Question

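I'm using requests and bs4 to scrape data from a web page. I have a string containing a few words from one of the page's paragraphs, and I'd like to know how to extract the whole paragraph that contains it. If anyone knows how to do this, please let me know! Thanks :)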

Answer 1

The obvious approach is to iterate over all the paragraphs and find the one that contains your text:

for p in soup.find_all('p'):
    if few_words in p.text:
        # found it, do something
        print(p.text)
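
A slightly fuller sketch of the same loop, for anyone who wants something they can run as-is. The sample HTML, the helper names `normalize` and `paragraphs_containing`, and the whitespace normalization are illustrative assumptions, not part of the original answer. Collapsing whitespace can help when the paragraph is broken across lines in the page source.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; in practice this would be the body of a
# requests response, e.g. requests.get(url).text.
html = """
<html><body>
  <p>First paragraph with nothing special.</p>
  <p>The quick brown fox
     jumped over the <b>lazy</b> dog.</p>
</body></html>
"""

def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())

def paragraphs_containing(soup, few_words):
    """Return the normalized text of every <p> whose text contains few_words."""
    needle = normalize(few_words)
    matches = []
    for p in soup.find_all("p"):
        text = normalize(p.get_text())
        if needle in text:
            matches.append(text)
    return matches

soup = BeautifulSoup(html, "html.parser")
result = paragraphs_containing(soup, "jumped over the lazy")
print(result)
```

Because `get_text()` flattens nested tags such as `<b>`, the match still works when the words span inline markup.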

Answer 2

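Here are a few very simple cases that come in handy when web scraping. This partially answers your question, but since you didn't provide more information, my data and approach are guesses at best.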

from bs4 import BeautifulSoup as bsoup
import re

html = """
<span>
    <div id="foo">
        The quick brown fox jumped
    </div>
    <p id="bar">
        over the lazy dog.
    </p>
</span>
"""

soup = bsoup(html, "html.parser")

# Find the div with id "foo" and get
# its inner text and print it.

foo = soup.find_all(id="foo")
f = foo[0].get_text()
print(f)

print("-" * 50)

# Find the p with id "bar", get its
# inner text, strip leading and trailing
# whitespace, and print it out.

bar = soup.find_all(id="bar")
b = bar[0].get_text().strip()
print(b)

print("-" * 50)

# Find the word "lazy". Get its parent
# tag. If it's a p tag, get that p tag's
# parent, then get all the text inside that
# parent, collapse all extra whitespace, and print.
lazy = soup.find_all(text=re.compile("lazy"))
lazy_tag = lazy[0].parent

if lazy_tag.name == "p":
    lazy_grandparent = lazy_tag.parent
    all_text = lazy_grandparent.get_text()
    all_text = " ".join(all_text.split())
    print(all_text)

Output:

        The quick brown fox jumped

--------------------------------------------------
over the lazy dog.
--------------------------------------------------
The quick brown fox jumped over the lazy dog.
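
If the goal is the enclosing paragraph itself rather than its parent, the same text search can be combined with `find_parent`. This is only a minimal sketch: it assumes the words fall within a single text node, the sample HTML and the helper name `enclosing_paragraph` are hypothetical, and it uses `string=`, the newer name for the `text=` argument (assuming Beautiful Soup 4.4 or later).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup; the words we search for sit inside an <em> within a <p>.
html = """
<div>
    <p id="intro">The quick brown fox jumped</p>
    <p id="bar">over the <em>lazy dog</em>, twice.</p>
</div>
"""

def enclosing_paragraph(soup, few_words):
    """Return the text of the nearest <p> around the first match, or None."""
    # re.escape keeps characters such as "." or "?" from acting as regex syntax.
    node = soup.find(string=re.compile(re.escape(few_words)))
    if node is None:
        return None
    # Climb from the matching text node to the closest <p> ancestor.
    p = node.find_parent("p")
    if p is None:
        return None
    # Collapse whitespace so the paragraph comes back on one line.
    return " ".join(p.get_text().split())

soup = BeautifulSoup(html, "html.parser")
print(enclosing_paragraph(soup, "lazy dog"))
```

Climbing with `find_parent("p")` avoids guessing how many levels of inline tags sit between the matched text and its paragraph.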

Answer 3

for para in request_soup.p.find_all(text=True, recursive=True):
    print(para)

You can use this to extract the paragraph even if other tags come before the <p> tag.
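
To make the effect of that loop concrete, here is a small sketch on hypothetical markup. It assumes `request_soup` was built from the response body. `request_soup.p` is the first `<p>` in the document, and `find_all(string=True)` (the newer spelling of `text=True`, assuming Beautiful Soup 4.4 or later) yields each text node inside it, including those nested in inline tags.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a heading and a div precede the first <p>.
html = """
<h1>Title</h1>
<div>Not a paragraph</div>
<p>Some <b>bold</b> and <i>italic</i> text.</p>
<p>A second paragraph.</p>
"""

request_soup = BeautifulSoup(html, "html.parser")

# Every text node inside the first <p>, in document order.
pieces = [str(node) for node in request_soup.p.find_all(string=True)]
print(pieces)

# Joining the pieces rebuilds the paragraph's full text.
paragraph = "".join(pieces)
print(paragraph)
```

Note that `request_soup.p` only reaches the first paragraph on the page; to search all of them, loop over `find_all('p')` as in Answer 1.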

