Deleting anything but plain text in Python

Question

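I am trying to make my code grab only what is between the <p> tags. I have not found a way to do it yet.

I tried using a simple loop. The program is supposed to take a URL as input and display the plain text when you run it.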

    import urllib.request
    import urllib.parse
    import re

    print("Enter the URL")
    url = input()

    #url = "https://en.wikipedia.org/wiki/Somalia"
    values = {'s':'basic', 'submit':'search'}
    data = urllib.parse.urlencode(values)
    data = data.encode('utf-8')
    req = urllib.request.Request(url,data)
    resp = urllib.request.urlopen(req)
    respData = resp.read()

    #print(respData)

    paragraphs = re.findall(r'<p>(.*?)</p>', str(respData))

    for eachP in paragraphs:
        print(eachP)

I have also tried using BeautifulSoup, but I have not managed to import it yet.

Answer 1

Welcome to SO and to programming. You can't parse [X]HTML with regex. Time to use libraries: Beautiful Soup and requests are your best friends.
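To see why the regex approach is fragile, here is a minimal sketch that runs the pattern from the question against a small, made-up HTML snippet. `SAMPLE_HTML` and `find_paragraphs_with_regex` are hypothetical names used only for illustration. Real pages vary in many more ways than this.

```python
import re

# Hypothetical sample markup, not taken from any real page.
SAMPLE_HTML = (
    '<p>plain paragraph</p>\n'
    '<p class="lead">paragraph with an attribute</p>\n'
    '<p>paragraph split\nover two lines</p>'
)


def find_paragraphs_with_regex(html):
    """Apply the same pattern the question uses."""
    return re.findall(r'<p>(.*?)</p>', html)


matches = find_paragraphs_with_regex(SAMPLE_HTML)
# Only the first paragraph is found: the second has an attribute on its tag,
# and the third contains a newline, which "." does not match by default.
print(matches)
```

Flags such as `re.DOTALL` can patch individual cases, but every patch tends to uncover another one, which is why an HTML parser is the safer tool.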

Type this in your bash/cmd/terminal:

    pip install requests
    pip install beautifulsoup4

Then use:

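    import requests
    from bs4 import BeautifulSoup

    # Download the page; r.text is the decoded HTML.
    r = requests.get("https://en.wikipedia.org/wiki/Somalia")
    # Name the parser explicitly; bs4 warns and guesses one if it is omitted.
    soup = BeautifulSoup(r.text, "html.parser")
    # Print the text of every <p> element, with nested tags stripped.
    for p in soup.find_all('p'):
        print(p.text)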
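If installing or importing Beautiful Soup is the sticking point, the standard library's `html.parser` module can do a rougher version of the same job. The following is only a sketch under that assumption: `ParagraphTextExtractor` and `extract_paragraphs` are hypothetical helper names, and the sample markup is made up. It is less forgiving of messy HTML than Beautiful Soup.

```python
from html.parser import HTMLParser


class ParagraphTextExtractor(HTMLParser):
    """Collects the text inside <p> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_paragraph = False
        self._chunks = []       # text pieces of the paragraph being read
        self.paragraphs = []    # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # An unclosed <p> ends implicitly when the next one starts.
            self.finish_paragraph()
            self._in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.finish_paragraph()

    def handle_data(self, data):
        # Text inside nested tags such as <b> also arrives here.
        if self._in_paragraph:
            self._chunks.append(data)

    def finish_paragraph(self):
        if self._in_paragraph:
            self.paragraphs.append("".join(self._chunks).strip())
        self._chunks = []
        self._in_paragraph = False


def extract_paragraphs(html):
    """Return the text of each <p> element in an HTML string."""
    parser = ParagraphTextExtractor()
    parser.feed(html)
    parser.close()              # flush any text still buffered
    parser.finish_paragraph()   # keep a trailing paragraph with no </p>
    return parser.paragraphs


# Made-up sample markup for demonstration.
sample = '<p class="lead">Hello <b>world</b></p><div>skip</div><p>Second one</p>'
for text in extract_paragraphs(sample):
    print(text)
```

To run it on a real page, pass in a decoded string rather than raw bytes, for example `urllib.request.urlopen(url).read().decode("utf-8")`, assuming the page is UTF-8 encoded. Calling `str()` on the bytes, as in the question, produces the `b'...'` representation with escaped characters instead of the actual text.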

python

urllib