Removing tags from HTML, except specific ones (but keep their contents)
I use this code to remove all tag elements from HTML. I need to keep <br> and <br/>, so I use this code:
import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
# Group 1 captures <br> and <br/> and is written back; any other tag matches the second alternative and is dropped.
MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
print(MyString)
The output is:
aaaRadio and<BR> television.<br>very<br/> popular in the world today.Millions of people watch TV. That’s because a radio is very small 98.2%and it‘s easy to carry. haha100%bb
The result is correct, but now I want to keep <p>, </p>, <br> and <br/>.
How can I modify my code?
I'm not sure regex is the right solution here, but since you asked:
import re
# html is the input string (MyString in the question)
html = html.replace("<p>", "{p}").replace("</p>", "{/p}")
txt = re.sub(r"<[^>]*>", "", html)
txt = txt.replace("{p}", "<p>").replace("{/p}", "</p>")
I basically change the p tags to another marker and swap them back after all the other tags have been removed.
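The same placeholder idea can be extended to several tags at once. The sketch below is only an illustration: the function name `strip_tags_keep` and the use of the control characters `\x00` and `\x01` as markers are my own assumptions, chosen because they are unlikely to appear in real text (a marker like `{p}` could collide with the content).

```python
import re

def strip_tags_keep(html, keep=("p", "br")):
    """Remove all tags except those named in keep (hypothetical helper)."""
    names = "|".join(re.escape(name) for name in keep)
    # Step 1: protect kept tags by swapping their angle brackets for markers.
    protected = re.sub(
        rf"(?i)<(/?(?:{names})\s*/?)>",
        lambda m: "\x00" + m.group(1) + "\x01",
        html,
    )
    # Step 2: drop every tag that is still written with angle brackets.
    stripped = re.sub(r"<[^>]*>", "", protected)
    # Step 3: turn the markers back into angle brackets.
    return stripped.replace("\x00", "<").replace("\x01", ">")

print(strip_tags_keep('a<p>x<BR>y<br/><span>z</span></p>b'))
# a<p>x<BR>y<br/>z</p>b
```

Note that this only protects tags without attributes; something like `<p class="x">` would still be removed.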
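Parsing HTML with regex is generally not a good idea.
Using an HTML parser is more reliable than using regex. Regex should not be used to parse nested structures such as HTML.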
Here is a working implementation that iterates over all HTML tags and, for those that are not p or br, removes the tag while keeping its contents:
from bs4 import BeautifulSoup
mystring = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
soup = BeautifulSoup(mystring, 'html.parser')
# unwrap() removes the tag itself but leaves its children in place
for e in soup.find_all():
    if e.name not in ['p', 'br']:
        e.unwrap()
print(soup)
Output:
aaa<p>Radio and<br/> television.<br/></p><p>very<br> popular in the world today.</br></p><p>Millions of people watch TV. </p><p>That’s because a radio is very small 98.2%</p><p>and it‘s easy to carry. haha100%</p>bb
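If installing BeautifulSoup is not an option, roughly the same parser-based approach can be sketched with the standard library's `html.parser`. This is a minimal sketch, not a drop-in replacement: the class name `KeepTags` and the function `strip_with_parser` are hypothetical, kept tags are re-emitted without their attributes, and tag names come out lowercased.

```python
from html.parser import HTMLParser

class KeepTags(HTMLParser):
    """Collect text plus only the tags listed in keep (hypothetical helper)."""

    def __init__(self, keep=("p", "br")):
        # convert_charrefs=False so entities such as &amp; are passed through as written
        super().__init__(convert_charrefs=False)
        self.keep = set(keep)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.keep:
            self.out.append(f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing forms such as <br/>
        if tag in self.keep:
            self.out.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag in self.keep:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

def strip_with_parser(html, keep=("p", "br")):
    parser = KeepTags(keep)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(strip_with_parser('a<p>x<BR>y<br/><span class="c">z</span></p>b'))
# a<p>x<br>y<br/>z</p>b
```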
Now I know how to modify it, but the first <p> is missing.
My code:
import re
MyString = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular in the world today.</p><p>Millions of people watch TV. </p><p>That’s because a radio is very small <span_style=":_black;">98.2%</span></p><p>and it‘s easy to carry. <span_style=":_black;">haha100%</span></p>bb'
# MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>', r'\1', MyString)
MyString = re.sub(r'(?i)(<br/?>)|<[^>]*>(</?p>)|<[^>]*>', r'\1\2', MyString)
print(MyString)
The output is:
aaaRadio and<BR> television.<br><p>very<br/> popular in the world today.<p>Millions of people watch TV. <p>That’s because a radio is very small 98.2%</p>and it‘s easy to carry. haha100%</p>bb
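The first <p> disappears because the second alternative, `<[^>]*>(</?p>)`, only keeps a p tag when some other tag comes immediately before it. A p tag that follows plain text falls through to the last alternative and is deleted. One possible fix, sketched here under the assumption that the tags to keep carry no attributes, is to put every tag to keep into a single capturing group (the names `TAG_RE` and `keep_p_and_br` are only illustrative):

```python
import re

# Group 1 lists every tag to keep; any other tag matches the second alternative.
TAG_RE = re.compile(r'(?i)(</?p>|<br\s*/?>)|<[^>]*>')

def keep_p_and_br(html):
    # Write kept tags back unchanged; replace all other tags with ''.
    return TAG_RE.sub(lambda m: m.group(1) or '', html)

sample = 'aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular.</p><span>x</span>bb'
print(keep_p_and_br(sample))
# aaa<p>Radio and<BR> television.<br></p><p>very<br/> popular.</p>xbb
```

Because alternatives are tried left to right, the kept tags must come first; otherwise `<[^>]*>` would match them before group 1 gets a chance.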