How to remove <table> structure with Python from this case?
I have a case like this:
paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
<table>
<tr>
<td>
text title
</td>
<td>
text title 2
</td>
</tr>
</table>
<p> lorem ipsum</p>
'''
How can I remove the table structure above, including its contents, using Python?
I want the output to look like this:
paragraph = '''
<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
<p> lorem ipsum</p>
'''
You can use BeautifulSoup, especially PageElement.extract():
In [16]: from bs4 import BeautifulSoup
In [17]: soup = BeautifulSoup("""<p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br /><br />
....: <table>
....: <tr>
....: <td>
....: text title or some
....: </td>
....: </tr>
....: </table>
....: <p> lorem ipsum</p>""")
In [18]: _ = soup.table.extract()
In [19]: soup
Out[19]:
<html><body><p>Lorem ipsum dolor sit amet, consectetur adipisicing elit. Quidem molestiae consequuntur officiis corporis sint.<br/><br/>
</p>
<p> lorem ipsum</p></body></html>
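The session above removes only the first table and relies on whichever parser BeautifulSoup picks by default. As a minimal sketch of the same idea outside an interactive session, the helper below (the name `strip_tables` is hypothetical, not from the answer) removes every table and uses `decompose()` rather than `extract()`, since the removed element is not needed afterwards. It assumes the stdlib-backed `"html.parser"`, which does not wrap a fragment in `<html><body>`.

```python
from bs4 import BeautifulSoup


def strip_tables(html: str) -> str:
    """Return html with every <table> element and its contents removed.

    Hypothetical helper for illustration only.
    """
    # Assumption: "html.parser" is good enough for this fragment; it leaves
    # the fragment as-is instead of adding <html><body> around it.
    soup = BeautifulSoup(html, "html.parser")

    # Look up the next remaining table each time, so nested tables that were
    # already destroyed along with their parent are never touched again.
    table = soup.find("table")
    while table is not None:
        table.decompose()
        table = soup.find("table")

    return str(soup)
```

Calling `strip_tables(paragraph)` on the question's string should give back the markup without the table. The whitespace around the removed table stays in place, so a blank line may remain.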
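You can also try this basic string slicing:
start = paragraph.find('<table>')   # index where '<table>' begins
end = paragraph.find('</table>')    # index where '</table>' begins
# Keep everything before '<table>' and everything after the end of '</table>'
paragraph = paragraph[:start] + paragraph[end + len('</table>'):]
print(paragraph)
This works well enough for basic text extraction, provided the string contains exactly one table.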
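Using regex is more complicated, so here is the naive approach I would suggest:
def remove_table(s):
    left_index = s.find('<table>')
    if left_index == -1:
        return s
    right_index = s.find('</table>', left_index)
    if right_index == -1:
        # Unclosed table: return the rest unchanged rather than guessing
        return s
    # Keep the text before the table, then process whatever follows '</table>'
    return s[:left_index] + remove_table(s[right_index + len('</table>'):])
The result may contain some blank lines.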
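For comparison, a regex version might look like the sketch below. It is not from the answer above: the names `TABLE_RE` and `remove_tables_regex` are made up for illustration, and it assumes tables are never nested (nested tables really need an HTML parser). Unlike the `find`-based function, it also matches opening tags that carry attributes, and it collapses the blank lines left behind.

```python
import re

# Non-greedy match from an opening <table ...> tag to the nearest </table>.
# Assumption: tables are not nested inside each other.
TABLE_RE = re.compile(r"<table\b[^>]*>.*?</table\s*>", re.IGNORECASE | re.DOTALL)


def remove_tables_regex(html: str) -> str:
    """Remove every non-nested <table>...</table> block from html."""
    without_tables = TABLE_RE.sub("", html)
    # Collapse the runs of blank lines left where the tables used to be.
    return re.sub(r"\n\s*\n", "\n", without_tables)
```

Note that the blank-line cleanup applies to the whole string, so any intentional empty lines elsewhere in the input are collapsed as well.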