Replacing table content using BeautifulSoup
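I want to parse an HTML document with Beautiful Soup. The document also contains table data, and I am doing some NLP on it.
A table cell may contain only a number, or it may contain a large amount of text. So before calling soup.get_text(), I want to change the table contents based on the following condition.
Condition: keep a cell only if it has more than two words (a number counts as one word); otherwise replace the cell content with an empty string.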
<code to change table data based on condition>
soup = BeautifulSoup(html)
text = soup.get_text()
Here is what I tried:
tables = soup.find_all('table')
for table in tables:
    table_body = table.find('tbody')
    rows = table_body.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        for ele in cols:
            if len(ele.text.split(' ')) < 3:
                ele.text = ''  # fails: text cannot be assigned to
However, ele.text is a read-only property and cannot be assigned to, so this raises an error.
Here is a simple HTML table structure:
<!DOCTYPE html>
<html>
<head>
<title>HTML Tables</title>
</head>
<body>
<table border = "1">
<tr>
<td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
<td><p><span>not kept</span></p></td>
</tr>
<tr>
<td><p><span>Row 2, Column 1, should be kept</span></p></td>
<td><p><span>Row 2, Column 2, should be kept</span></p></td>
</tr>
</table>
</body>
</html>
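Once you have found the element, use ele.string.replace_with(""). Note that ele.string is None when a tag has more than one child, so this works for cells that wrap a single string, as in your sample.
Based on your sample HTML: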
from bs4 import BeautifulSoup

html = '''<html>
<head>
<title>HTML Tables</title>
</head>
<body>
<table border = "1">
<tr>
<td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
<td><p><span>not kept</span></p></td>
</tr>
<tr>
<td><p><span>Row 2, Column 1, should be kept</span></p></td>
<td><p><span>Row 2, Column 2, should be kept</span></p></td>
</tr>
</table>
</body>
</html>'''
soup = BeautifulSoup(html, 'html.parser')
tables = soup.find_all('table')
for table in tables:
    rows = table.find_all('tr')
    for row in rows:
        cols = row.find_all('td')
        for ele in cols:
            # Blank the cell if it has fewer than three space-separated words
            if len(ele.text.split(' ')) < 3:
                ele.string.replace_with("")
print(soup)
Output:
<html>
<head>
<title>HTML Tables</title>
</head>
<body>
<table border="1">
<tr>
<td><p><span>Row 1, Column 1, This should be kept because it has more than two tokens</span></p></td>
<td><p><span></span></p></td>
</tr>
<tr>
<td><p><span>Row 2, Column 1, should be kept</span></p></td>
<td><p><span>Row 2, Column 2, should be kept</span></p></td>
</tr>
</table>
</body>
</html>
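If some cells hold more than one child (for example two <p> tags, where ele.string would be None), a variant along the following lines might be more forgiving. This is only a sketch: the helper name blank_short_cells, its min_words parameter, and the sample HTML are made up for illustration. It counts words with split(), which splits on any whitespace, and empties a cell with Tag.clear() instead of replacing a single string.

```python
from bs4 import BeautifulSoup


def blank_short_cells(soup, min_words=3):
    """Empty every <td> whose text has fewer than min_words whitespace-separated words.

    Hypothetical helper, not part of Beautiful Soup.
    """
    for cell in soup.find_all("td"):
        if len(cell.get_text().split()) < min_words:
            # clear() removes all children, so it also works when cell.string is None
            cell.clear()
    return soup


# Illustrative input: one long cell, one cell with two children, one bare number
html = """
<table>
  <tr>
    <td><p><span>Row 1, Column 1, long enough to keep</span></p></td>
    <td><p>not</p><p>kept</p></td>
    <td>42</td>
  </tr>
</table>
"""

soup = blank_short_cells(BeautifulSoup(html, "html.parser"))
# Join the remaining strings with a space and drop empty ones
text = soup.get_text(" ", strip=True)
print(text)
```

After this runs, the <td> elements are still present in the tree but the short ones have no content, so get_text() only returns the text of the cells that were kept.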