How can I remove all different script tags in BeautifulSoup?

Question

I am scraping a table from a web link and want to rebuild the table by removing all the script tags. This is the source code.

import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
table = soup.find('table')

for row in table.find_all('tr'):
    for col in row.find_all('td'):
        # remove all different script tags
        # col.replace_with('')
        # col.decompose()
        # col.extract()
        col = col.contents

How can I remove all the different script tags? Take the following cell as an example, which includes the tags a, br and td.

<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>

My expected result is:

Signal et Communication
Ingénierie Réseaux et Télécommunications

Answer 1

Try calling col.string. That will give you only the text.

Answer 2

You are asking about get_text():

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string

td = soup.find("td")
td.get_text()
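
Applied to the table loop from the question, this could look like the sketch below. The sample HTML string and the names html and cells are assumptions for illustration (in the question the markup comes from requests.get(url)); separator and strip are optional arguments of get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical sample table; in the question this comes from requests.get(url).text
html = """
<table>
  <tr>
    <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
    <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

cells = []
for row in table.find_all("tr"):
    for col in row.find_all("td"):
        # separator="\n" joins the text fragments with newlines;
        # strip=True trims each fragment and drops whitespace-only ones
        cells.append(col.get_text(separator="\n", strip=True))

print(cells[0])
```

Collecting the text into a list rather than reassigning col keeps the parsed tree untouched, which is probably what you want if the table is rebuilt afterwards.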

Note that in this case .string would return None, because td has multiple children:

If a tag contains more than one thing, then it’s not clear what .string should refer to, so .string is defined to be None

Demo:

>>> from bs4 import BeautifulSoup
>>> 
>>> soup = BeautifulSoup("""
... <td><a href="http://www.irit.fr/SC">Signal et Communication</a>
... <br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
... </td>
... """, "html.parser")
>>> 
>>> td = soup.td
>>> print(td.string)
None
>>> print(td.get_text())
Signal et Communication
Ingénierie Réseaux et Télécommunications
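
If each line is needed as a separate item rather than as one string, the .stripped_strings generator is one option. A minimal sketch, reusing the cell from the question; the names html and lines are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# The cell from the question, used here as a standalone sample
html = """<td><a href="http://www.irit.fr/SC">Signal et Communication</a>
<br/><a href="http://www.irit.fr/IRT">Ingénierie Réseaux et Télécommunications</a>
</td>"""

td = BeautifulSoup(html, "html.parser").td

# stripped_strings yields each text fragment with surrounding whitespace
# removed and skips fragments that contain only whitespace
lines = list(td.stripped_strings)
print(lines)
```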
