How can I extract HTML data with Python?
<td><img src="/images/cflags/png/id1.png" alt="Indonesia" title="Indonesia"></td>
<td></td>
<td>link.here/python.php
</td>
<td>Linux</td>
<td><img src="/images/cflags/png/id2.png" alt="Indonesia" title="Indonesia"></td>
<td></td>
<td>link2.here/python.php
</td>
<td>Linux</td>
<td><img src="/images/cflags/png/id3.png" alt="Indonesia" title="Indonesia"></td>
<td></td>
<td>link3.here/python.php
</td>
<td>Linux</td>
This is a sample of the HTML. I want to extract the links from it using Python. Can anyone help me?
You can use BeautifulSoup. If all of your links end with php, you can do this:
>>> from bs4 import BeautifulSoup
>>> text = '''<td><img src="/images/cflags/png/id1.png" alt="Indonesia" title="Indonesia"></td>
... <td></td>
... <td>link.here/python.php
... </td>
... <td>Linux</td>
... <td><img src="/images/cflags/png/id2.png" alt="Indonesia" title="Indonesia"></td>
... <td></td>
... <td>link2.here/python.php
... </td>
... <td>Linux</td>
... <td><img src="/images/cflags/png/id3.png" alt="Indonesia" title="Indonesia"></td>
... <td></td>
... <td>link3.here/python.php
... </td>
... <td>Linux</td>'''
>>> soup = BeautifulSoup(text, 'html.parser')
>>> [url.text.strip() for url in soup.find_all('td') if url.text.strip().endswith('php')]
['link.here/python.php', 'link2.here/python.php', 'link3.here/python.php']
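If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module: collect the text of every td cell, then keep the cells that end with the expected suffix. The names CellTextParser and extract_links are made up for this illustration, and the default suffix ".php" is an assumption based on the sample above, so treat this as a starting point rather than a drop-in solution.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the stripped text of every <td> cell (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.cells = []       # text of each finished <td>
        self._buffer = None   # text chunks while inside a <td>, otherwise None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a <td>.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            self.cells.append("".join(self._buffer).strip())
            self._buffer = None


def extract_links(html, suffix=".php"):
    """Return the text of every <td> cell that ends with `suffix`."""
    parser = CellTextParser()
    parser.feed(html)
    parser.close()
    return [cell for cell in parser.cells if cell.endswith(suffix)]


sample = '''<td><img src="/images/cflags/png/id1.png" alt="Indonesia" title="Indonesia"></td>
<td></td>
<td>link.here/python.php
</td>
<td>Linux</td>'''

print(extract_links(sample))  # ['link.here/python.php']
```

Like the BeautifulSoup version, this only works while the links are plain text inside td cells and share a common ending. If the real page wraps them in a tags, reading the href attribute would be a more reliable approach.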