&pound; displaying in urllib2 and Beautiful Soup

Question

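I'm trying to write a small web scraper in Python, and I think I've run into an encoding problem. I'm trying to scrape http://www.resident-music.com/tickets (specifically the table on the page). A row might look like this: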

    <tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>&pound;55.00</strong></p>
        </td>
       </tr>

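I'm essentially trying to replace &pound;55.00 with £55.00, and to clean up any other 'non-text' nasties in the same way.

I've tried a few of the different encoding approaches available with BeautifulSoup and urllib2, to no avail. I think I'm just doing it wrong.

Thanks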

Answer 1

I used requests for this, but hopefully you can do the same with urllib2. Here is the code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import requests 
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(requests.get('your_url').text)
chart = soup.findAll(name='tr')  # every table row on the page
print str(chart).replace('&pound;', unichr(163))  # replace the '&pound;' entity with '£' (U+00A3)

Now you should get the expected output!

Sample output:

...
<strong>£71.50</strong></p>
...

Anyway, you can do the parsing in many ways. The interesting part here was print str(chart).replace('&pound;', unichr(163)), which was quite challenging :)
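For readers on Python 3, where `unichr` and the `print` statement no longer exist, the same single-entity substitution might look like the sketch below. The `row` string is a made-up stand-in for one scraped cell, not real output from the site.

```python
# Hypothetical fragment standing in for one scraped table cell.
row = '<p><strong>&pound;55.00</strong></p>'

# chr(163) is U+00A3, the pound sign. Python 3's chr() covers what
# unichr() did in Python 2.
fixed = row.replace('&pound;', chr(163))

print(fixed)
```

Note that this only touches `&pound;`. Other entities in the row, such as `&nbsp;` and `&ndash;`, would be left as they are.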

Update

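If you want to unescape more than one character (or even just one), such as dashes, pound signs and so on, it would be easier and more efficient to use a parser, as in Padraic's answer. As you will also read in the comments, parsers sometimes take care of other encoding issues too.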
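As a rough illustration of the "use a parser" idea, here is a minimal sketch built on the standard library's `html.parser.HTMLParser` (Python 3.5+), which decodes character references in text by itself. The `CellText` class and the shortened `row` string are assumptions made for this example; they are not part of either answer.

```python
from html.parser import HTMLParser


class CellText(HTMLParser):
    """Collect the text of each <td>, with character references decoded."""

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) makes the
        # parser turn &pound;, &ndash;, &nbsp; etc. into real characters
        # before handle_data() is called.
        super().__init__(convert_charrefs=True)
        self.cells = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == 'td':
            self._in_cell = True
            self.cells.append('')

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False

    def handle_data(self, data):
        # Text nested inside <p>, <strong>, <sup> still belongs to the cell.
        if self._in_cell:
            self.cells[-1] += data


# Shortened version of the row from the question.
row = ('<tr><td><p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may'
       '</strong></p></td>'
       '<td><p><strong>&pound;55.00</strong></p></td></tr>')

parser = CellText()
parser.feed(row)
parser.close()

cells = [cell.strip() for cell in parser.cells]
print(cells)
```

Every entity is handled in one pass, so there is no need to list the replacements one by one.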

Answer 2

You want to unescape the HTML, which you can do with html.unescape in Python 3:

In [14]: from html import unescape

In [15]: h = """<tr>
   ....:         <td style="width:64.9%;height:11px;">
   ....:          <p><strong>the great escape 2017&nbsp; local early bird tickets, selling fast</strong></p>
   ....:         </td>
   ....:         <td style="width:13.1%;height:11px;">
   ....:          <p><strong>18<sup>th</sup>&ndash; 20<sup>th</sup> may</strong></p>
   ....:         </td>
   ....:         <td style="width:15.42%;height:11px;">
   ....:          <p><strong>various</strong></p>
   ....:         </td>
   ....:         <td style="width:6.58%;height:11px;">
   ....:          <p><strong>&pound;55.00</strong></p>
   ....:         </td>
   ....:        </tr>"""

In [16]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>

For Python 2, use:

In [6]: from HTMLParser import HTMLParser

In [7]: unescape = HTMLParser().unescape  

In [8]: print(unescape(h))
<tr>
        <td style="width:64.9%;height:11px;">
         <p><strong>the great escape 2017  local early bird tickets, selling fast</strong></p>
        </td>
        <td style="width:13.1%;height:11px;">
         <p><strong>18<sup>th</sup>– 20<sup>th</sup> may</strong></p>
        </td>
        <td style="width:15.42%;height:11px;">
         <p><strong>various</strong></p>
        </td>
        <td style="width:6.58%;height:11px;">
         <p><strong>£55.00</strong></p>
        </td>
       </tr>

You can see that both correctly unescape all of the entities, not just the pound sign.
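If the same script has to run under both Python versions, the two imports above can be combined. The sketch below is one way to do that; `clean_cell` is a hypothetical helper added for illustration, and replacing non-breaking spaces is an extra assumption about what counts as clean text.

```python
try:
    from html import unescape  # Python 3.4 and later
except ImportError:
    # Python 2 fallback: the module is named HTMLParser there.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape


def clean_cell(text):
    """Unescape entities, then turn non-breaking spaces into plain spaces."""
    return unescape(text).replace(u'\xa0', u' ').strip()


print(clean_cell(u'&pound;55.00'))
print(clean_cell(u'18th&ndash; 20th may'))
```

`&nbsp;` unescapes to U+00A0 rather than an ordinary space, which is why the helper swaps it out afterwards.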

python

encoding

urllib2

beautifulsoup
