Encode() not working in all cases

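I am using Beautiful Soup 4 to scan an html file and extract certain features. Specifically, I am using it to find soccer players' names, clubs, leagues, stats, etc. Since many players and clubs have accents in their names, I am looking for a way to print those accents instead of seeing escape codes in the output. I was able to get this to work by using: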
# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[2]
# name_tag contains tag with player's name
name_tag = player.find("span",{"class":"player-name"})
# extract just the player's name
player_name = name_tag.text
print player_name.encode('utf-8')

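This prints the correct player name: "Kaká". However, when I extract the club name with a regular expression, for example, I don't see the same result: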

regex_club = re.compile(ur'\[.*?</strong>\n\s+\|\s\n\s+(.*?)\n', re.MULTILINE)
# extract club name
player_club = re.match(regex_club, str(pos_clb_lge_tag))
print player_club.group(1).encode('utf-8')

This code prints the right club name, for example "Atl\xe9tico Madrid", but encode() fails to remove the "\xe9" and replace it with "é".

Below is the part of the html file I am applying the regex to:

<li class="list-group-item list-group-table-row player-group-item dark-hover">
<div class="content player-item font-24">
    <a class="display-block padding-0" href="/fifa-mobile/17/players/33194/jan-oblak/">
        <span class="player-rating stream-col-50 text-center">
            <span class="revision-gradient shadowed font-12 fut elite">100</span>
        </span>
        <span class="player-info">
            <img class="player-image" src="http://futhead.cursecdn.com/static/img/fm/17/players/200389_SASC.png">
            <img class="player-program" src="http://futhead.cursecdn.com/static/img/fm/17/resources/program_17_VSATTACK.png">
            <span class="player-name">Jan Oblak</span>
            <span class="player-club-league-name">
                <strong>GK</strong>
                 | 
                Atlético Madrid
                 | 
                LaLiga Santander
            </span>
        </span>

        <span class="player-right text-center hidden-xs">
            <span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">83</span><span class="hover-label">PAC</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">50</span><span class="hover-label">SHO</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">66</span><span class="hover-label">PAS</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">55</span><span class="hover-label">DRI</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">58</span><span class="hover-label">DEF</span></span><span class="player-stat stream-col-60 hidden-md hidden-sm"><span class="value">85</span><span class="hover-label">PHY</span></span><span class="player-stat stream-col-60 font-12 font-medium text-upper">35</span>
        </span>
        <span class="player-right slide hidden-sm hidden-xs" data-direction="right" data-max="-482px">
            <span class="slide-content text-upper">
                <span class="trigger icon icon-dots-three-horizontal"></span>


                <span class="player-stat stream-col-80">
                    <span class="value">+2</span>
                    <span class="hover-label">MRK</span>
                </span>


                <span class="player-stat stream-col-80">
                    <span class="value">+1</span>
                    <span class="hover-label">OVR</span>
                </span>

                <span class="player-stat stream-col-100"><span class="value">right</span><span class="hover-label">Strong Foot</span></span>
                <span class="player-stat stream-col-100"><span class="value">18<span class="icon icon-star gold margin-l-4"></span></span><span class="hover-label">Weak Foot</span></span>
            </span>
        </span>

    </a>
</div>

So basically, why does encode() not work when I use a regular expression in between? Let me know if further clarification is needed. Thanks.

I suspect you are not showing all of your code (see [mcve]), but calling str on a Unicode object is the wrong thing to do, and it should give:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 40: ordinal not in range(128)
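The error above comes from an ASCII encode of a character outside the ASCII range. As a rough illustration only, the sketch below triggers the same kind of failure explicitly. It is written for Python 3 (an assumption, since the code in this thread is Python 2), where the encode has to be requested by hand rather than happening implicitly inside str().

```python
# Python 3 sketch: an explicit ASCII encode of text containing "é" fails,
# which is what Python 2's str() attempts implicitly on a unicode object.
club = "Atl\u00e9tico Madrid"  # the same text as "Atlético Madrid"

try:
    club.encode("ascii")
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

# A codec that covers the character encodes it without complaint.
utf8_bytes = club.encode("utf-8")

print(error_message)
print(utf8_bytes)
```

The position reported in the message depends on where the accented character sits in the string, so it will differ from the one quoted above.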

I suspect you have called setdefaultencoding, which is a bad habit.

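What str() does is convert a Unicode string into a byte string containing the text of the escape codes, for example '\\n' (two characters) instead of '\n' (one character), and it does the same for non-ASCII characters.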
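To make the difference between a real character and the text of its escape code concrete, here is a small Python 3 sketch. It uses the standard unicode_escape codec to simulate the escaping, which is an assumption made for illustration and not necessarily the exact path the question's code took. The variable names are hypothetical.

```python
# Python 3 sketch: a real character versus the literal text of its escape code.
real = "Atl\u00e9tico Madrid"  # contains the single character "é"

# Simulate the escaping: "é" becomes the four characters \ x e 9.
escaped = real.encode("unicode_escape").decode("ascii")

print(len("\n"), len("\\n"))    # one character versus two characters
print(len(real), len(escaped))  # the escaped text is longer

# Encoding the escaped text as UTF-8 changes nothing visible: it holds
# only ASCII characters, so the backslash sequence stays as it is.
still_escaped = escaped.encode("utf-8")

# Undoing the escaping needs a decode step, not an encode step.
restored = escaped.encode("ascii").decode("unicode_escape")
print(restored)
```

Once the accent has been turned into escape text, encode('utf-8') has nothing left to convert, which matches the behaviour described in the question.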

If your terminal is configured correctly, you also do not need to encode the final result manually when printing.
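One way to check whether a terminal can display a given string is to try encoding the text with the encoding the output stream reports. The helper below is a hypothetical sketch in Python 3, not part of any library. It falls back to ASCII when the stream reports no encoding, which is an assumption chosen to be conservative.

```python
import sys


def terminal_can_show(text, encoding=None):
    """Return True if text can be encoded with the given or stdout encoding."""
    # Fall back to ASCII when the stream reports no encoding (e.g. some pipes).
    encoding = encoding or sys.stdout.encoding or "ascii"
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


club = "Atl\u00e9tico Madrid"
if terminal_can_show(club):
    print(club)  # print the text directly, with no manual encode
else:
    print(club.encode("ascii", "backslashreplace").decode("ascii"))
```

Passing an explicit encoding, for example terminal_can_show(club, "ascii"), makes it easy to see how a misconfigured terminal would behave.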

Here is a working example that uses BeautifulSoup to retrieve only the text to be parsed:

from bs4 import BeautifulSoup
import re

# open html page
fut_page = open('futhead1.html','r')
# read content from html page
fut_read = fut_page.read()
# html parsed page
fut_soup = BeautifulSoup(fut_read, "html.parser")
# grabs all players
players = fut_soup.findAll('li',{'class':'list-group-item list-group-table-row player-group-item dark-hover'})
player = players[0]
# name_tag contains the tag with the player's position, club and league
name_tag = player.find("span",{"class":"player-club-league-name"})
# take only the trailing text node (a Unicode string), without calling str()
pos_clb_lge_tag = name_tag.contents[-1]
regex_club = re.compile(ur'\n\s+\|\s\n\s+(.*?)\n')
# extract club name
player_club = regex_club.match(pos_clb_lge_tag)
print player_club.group(1)

Atlético Madrid
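
For readers on Python 3, the same extraction can be sketched without BeautifulSoup in the loop. The span_text value below is a hand-copied stand-in for the trailing text node of the player-club-league-name span shown in the question, so its exact whitespace is an assumption. The split on "|" is an alternative to the regex and not something the original code used.

```python
import re

# Stand-in for the text that follows <strong>GK</strong> in the span.
span_text = (
    "\n"
    "                 | \n"
    "                Atlético Madrid\n"
    "                 | \n"
    "                LaLiga Santander\n"
    "            "
)

# Same pattern as above, applied directly to text (no str() call).
regex_club = re.compile(r'\n\s+\|\s\n\s+(.*?)\n')
match = regex_club.match(span_text)
club_from_regex = match.group(1) if match else None

# Alternative: split on the pipe separator and strip the whitespace.
parts = [part.strip() for part in span_text.split("|")]
club_from_split = parts[1]
league_from_split = parts[2]

print(club_from_regex)
print(club_from_split, "/", league_from_split)
```

The split version is less sensitive to changes in indentation, while the regex version fails to match if the whitespace layout of the page changes.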