String with special characters in Python does not display correctly

Question

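I used BeautifulSoup to parse some text (city names) from a website into a list, but I ran into a problem I cannot get past. The text elements on the website contain special characters, and when I print the list, the city names show up as [u'London'], with numbers and letters in place of the special characters. How can I get rid of the leading 'u' and convert the text to the form it originally had on the website?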

Here is the code:

import urllib2
from bs4 import BeautifulSoup

address = 'https://clinicaltrials.gov/ct2/show/NCT02226120?resultsxml=true'

page = urllib2.urlopen(address)
soup = BeautifulSoup(page)
locations = soup.findAll('country', text="Hungary")
for city_tag in locations:
    site=city_tag.parent.name
    if site=="address":
        desired_city=str(city_tag.findPreviousSibling('city').contents)
        print desired_city

This is the output I get:

[u'Pecs']
[u'Baja']
[u'Balatonfured']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Cegled']
[u'Debrecen']
[u'Eger']
[u'Hodmezovasarhely']
[u'Miskolc']
[u'Nagykanizsa']
[u'Nyiregyh\xe1za']
[u'Pecs']
[u'Sopron']
[u'Szeged']
[u'Szekesfehervar']
[u'Szekszard']
[u'Zalaegerszeg']

For example, the 7th element from the end, [u'Nyiregyh\xe1za'], is not displayed correctly.

Answer 1

You converted the object you have with str() in order to print it:

    desired_city=str(city_tag.findPreviousSibling('city').contents)
    print desired_city

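You are seeing not only the 'u' prefix you asked about, but also the [] and the ''. This punctuation is part of how str() converts these types of objects to text: the [] means you have a list object, and u'' means the object inside the list is "text" (a unicode object). The \xe1 is part of the same representation: str() on a list shows the repr() of each element, and in Python 2 the repr() of a unicode object escapes non-ASCII characters. Note: Python 2 is fairly sloppy about the distinction between bytes and characters. That sloppiness confuses a lot of people, especially because code sometimes appears to work even though it is wrong, and then fails with other data or in another environment.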
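
To make the difference between the list and its element concrete, here is a small sketch that uses a hard-coded list as a stand-in for the `.contents` value; the variable names are illustrative and not from the original code. It avoids `print` so that it runs unchanged under Python 2 and Python 3. Note that under Python 3 the list text would contain neither the `u` prefix nor the `\xe1` escape.

```python
# Assumed stand-in for city_tag.findPreviousSibling('city').contents.
contents = [u'Nyiregyh\xe1za']

# str() of the whole list describes the list, so the resulting text
# includes the brackets and quotes (and, on Python 2, the u prefix and
# the \xe1 escape).
list_text = str(contents)

# Indexing gives the unicode string itself: 11 characters, where the
# 9th one is the single character U+00E1 (a with acute accent).
city = contents[0]
```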

Since you have a list containing a unicode object, what you want to print is that value itself:

    list_of_cities = city_tag.findPreviousSibling('city').contents
    desired_city = list_of_cities[0]
    print desired_city

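Note that I am assuming the list of cities contains at least one element. The sample output you show does, but it is a good idea to check for the error case as well.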
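
One possible way to guard against an empty list is a small helper like the sketch below. `first_text` and its `default` parameter are hypothetical names introduced for illustration; they are not part of BeautifulSoup.

```python
def first_text(contents, default=None):
    """Return the first item of a .contents-style list, or default if it is empty."""
    if not contents:
        # Empty list (or None): there is no city text to return.
        return default
    return contents[0]
```

With the code from the question, this could be called as `first_text(city_tag.findPreviousSibling('city').contents)`. It does not cover the case where no previous `city` tag exists at all, in which case `findPreviousSibling` returns `None` and the `.contents` access itself would fail.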

python

unicode

beautifulsoup

special-characters