How to scrape Traditional Chinese text with BeautifulSoup?

I am using BeautifulSoup to scrape Chinese text from this website.

Sometimes it works:

http://www.fashionguide.com.tw/Beauty/08/MsgL.asp?LinkTo=TopicL&TopicNum=13976&Absolute=1
Tsaio上山採藥 輕油水感全效UV防曬精華

Sometimes it does not:

http://www.fashionguide.com.tw/Beauty/08/MsgL.asp?LinkTo=TopicL&TopicNum=13996&Absolute=1
MAYBELLINE´A¤ñµY ³z¥Õ¼á²bªø®Ä¢ã¢ä¯»»æ

When I try to encode to UTF-8:

title1 = tds.find("span", attrs={"class": "style1", "itemprop": "brand"})
title2 = tds.find("span", attrs={"class": "style1", "itemprop": "name"})
print((title1.text + title2.text).encode('utf-8'))

I get:

b'MAYBELLINE\xc2\xb4A\xc2\xa4\xc3\xb1\xc2\xb5Y \xc2\xb3z\xc2\xa5\xc3\x95\xc2\xbc\xc3\xa1\xc2\xb2b\xc2\xaa\xc3\xb8\xc2\xae\xc3\x84\xc2\xa2\xc3\xa3\xc2\xa2\xc3\xa4\xc2\xaf\xc2\xbb\xc2\xbb\xc3\xa6'

How can I get the correct Chinese text?

Edit: I just switched to Python 3, so I may have made some mistakes. This is how I grab the HTML:

contentb = urllib.request.urlopen(urlb).read()
soupb = BeautifulSoup(contentb)

As you correctly noticed, the default BS parser does not work in this case. Neither does explicitly using Big5 (the charset declared in the HTML).

Instead, you should use lxml together with BeautifulSoup, and take care to initialize your soup with bytes, not unicode:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser

http://docs.python-requests.org/en/latest/api/#requests.Response.content
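Incidentally, the mangled output in your question is exactly what you get when Big5 bytes are decoded as Latin-1 somewhere along the way and then re-encoded as UTF-8. A minimal round-trip sketch of that failure mode (the sample string here is made up, not taken from the page):

```python
original = 'MAYBELLINE 媚比琳'          # made-up sample, not from the page
big5_bytes = original.encode('big5')    # what the server actually sends

# The failure mode: the Big5 bytes get decoded as if they were Latin-1...
mojibake = big5_bytes.decode('latin-1')
# ...and re-encoding that as UTF-8 produces garbage like the b'...' above.
garbage = mojibake.encode('utf-8')

# As long as no bytes were lost, the round trip can be undone:
repaired = garbage.decode('utf-8').encode('latin-1').decode('big5')
print(repaired)  # MAYBELLINE 媚比琳
```

That is only a diagnostic, though; the clean fix is to hand the raw bytes to a parser that respects the declared charset, as below.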

from bs4 import BeautifulSoup
import requests

base_url = '?'.join(['http://www.fashionguide.com.tw/Beauty/08/MsgL.asp',
                     'LinkTo=TopicL&TopicNum={topic}&Absolute=1'])
topics = [13976, 13996, ]

for t in topics:
    url = base_url.format(topic=t)
    page_content = requests.get(url).content  # returns bytes
    soup = BeautifulSoup(page_content, 'lxml')
    title1 = soup.find("span", attrs={"class": "style1", "itemprop": "brand"})
    title2 = soup.find("span", attrs={"class": "style1", "itemprop": "name"})
    print(title1.text + title2.text)
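If automatic detection ever guesses wrong, BeautifulSoup also accepts a `from_encoding` argument to force the charset. A small sketch on a made-up Big5 fragment (not the real page markup):

```python
from bs4 import BeautifulSoup

# Made-up fragment encoded as Big5, standing in for the real page bytes.
fragment = '<span class="style1" itemprop="brand">媚比琳</span>'.encode('big5')

# from_encoding overrides BeautifulSoup's own charset detection.
soup = BeautifulSoup(fragment, 'lxml', from_encoding='big5')
print(soup.find('span', attrs={'itemprop': 'brand'}).text)  # 媚比琳
```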

Here is the same solution using XPath, which I prefer :-)

from lxml import html
import requests

base_url = '?'.join(['http://www.fashionguide.com.tw/Beauty/08/MsgL.asp',
                     'LinkTo=TopicL&TopicNum={topic}&Absolute=1'])
topics = [13976, 13996, ]

xp1 = "//*[@itemprop='brand']/text()"
xp2 = "//*[@itemprop='brand']/following-sibling::span[1]/text()"

for t in topics:
    url = base_url.format(topic=t)
    page_content = requests.get(url).content
    tree = html.fromstring(page_content)
    title1 = tree.xpath(xp1)  # returns a list!
    title2 = tree.xpath(xp2)
    title = " ".join(title1 + title2)
    print(title)
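One thing to keep in mind with this version: `xpath()` always returns a list, which is simply empty when nothing matches, so the `" ".join(...)` degrades gracefully if a page is missing one of the spans. A tiny illustration on a made-up fragment:

```python
from lxml import html

# Made-up fragment with a brand span but no following sibling span.
tree = html.fromstring('<div><span itemprop="brand">MAYBELLINE</span></div>')

brand = tree.xpath("//*[@itemprop='brand']/text()")
name = tree.xpath("//*[@itemprop='brand']/following-sibling::span[1]/text()")
print(" ".join(brand + name))  # MAYBELLINE
```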