How to scrape Traditional Chinese text with BeautifulSoup?
I am using BeautifulSoup to scrape Chinese text from this site.
Sometimes it works:
http://www.fashionguide.com.tw/Beauty/08/MsgL.asp?LinkTo=TopicL&TopicNum=13976&Absolute=1
Tsaio上山採藥 輕油水感全效UV防曬精華
Sometimes it doesn't:
http://www.fashionguide.com.tw/Beauty/08/MsgL.asp?LinkTo=TopicL&TopicNum=13996&Absolute=1
MAYBELLINE´A¤ñµY ³z¥Õ¼á²bªø®Ä¢ã¢ä¯»»æ
When I try to encode it as utf-8:
title1 = tds.find("span", attrs={"class": "style1", "itemprop": "brand"})
title2 = tds.find("span", attrs={"class": "style1", "itemprop": "name"})
print((title1.text + title2.text).encode('utf-8'))
I get:
b'MAYBELLINE\xc2\xb4A\xc2\xa4\xc3\xb1\xc2\xb5Y \xc2\xb3z\xc2\xa5\xc3\x95\xc2\xbc\xc3\xa1\xc2\xb2b\xc2\xaa\xc3\xb8\xc2\xae\xc3\x84\xc2\xa2\xc3\xa3\xc2\xa2\xc3\xa4\xc2\xaf\xc2\xbb\xc2\xbb\xc3\xa6'
How can I get the correct Chinese text?
Edit:
I just switched to Python 3, so I may have made some mistakes. This is how I grab the html:
import urllib.request

contentb = urllib.request.urlopen(urlb).read()  # returns bytes
soupb = BeautifulSoup(contentb)                 # no parser or encoding specified
As you correctly noticed, the default BS parser does not work in this case. It also works if you explicitly use Big5, the charset declared in the html.
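For reference, a minimal sketch of that explicit route, reusing contentb from the edit above (big5 here is an assumption taken from the page's declared charset; from_encoding overrides BeautifulSoup's own detection):

from bs4 import BeautifulSoup

# Force the declared charset instead of relying on auto-detection (sketch):
soupb = BeautifulSoup(contentb, 'lxml', from_encoding='big5')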
But you should use lxml + BeautifulSoup to get this job done, taking care to initialize your soup with bytes rather than unicode.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
http://docs.python-requests.org/en/latest/api/#requests.Response.content
from bs4 import BeautifulSoup
import requests

base_url = '?'.join(['http://www.fashionguide.com.tw/Beauty/08/MsgL.asp',
                     'LinkTo=TopicL&TopicNum={topic}&Absolute=1'])
topics = [13976, 13996, ]

for t in topics:
    url = base_url.format(topic=t)
    page_content = requests.get(url).content  # returns bytes
    soup = BeautifulSoup(page_content, 'lxml')
    title1 = soup.find("span", attrs={"class": "style1", "itemprop": "brand"})
    title2 = soup.find("span", attrs={"class": "style1", "itemprop": "name"})
    print(title1.text + title2.text)
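As an aside, the bytes shown in the question are exactly what you get when Big5 content is decoded as Latin-1 and the resulting mojibake is re-encoded as UTF-8. Since Latin-1 maps every byte to a code point, the damage is reversible; a minimal sketch, assuming that diagnosis is correct:

# Undo a Big5-decoded-as-Latin-1 round trip (sketch):
broken = 'MAYBELLINE´A¤ñµY ³z¥Õ¼á²bªø®Ä¢ã¢ä¯»»æ'
fixed = broken.encode('latin-1').decode('big5')
print(fixed)  # readable Traditional Chinese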
Here is the same solution using xpath, which I prefer :-)
from lxml import html
import requests

base_url = '?'.join(['http://www.fashionguide.com.tw/Beauty/08/MsgL.asp',
                     'LinkTo=TopicL&TopicNum={topic}&Absolute=1'])
topics = [13976, 13996, ]

xp1 = "//*[@itemprop='brand']/text()"
xp2 = "//*[@itemprop='brand']/following-sibling::span[1]/text()"

for t in topics:
    url = base_url.format(topic=t)
    page_content = requests.get(url).content
    tree = html.fromstring(page_content)
    title1 = tree.xpath(xp1)  # returns a list!
    title2 = tree.xpath(xp2)
    title = " ".join(title1 + title2)
    print(title)
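If lxml's own charset detection ever misfires on a page like this, the encoding can also be forced through the parser; a sketch, assuming the page really is Big5 as declared:

from lxml import html

# Fallback: parse with an explicitly Big5-configured parser
parser = html.HTMLParser(encoding='big5')
tree = html.fromstring(page_content, parser=parser)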