Unicode Web Scraping

Question

I am scraping item IDs from the barney website, and I am having trouble removing unicode from the output. For example, I want to get the item ID 503777359, but the output for the item ID is [u'503777359']. I would like the output to be just 503777359. What should I do?

d3 contains: "Fairfax navy and white Glenn plaid cotton poplin dress shirt.Spread collar, single-button barrel cuffs, shoulder yoke and vertical darting at back, shirttail hem, mother-of-pearl buttonsAvailable in Navy/WhiteCottonMachine washMade in JapanOur model is 6'1"/185cm wearing size 15.5. Style # 503777359"

    d2=item.find("div",{"class":"panel-body standard-p"})
    d3=d2.text
    print d3
    p_id = re.findall(r'[0-9]{9}',d3)
    print p_id

Answer 1

Just pull the result out of the [list] into a variable, like this:

    d2=item.find("div",{"class":"panel-body standard-p"})
    d3=d2.text
    print d3
    p_id = re.findall(r'[0-9]{9}',d3)
    idICareAbout = p_id[0]
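Here is a minimal sketch of the same idea. It assumes Python 3 (where `print` is a function) and a sample string modeled on the description in the question. The helper name `first_product_id` and the added `\b` word boundaries are illustrative choices, not part of the original code. It uses `re.search` so that text without an ID gives `None` rather than an `IndexError` from `p_id[0]`.

```python
import re

# Sample text modeled on the product description quoted in the question.
d3 = "Our model is 6'1\"/185cm wearing size 15.5. Style # 503777359"


def first_product_id(text):
    """Return the first standalone 9-digit run in text, or None if absent."""
    # \b keeps the pattern from matching inside a longer run of digits.
    match = re.search(r'\b[0-9]{9}\b', text)
    return match.group(0) if match else None


product_id = first_product_id(d3)
print(product_id)  # prints the string itself, not a list
```

Printing the string, not the list that holds it, is what removes the brackets and quotes from the output.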

Of course, you could also take the same page source and look for

    <meta property="product:retailer_part_no" content="503777359" />

which gives you the ID as a single result.

Hope that helps!
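A rough sketch of the meta-tag approach follows. It assumes Python 3 and uses only the standard library's `html.parser` so it runs without extra installs. The class name `MetaContentFinder` and the inline HTML snippet are hypothetical stand-ins for the real page. With BeautifulSoup, a comparable lookup would be `soup.find("meta", attrs={"property": "product:retailer_part_no"})`.

```python
from html.parser import HTMLParser


class MetaContentFinder(HTMLParser):
    """Collect the content attribute of the first matching <meta> tag."""

    def __init__(self, property_name):
        super().__init__()
        self.property_name = property_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag == "meta" and self.content is None:
            attr_map = dict(attrs)
            if attr_map.get("property") == self.property_name:
                self.content = attr_map.get("content")


# Stand-in for the downloaded page source.
html = ('<html><head>'
        '<meta property="product:retailer_part_no" content="503777359" />'
        '</head><body></body></html>')

finder = MetaContentFinder("product:retailer_part_no")
finder.feed(html)
print(finder.content)
```

Reading the ID from a dedicated tag avoids relying on the description text containing exactly one 9-digit number.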

Answer 2

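AFAIK, if the string contains no unusual characters (i.e. code points 128 or higher), you can easily convert it to ASCII with str(). This isn't really a Unicode scraping problem, though: you are printing a list, so you see the repr of its contents. For example,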

    >>> p_id=[u'503777359']
    >>> print p_id
    [u'503777359']
    >>> p_id=[str(u'503777359')]
    >>> print p_id
    ['503777359']

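As you can see, the "u" magically disappears. It was never part of the string itself, only of how Python 2 displays a unicode string inside a list.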
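To make the list-versus-element distinction concrete, here is a small sketch, assuming Python 3. There every `str` is already Unicode and no `u` prefix is shown, but printing a list still shows brackets and quotes. The helper name `format_ids` is made up for illustration.

```python
p_id = ['503777359']


def format_ids(ids):
    """Join the matched IDs into one plain string for display."""
    return ", ".join(ids)


print(p_id)              # the list's repr, with brackets and quotes
print(p_id[0])           # only the first ID
print(format_ids(p_id))  # all IDs, comma-separated
```

Indexing or joining prints the strings themselves, which works in both Python 2 and Python 3.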


python

unicode

beautifulsoup

web-scraping