Need help with string matching in web scraping (Python)

Question

I am trying to extract some content from a web page. First, I used BeautifulSoup to extract a div named "scores", which contains several images like this:

<img class="sprite-rating_s_fill rating_s_fill s45" src="http://e2.tacdn.com/img2/x.gif" alt="4.5 of 5 stars">

I want to extract the score from this image, which in this case is "4.5". So I tried this:

pattern = re.compile('<img.*?alt="(.*?) of 5 stars">', re.S)
items = re.findall(pattern, scores)

But it doesn't work. I am new to web scraping, so can anyone help me?

Answer 1

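BeautifulSoup actually makes it very easy to extract information from tags! Assuming scores is a BeautifulSoup Tag object (which you can read about in their documentation), the rating text is in the alt attribute of the img tag, not in src. Since scores is the enclosing div, first find an img inside it and then read its alt attribute:

img = scores.find('img')
alt = img['alt']

For the example you gave, alt should be u'4.5 of 5 stars'. Now you just need to strip off ' of 5 stars':

removeIndex = alt.index(' of 5 stars')
score = alt[:removeIndex]

You will end up with '4.5' in score. (If you want to work with it as a number, you have to do score = float(score).)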
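
Because the question mentions that the div holds several images, here is a minimal sketch of the same idea applied to every img at once. It is only an illustration under some assumptions: the div is taken to have class "scores", the second image and its "3.0 of 5 stars" text are made up for the example, and extract_scores is a hypothetical helper name, not part of BeautifulSoup.

```python
import re
from bs4 import BeautifulSoup

# Sample markup standing in for the real page. The class name "scores" and
# the second <img> are assumptions made for this example.
html = """
<div class="scores">
  <img class="sprite-rating_s_fill rating_s_fill s45" src="http://e2.tacdn.com/img2/x.gif" alt="4.5 of 5 stars">
  <img class="sprite-rating_s_fill rating_s_fill s30" src="http://e2.tacdn.com/img2/x.gif" alt="3.0 of 5 stars">
</div>
"""

# Matches alt text such as "4.5 of 5 stars" and captures the number.
RATING_RE = re.compile(r"^\s*(\d+(?:\.\d+)?) of 5 stars\s*$")


def extract_scores(scores_div):
    """Return the ratings found in the alt text of every <img> in the Tag."""
    ratings = []
    # alt=True skips images that have no alt attribute at all.
    for img in scores_div.find_all("img", alt=True):
        match = RATING_RE.match(img["alt"])
        if match:  # ignore images whose alt text is not a rating
            ratings.append(float(match.group(1)))
    return ratings


soup = BeautifulSoup(html, "html.parser")
scores = soup.find("div", class_="scores")
ratings = extract_scores(scores)
print(ratings)
```

If you would rather keep the regex approach from the question, note that re.findall expects a string, so the Tag would have to be converted first with str(scores). Reading the attribute directly avoids depending on how the tag happens to be serialized.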
