How to extract the content of specific HTML tags with Scrapy or Beautiful Soup?
I'm building a toy spider for this site to get better at Scrapy. In the Scrapy shell I tried:
In [1]: for e in response.css('meta.keywords').extract():
   ...:     print(e)
Out:
<meta class="keywords" itemprop="keywords" content="abilities,choices">
<meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles">
<meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor">
<meta class="keywords" itemprop="keywords" content="be-yourself,inspirational">
<meta class="keywords" itemprop="keywords" content="adulthood,success,value">
<meta class="keywords" itemprop="keywords" content="life,love">
<meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased">
<meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt">
<meta class="keywords" itemprop="keywords" content="humor,obvious,simile">
How can I get the value of the content attribute of each meta tag, using either Beautiful Soup or Scrapy?
You can actually do this in one step by adjusting the selector:
for e in response.css('meta.keywords::attr(content)').extract():
    print(e)
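Note that ::attr() is a non-standard extension that Scrapy adds to CSS selectors; it is not part of the CSS specification, so it will not work in other CSS selector engines.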
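Since the question also asks about Beautiful Soup, here is a rough sketch of the same extraction with bs4. Because ::attr() is Scrapy-specific, the tags are selected with a plain CSS selector and the attribute is read from each tag afterwards. The html string is a hypothetical stand-in for the downloaded page (inside a Scrapy callback, response.text could presumably be passed instead), and the names keywords and tags are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page source; two of the tags from the output above.
html = """
<meta class="keywords" itemprop="keywords" content="abilities,choices">
<meta class="keywords" itemprop="keywords" content="life,love">
"""

soup = BeautifulSoup(html, "html.parser")

# select() accepts standard CSS only, so pick the tags first
# and then read the content attribute from each one.
keywords = [meta["content"] for meta in soup.select("meta.keywords")]

# Each content value is a comma-separated string; split it into single tags.
tags = [value.split(",") for value in keywords]

for value in keywords:
    print(value)
```

If some meta tags might lack the attribute, meta.get("content") returns None instead of raising KeyError.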