How to extract the content of specific HTML tags with Scrapy or Beautiful Soup?

I am building a toy crawler for this site in order to get better at Scrapy. In the Scrapy shell I tried:

In [1]: for e in response.css('meta.keywords').extract():
    ...:     print(e)

Out:

<meta class="keywords" itemprop="keywords" content="abilities,choices">
<meta class="keywords" itemprop="keywords" content="inspirational,life,live,miracle,miracles">
<meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor">
<meta class="keywords" itemprop="keywords" content="be-yourself,inspirational">
<meta class="keywords" itemprop="keywords" content="adulthood,success,value">
<meta class="keywords" itemprop="keywords" content="life,love">
<meta class="keywords" itemprop="keywords" content="edison,failure,inspirational,paraphrased">
<meta class="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt">
<meta class="keywords" itemprop="keywords" content="humor,obvious,simile">

How can I get the content attribute of each meta tag with Beautiful Soup or Scrapy?

You can actually do it in one go by adjusting the selector:

for e in response.css('meta.keywords::attr(content)').extract():
    print(e)

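Note that ::attr() is a non-standard pseudo-element that Scrapy adds on top of regular CSS selectors, so it will not work in other CSS selector engines.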
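
Since the question also asks about Beautiful Soup, here is a minimal sketch of the same extraction with bs4. It assumes the beautifulsoup4 package is installed, and the html string is a shortened stand-in for the page source, not the real site:

```python
from bs4 import BeautifulSoup

# Shortened sample markup (assumption: stands in for the real page source).
html = """
<meta class="keywords" itemprop="keywords" content="abilities,choices">
<meta class="keywords" itemprop="keywords" content="life,love">
"""

# "html.parser" is the parser bundled with the Python standard library.
soup = BeautifulSoup(html, "html.parser")

# Find every <meta class="keywords"> and read its content attribute.
contents = [meta.get("content") for meta in soup.find_all("meta", class_="keywords")]

for c in contents:
    print(c)
```

Inside a Scrapy callback, response.text could be passed to BeautifulSoup in place of the sample string.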