Response selector is breaking the content into two different values

Question

I am trying to scrape the title of an article from this page: https://onlinelibrary.wiley.com/doi/full/10.1111/pcmr.12547

In the scrapy shell, if I run response.css("h2.article-section__title::text").extract(), I get the following output:

[' Efficacy of small MC1R‐selective ',
 '‐MSH analogs as sunless tanning agents that reduce UV‐induced DNA damage\n         ',
.....

This happens because, in the HTML, the title contains an extra italic tag.

<h2 class="article-section__title section__title section1" id="pcmr12547-sec-0002-title"> Efficacy of small MC1R‐selective <i>α </i>‐MSH analogs as sunless tanning agents that reduce UV‐induced DNA damage
         </h2>

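I could work around this with Python code that joins the values until it reaches one ending in "\n". But is there a way to fix it through scrapy itself, or some other cleaner way?

Is there a way for scrapy to scrape the value together with any HTML tags inside it or, preferably, to ignore the tags but still get the text inside them?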

Answer 1

You can extract the whole HTML element with:

html_title = response.css(".article-section__title").get()

Then you can turn the result into plain text, for example with html-text:

import html_text
title = html_text.extract_text(html_title)

scrapy

python-3.x

scrapy-shell