How to eliminate certain elements when scraping?

Question

I'm not sure how to proceed here. I have an example of a page that I am trying to scrape:

http://www.yonhapnews.co.kr/sports/2015/06/05/1001000000AKR20150605128600007.HTML?template=7722

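Right now my XPath selects the div with class 'article' and then the <p> elements under it. I can always drop the first one, because it is the same boilerplate news text every time (city, Yonhap News, reporter, etc.). I am evaluating word density, so this text could skew my results :(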

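The problem arises at the end of the article. If you look toward the end, there is the reporter's email address and the date and time of publication...

On different pages of this site there are different numbers of <p> tags at the end, so I can't simply drop the last two: on some pages that still leaves unwanted text that affects my results.

How would you go about eliminating those particular <p> elements at the end? Or do I just have to clean up my data afterwards?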

Here is the code snippet that selects the path and drops the first <p> and the last two. How should I change it?

# Get all the text under the <p> children of the "article" div, then apply the regex to find every word in the Hangul syllable range
hangul_syllables = response.xpath('//*[@class="article"]/p//text()').re(u'[\uac00-\ud7af]+')

# For yonhapnews the first and the last two <p>'s are useless, everything else should be good
hangul_syllables = hangul_syllables[1:-2]

Answer 1

You can adjust your XPath expression so that it does not include the p tag with class="adrs" (the publication date):

//*[@class="article"]/p[not(contains(@class, "adrs"))]//text()

Answer 2

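Adding to alecxe's answer, you can also exclude the p that contains the email address by testing for an email address (possibly surrounded by whitespace). How you do that depends on whether you have XPath 2.0 or only 1.0. In 2.0 you could do: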

//*[@class="article"]/p[not(contains(@class, "adrs")
       or text()[matches(normalize-space(.),
                   "^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$", "i")])]//text()

The regex for email addresses is adapted from http://www.regular-expressions.info/email.html. If you prefer, you could change \.[A-Z]{2,4} to \.kr.

python

xpath

scrapy

web-scraping