In Python, how can I get Scrapy to return the contents of a hidden element?
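I'm using Scrapy in Python and want to retrieve the contents of an element that sits behind another "expand" element. When I inspect the DOM tree, the div tag and the text itself are not loaded until the parent element is clicked for the first time. After the parent has been clicked, the text can be hidden again, but at least it stays in the DOM.
The example website is here. I'm looking for the abstract text (it is not loaded until the "Abstract" link is clicked).
The command I've been experimenting with is: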
response.xpath("//div[@class='previewBox abstract hidden']").extract()
But what it returns is a bunch of empty divs, like this:
u'<div id="abs_S0740002015000179" class="previewBox abstract hidden"></div>'
If I use this instead: response.xpath("//div[@class='previewBox abstract']").extract()
then it returns nothing at all.
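You need to simulate the additional HTTP GET request that is sent when the "Abstract" link is clicked.
The idea is to extract the value of the data-url attribute of the "Abstract" link and request it.
Demo from the Scrapy shell: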
$ scrapy shell "http://www.sciencedirect.com/science?_ob=ArticleListURL&_method=list&_ArticleListID=-764831607&_sort=r&_st=13&view=c&md5=a41e9f25739feae932862575251c1e0d&searchtype=a"
In [1]: url = response.xpath("//a[@data-type='abstract']/@data-url").extract()[0]
In [2]: fetch(url)
In [3]: print("".join(response.xpath("//div[@class='articleText']//text()").extract()))
AbstractThe aim of the present study was to investigate the effect of lactic acid against Shiga toxin producing Escherichia coli (O157:H7 and non-O157 serogroups including O103, O111, O145 and O26) at different conditions. Soybean sprouts and spinach leaves inoculated with each serogroup of E. coli (∼7.00 + 1.00 log10 cfu/g) were treated with the lactic acid solutions at different concentrations (0% (control), 1.5%, 2.0%, or 2.5%) and at different temperatures (20, 40, or 50 °C) for 3 min. Results indicated that regardless of the treatment temperature, no significant reduction in the numbers of any serogroup occurred in the control group (0%) (p > 0.05). However, lactic acid at concentration of 1.5%, 2% and 2.5% was found to be effective against all organisms tested. There was no significant difference (p > 0.05) between E. coli O157:H7 and non-O157 STEC serogroups at any treatment group. The highest reductions (ca. 4.00 log10 cfu/g) of all serotypes in both produces were observed after immersing into 2.5% lactic acid at 50 °C. The results of this study showed that decontamination of fresh produces such as spinach and soybean sprout with lactic acid solutions prepared at mild temperatures (40 °C and 50 °C) might be an effective safety measure in preventing public health risks associated with these products contaminated with STEC.
Note that this fetch() call is a special way of issuing an additional request in the shell. In your Scrapy spider, you need to yield or return a scrapy.http.Request() instance and parse the result in the callback.