Scrapy XPath not returning desired results. Any ideas?

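Please take a look at this page: http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15845. As you may have guessed, I am trying to scrape all the fields on this page. Every field is extracted correctly except the Answer field. What I find odd is that the page structure for the question and the answer is almost identical (Table[1] and Table[2]); the question works perfectly, but the answer does not. Here are my XPaths:

Question: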

['q_main'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[1]/tbody/tr/td/text()').extract()

Works perfectly.

Answer:

['q_answer'] = Selector(response).xpath('//*[@id="ctl00_ContPlaceHolderMain_GridView2"]/tbody/tr/td/table[2]/tbody/tr[2]/td/text()').extract()

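returns nothing. I copied the full XPath, and it returns the expected node when verified in XPath Helper and in the console. What am I overlooking? What am I not seeing?

It looks like there is a problem with your XPath.

Here is a demo from the Scrapy shell: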

In [1]: response.xpath('//tr[td[@class="mainheaderq" and contains(font/text(), "ANSWER")]]/following-sibling::tr/td[@class="griditemq"]//text()').extract()
Out[1]: 
[u'\r\n\r\n',
 u'MINISTER OF STATE(I/C) FOR COAL, POWER AND NEW & RENEWABLE ENERGY   (SHRI PIYUSH GOYAL)\r\n\r\n ',
 u'(a) & (b): So far 29 coal mines have been auctioned under the provisions of Coal Mines (Special Provisions) \r\nAct, 2015 and the Rules made thereunder. The auction process for non-regulated sector viz. Iron and Steel, \r\nCement and Captive Power was based on forward bidding process where bidders had to submit their final price \r\noffer above the applicable floor price. In case of Power sector which is a regulated one, reverse bidding \r\nmethodology was adopted where bidders had to submit bids below the applicable ceiling price, which shall be \r\ntaken as fuel cost in determination of power tariff. In case, bid price reaches Rs. zero in reverse bidding, \r\nthe bidding is based on additional premium payable to the concerned State Government, over and  above  the  \r\nfixed  reserve  price  of  Rs. 100/-  per  tonne.\r\n\r\n',
 u'\r\nRevenue which would accrue to the coal bearing State Government concerned comprises of Upfront payment \r\nas prescribed in the tender document, Auction proceeds and Royalty on per tonne of coal production. State-wise \r\ndetails of 29 coal mines auctioned so far along-with specified end-uses and estimated revenue which would accrue \r\nto coal bearing state during the life of mine/lease period as given below:\r\n',
 u'\r\n\r\nS.No\tState\t\tSpecified End \u2013Use\t\t\tName of Coal Mine\t\tEstimated Revenueduring \r\n\t\t\t\t\t\t\t\t\t\t\t\tthe life of mine/lease \r\n\t\t\t\t\t\t\t\t\t\t\t\tperiod (Rs. In Crores)\r\n1\tChattishgarh\tNon-Regualted Sector\t\t\tChotia\t\t\t\t51596\r\n\t\t\t\t\t\t\t\tGare Palma IV-4\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-5\t\r\n\t\t\t\t\t\t\t\tGare Palma IV-7\t\r\n\t\t\t\t\t\t\t\tGare-Palma Sector-IV/8\r\n2\tJharkhand\tNon-Regualted Sector\t\t\tBrinda and Sasai\t\t49272\r\n\t\t\t\t\t\t\t\tDumri\r\n\t\t\t\t\t\t\t\tKathautia\r\n\t\t\t\t\t\t\t\tLohari\r\n\t\t\t\t\t\t\t\tMeral\r\n\t\t\t\t\t\t\t\tMoitra\r\n\t\t\tPower\t\t\t\t\tGaneshpur\r\n\t\t\t\t\t\t\t\tJitpur\r\n\t\t\t\t\t\t\t\tTokisud North\r\n3\tMadhya Pradesh\tNon-Regualted Sector\t\t\tBicharpur\t\t\t42811\r\n\t\t\t\t\t\t\t\tMandla North\r\n\t\t\t\t\t\t\t\tMandla-South\r\n\t\t\t\t\t\t\t\tSialGhoghri\r\n\t\t\tPower\t\t\t\t\tAmelia North\r\n4\tMaharashtra\tNon-Regualted Sector\t\t\tBelgaon\t\t\t\t2738\r\n\t\t\t\t\t\t\t\tMarkiMangli III\r\n\t\t\t\t\t\t\t\tNerad Malegaon\r\n5\tOdisha\t\tPower\t\t\t\t\tMandakini\t\t\t33741\r\n\t\t\t\t\t\t\t\tTalabira-I\r\n\t\t\t\t\t\t\t\tUtkal - C\r\n6\tWest Bengal\tNon-Regualted Sector\t\t\tArdhagram\t\t\t13354\r\n\t\t\tPower\t\t\t\t\tSarisatolli\r\n\t\t\t\t\t\t\t\tTrans Damodar\r\n\tTotal\t\t\t\t\t\t\t(29) coal blocks\t\t193512\r\n',
 u'\r\n\r\n\r\nCoal mine has been assigned to successful bidder as Designated Custodian in view of a court case.\r\n\r\n',
 u'\r\nIn addition, an estimated amount of Rs. 1,41,854 Crores would accrue to coal bearing States from allotment \r\nof 38 coal mines to Central and State PSU\u2019s.\r\n\r\n',
 u'Out of these 29 coal mines, 16 are operational coal mines included in Schedule-II of the Act and 13 are \r\nnon-operational included in Schedule-III of the Act. Milestones for development and production of coal \r\nfrom the auctioned coal mines have been prescribed under the Coal Mines Development and Production Agreement \r\nsigned with the Successful Bidder. \r\n\r\n ',
 u'(c) & (d): Yes, Sir. A few complaints were received regarding cartelization in bidding. It is not possible to \r\nconclusively establish the same until investigation are carried out by Competent Authority. ',
 u'\r\n\r\n\r\nThe Government has not approved the recommendation of NA for declaration of successful bidder in case of \r\n4 coal mines namely Gare Palma IV/2&3, Gare Palma IV/1 and Tara as final closing bid price was not found \r\nto be reflecting fair value.  ',
 u'\r\n\r\n\r\n']

This sometimes happens when you work with tables; for more information, you can refer to

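At least part of the difficulty is that the code you see in the console is not the source HTML that your spider receives as a response (and that the selectors operate on). In particular, it is extremely common for a <table> not to contain a <tbody>, but when your browser translates the HTML into a DOM tree, it inserts <tbody> tags. There was a time when much of a web page's layout was actually done with (insanely) nested tables. As a result, the DOM of such sites usually contains many more <tbody> elements than the HTML source does.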

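In practice, this means:

  1. Find a relatively simple XPath (or CSS selector, or ...) for the element you want to select, not the behemoth you sometimes get from the developer tools.
  2. Including /tbody in your XPath is usually not a good idea (unless it carries attributes, which indicates that the tag is present in the source HTML).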
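The two points above can be illustrated with a minimal sketch. It makes several assumptions: the markup is a hypothetical, simplified stand-in for the page (the `grid` id and the cell texts are invented), and it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, instead of Scrapy so that it runs without extra dependencies. The same reasoning should carry over to paths passed to `response.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw page source. Note that, as in much
# real-world HTML, the tables contain no <tbody> elements.
RAW_SOURCE = """
<html><body>
<table id="grid">
  <tr><td>
    <table><tr><td class="griditemq">question text</td></tr></table>
    <table>
      <tr><td class="mainheaderq">ANSWER</td></tr>
      <tr><td class="griditemq">answer text</td></tr>
    </table>
  </td></tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_SOURCE.strip())

# Path in the style copied from browser developer tools: it contains the
# /tbody steps that the browser added to its DOM, so it matches nothing here.
with_tbody = root.findall(
    ".//table[@id='grid']/tbody/tr/td/table[2]/tbody/tr[2]/td"
)

# The same path with the /tbody steps removed matches the raw source.
without_tbody = root.findall(".//table[@id='grid']/tr/td/table[2]/tr[2]/td")

# A short, attribute-based path does not depend on the nesting at all.
by_class = root.findall(".//td[@class='griditemq']")

print(len(with_tbody))
print([td.text for td in without_tbody])
print([td.text for td in by_class])
```

The attribute-based path is also the least fragile of the three: it keeps working if the site adds or removes a level of table nesting.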

For the site in question,

 response.xpath('//td[@class="griditemq"]').extract()

returns a list whose first element is the question and whose second element is the answer.
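If you want each of those two cells as a single cleaned-up string rather than raw markup, one possible approach is to join all the text nodes inside each cell and collapse the whitespace. The sketch below is only illustrative: `SAMPLE` is invented markup, `griditem_texts` is a hypothetical helper name, and it again uses the standard library's `ElementTree` rather than Scrapy. In a spider, the rough equivalent would be to loop over `response.xpath('//td[@class="griditemq"]')` and call `.xpath('.//text()').extract()` on each selector.

```python
import xml.etree.ElementTree as ET

# Invented markup: two "griditemq" cells with nested tags inside them.
SAMPLE = """
<table>
  <tr><td class="griditemq">Question <b>part one</b> part two</td></tr>
  <tr><td class="mainheaderq"><font>ANSWER</font></td></tr>
  <tr><td class="griditemq">first line <br/> second line</td></tr>
</table>
"""


def griditem_texts(markup):
    """Return the whitespace-normalised text of every td.griditemq cell."""
    root = ET.fromstring(markup.strip())
    cells = root.findall(".//td[@class='griditemq']")
    # itertext() walks all descendant text, much like .//text() in XPath;
    # split()/join() collapses runs of spaces, tabs and newlines.
    return [" ".join("".join(td.itertext()).split()) for td in cells]


question, answer = griditem_texts(SAMPLE)
print(question)
print(answer)
```

Unpacking into `question, answer` assumes the page has exactly two such cells, as described above; a page with a different layout would need a more defensive check.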