Python, Scrapy - Extract href with xpath or css

Here is the page source (Google search results, Chrome):

<div class="yuRUbf">
<a href="https://www.apple.com/my/iphone/compare/" data-ved="2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQFjALegQIAxAD" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://www.apple.com/my/iphone/compare/&amp;ved=2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQFjALegQIAxAD">
    <br>
        <h3 class="LC20lb DKV0Md">iPhone - Compare Models - Apple (MY)</h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https: //www.apple.com<span class="dyjrff qzEoUe"> › iphone › compare</span></cite></div></a><div class="B6fmyf"><div class="TbwUpd"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https://www.apple.com<span class="dyjrff qzEoUe"> › iphone › compare</span></cite></div><div class="eFM0qc"><span><div jscontroller="hiU8Ie" class="action-menu"><a class="GHDvEf" href="#" aria-label="Result options" aria-expanded="false" aria-haspopup="true" role="button" jsaction="PZcoEd;keydown:wU6FVd;keypress:uWmNaf" data-ved="2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQ7B0wC3oECAMQBg"><span class="gTl8xb"></span></a><ol class="action-menu-panel zsYMMe" role="menu" tabindex="-1" jsaction="keydown:Xiq7wd;mouseover:pKPowd;mouseout:O9bKS" data-ved="2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQqR8wC3oECAMQBw"><li class="action-menu-item OhScic zsYMMe" role="menuitem"><a class="fl" href="https://webcache.googleusercontent.com/search?q=cache:6zhDHqY_aM4J:https://www.apple.com/my/iphone/compare/+&amp;cd=12&amp;hl=en&amp;ct=clnk&amp;gl=kr" ping="/url?sa=t&amp;source=web&amp;rct=j&amp;url=https://webcache.googleusercontent.com/search%3Fq%3Dcache:6zhDHqY_aM4J:https://www.apple.com/my/iphone/compare/%2B%26cd%3D12%26hl%3Den%26ct%3Dc

Here is the parse function. I use xpath to extract the items I want (title, link):

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()

        items = []

        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            item['link'] = links[idx].lstrip("/url?q=")
            print('titles', titles)
            print('links', links)

            items.append(item)
            df = pd.DataFrame(items, columns=['title', 'link'])
            writer = pd.ExcelWriter('test6.xlsx', engine='xlsxwriter')
            df.to_excel(writer, sheet_name='test6.xlsx')
            writer.save()
        return items

Output

'link': 'https://www.apple.com/my/iphone/compare/&sa=U&ved=2ahUKEwjB74yUrsHxAhXInGoFHeKADSAQFjAAegQIBxAB&usg=AOvVaw1Wgg_RVEQfHS30tbhmlwzv',

'title': 'iPhone - Compare Models - Apple (MY)'

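..skipped..

Problem

The link contains an extra string that I don't need. In fact, if I open it in Chrome, the page does not load properly.

Working link: https://www.apple.com/my/iphone/compare/

Question

How can I extract the "working link" using xpath or css?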

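One solution is to post-process the link: split it on the parameter separator '&' and keep only the part before the first one, which discards the trailing parameters.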

link=link.split('&')[0]
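
To see what the split does, here is a minimal sketch applied to the link from the output in the question. The variable names `raw_link` and `clean_link` are only illustrative. It assumes the target URL itself contains no '&'; if it did, the split would cut the URL short.

```python
# Link as it appears in the scraped output, i.e. after the "/url?q=" prefix is gone
raw_link = ("https://www.apple.com/my/iphone/compare/"
            "&sa=U&ved=2ahUKEwjB74yUrsHxAhXInGoFHeKADSAQFjAAegQIBxAB"
            "&usg=AOvVaw1Wgg_RVEQfHS30tbhmlwzv")

# split('&') returns the pieces between separators; the first piece is the target URL
clean_link = raw_link.split('&')[0]
print(clean_link)  # https://www.apple.com/my/iphone/compare/
```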

So your parse function would look like this:

def parse(self, response):
    titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
    links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()

    items = []

    for idx in range(len(titles)):
        item = GoogleScraperItem()
        item['title'] = titles[idx]
        # Note: lstrip removes a set of leading characters, not a literal prefix
        item['link'] = links[idx].lstrip("/url?q=")
        # Inserted here: keep only the part before the first '&'
        item['link'] = item['link'].split('&')[0]
        items.append(item)

    print('titles', titles)
    print('links', links)

    # Write the spreadsheet once, after all items have been collected
    df = pd.DataFrame(items, columns=['title', 'link'])
    with pd.ExcelWriter('test6.xlsx', engine='xlsxwriter') as writer:
        df.to_excel(writer, sheet_name='test6.xlsx')
    return items
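
As an alternative to splitting on '&', the sketch below parses the raw href with the standard library's `urllib.parse` and reads its `q` parameter. It assumes the raw href has the form `/url?q=<target>&sa=...`, which is what the output in the question suggests. The helper name `extract_target` is hypothetical. This approach also avoids `str.lstrip`, which strips any leading characters from the given set rather than a literal prefix, and `parse_qs` percent-decodes the value as well.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(href):
    """Return the value of the 'q' query parameter, or href unchanged if it has none."""
    query = urlsplit(href).query
    params = parse_qs(query)
    # parse_qs maps each parameter name to a list of values
    return params.get('q', [href])[0]


# Raw href in the assumed "/url?q=..." form
href = ("/url?q=https://www.apple.com/my/iphone/compare/"
        "&sa=U&ved=2ahUKEwjB74yUrsHxAhXInGoFHeKADSAQFjAAegQIBxAB"
        "&usg=AOvVaw1Wgg_RVEQfHS30tbhmlwzv")
print(extract_target(href))  # https://www.apple.com/my/iphone/compare/
```

Inside the loop this would be used as `item['link'] = extract_target(links[idx])`, replacing both the `lstrip` and the `split` lines.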