Python, Scrapy - Extract href with xpath or css
This is the page source (Google search results, Chrome):
<div class="yuRUbf">
<a href="https://www.apple.com/my/iphone/compare/" data-ved="2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQFjALegQIAxAD" ping="/url?sa=t&source=web&rct=j&url=https://www.apple.com/my/iphone/compare/&ved=2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQFjALegQIAxAD">
<br>
<h3 class="LC20lb DKV0Md">iPhone - Compare Models - Apple (MY)</h3><div class="TbwUpd NJjxre"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https: //www.apple.com<span class="dyjrff qzEoUe"> › iphone › compare</span></cite></div></a><div class="B6fmyf"><div class="TbwUpd"><cite class="iUh30 Zu0yb qLRx3b tjvcx">https://www.apple.com<span class="dyjrff qzEoUe"> › iphone › compare</span></cite></div><div class="eFM0qc"><span><div jscontroller="hiU8Ie" class="action-menu"><a class="GHDvEf" href="#" aria-label="Result options" aria-expanded="false" aria-haspopup="true" role="button" jsaction="PZcoEd;keydown:wU6FVd;keypress:uWmNaf" data-ved="2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQ7B0wC3oECAMQBg"><span class="gTl8xb"></span></a><ol class="action-menu-panel zsYMMe" role="menu" tabindex="-1" jsaction="keydown:Xiq7wd;mouseover:pKPowd;mouseout:O9bKS" data-ved="2ahUKEwitnOWgoMHxAhUdxIsBHVDpCmIQqR8wC3oECAMQBw"><li class="action-menu-item OhScic zsYMMe" role="menuitem"><a class="fl" href="https://webcache.googleusercontent.com/search?q=cache:6zhDHqY_aM4J:https://www.apple.com/my/iphone/compare/+&cd=12&hl=en&ct=clnk&gl=kr" ping="/url?sa=t&source=web&rct=j&url=https://webcache.googleusercontent.com/search%3Fq%3Dcache:6zhDHqY_aM4J:https://www.apple.com/my/iphone/compare/%2B%26cd%3D12%26hl%3Den%26ct%3Dc
This is the parse function. I use xpath to extract the items I want (title, link):
def parse(self, response):
    titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
    links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
    items = []
    for idx in range(len(titles)):
        item = GoogleScraperItem()
        item['title'] = titles[idx]
        item['link'] = links[idx].lstrip("/url?q=")
        print('titles', titles)
        print('links', links)
        items.append(item)
    df = pd.DataFrame(items, columns=['title', 'link'])
    writer = pd.ExcelWriter('test6.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name='test6.xlsx')
    writer.save()
    return items
Output
'title': 'iPhone - Compare Models - Apple (MY)'
..skipped..
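Problem
The extracted link string is not the one I want. In fact, if I open it in Chrome, the page does not load properly.
Working link: https://www.apple.com/my/iphone/compare/
Question
How can I extract the "Working link" using xpath or css?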
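One solution is to post-process the link.
Drop the trailing parameters by splitting on the parameter separator '&' and keeping the first part:
link = link.split('&')[0]
So you would want something like: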
def parse(self, response):
    titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
    links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
    items = []
    for idx in range(len(titles)):
        item = GoogleScraperItem()
        item['title'] = titles[idx]
        # Note: lstrip() removes any of these leading characters, not the literal prefix
        item['link'] = links[idx].lstrip("/url?q=")
        # Inserted here: keep only the part before the first '&'
        item['link'] = item['link'].split('&')[0]
        print('titles', titles)
        print('links', links)
        items.append(item)
    df = pd.DataFrame(items, columns=['title', 'link'])
    writer = pd.ExcelWriter('test6.xlsx', engine='xlsxwriter')
    df.to_excel(writer, sheet_name='test6.xlsx')
    # close() writes the file; save() was removed in pandas 2.0
    writer.close()
    return items
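As a possible alternative to chaining lstrip() and split('&'), the link could be cleaned by parsing the query string instead. This is only a sketch: the helper name clean_google_link is hypothetical, and the sample href (including its sa and ved parameters) is an assumed shape for a Google redirect link, since the question does not show the raw value. It relies only on the standard library's urllib.parse.

```python
from urllib.parse import urlsplit, parse_qs


def clean_google_link(href):
    """Return the target of a '/url?q=...' redirect href, or href unchanged.

    Hypothetical helper: assumes redirect links have the path '/url'
    and carry the real destination in the 'q' query parameter.
    """
    parts = urlsplit(href)
    if parts.path == "/url":
        # parse_qs maps each parameter name to a list of decoded values
        target = parse_qs(parts.query).get("q")
        if target:
            return target[0]
    # Not a redirect link (or no 'q' parameter): leave it as it is
    return href


# Assumed example input; the real hrefs may carry different parameters
sample = "/url?q=https://www.apple.com/my/iphone/compare/&sa=U&ved=abc"
print(clean_google_link(sample))
```

Inside the loop this would be used as item['link'] = clean_google_link(links[idx]). Unlike lstrip("/url?q="), which strips any leading characters from that set rather than the literal prefix, this only touches hrefs that actually look like redirect links.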