The pages at the crawled result links do not open
Here is my code for scraping Google search results.
import scrapy
import pandas as pd

from ..items import GoogleScraperItem  # defined in the project's items.py


class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            # tries to strip the "/url?q=" prefix from the href
            item['link'] = links[idx].lstrip("/url?q=")
            items.append(item)
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.save()
        return items
I get nine results (title/link pairs) from this search:
https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0
But when I open the Excel file (test1.xlsx), none of the links open properly.
I added the following to settings.py:
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36"
ROBOTSTXT_OBEY = False
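(Side note: if you would rather not change the project-wide settings.py, Scrapy also accepts these two overrides on the spider itself via its custom_settings class attribute; a minimal sketch, mirroring the values above:)

class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    # per-spider overrides, applied only when this spider runs
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36',
        'ROBOTSTXT_OBEY': False,
    }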
If you look closely at the extracted URLs, they all carry sa, ved, and usg query parameters. These are clearly not part of the target site's URL; they are Google search result query parameters. To get only the target URL, parse the extracted URL with the urllib library and pull out just the q query parameter:
from urllib.parse import urlparse, parse_qs

parsed_url = urlparse(url)                  # url is one extracted href
query_params = parse_qs(parsed_url.query)   # split the query string into a dict
target_url = query_params["q"][0]           # the real destination URL
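For example, applied to a typical extracted href (the value below is made up for illustration, shaped like Google's /url redirect links):

from urllib.parse import urlparse, parse_qs

# hypothetical href as returned by the @href XPath above
href = '/url?q=https://www.apple.com/iphone-12/&sa=U&ved=2ahUKE...&usg=AOvVaw...'
parsed_url = urlparse(href)
query_params = parse_qs(parsed_url.query)
print(query_params["q"][0])  # https://www.apple.com/iphone-12/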
Full working code:
import scrapy
import pandas as pd
from urllib.parse import urlparse, parse_qs

from ..items import GoogleScraperItem  # defined in the project's items.py


class GoogleBotsSpider(scrapy.Spider):
    name = 'GoogleScrapyBot'
    allowed_domains = ['google.com']
    start_urls = [
        'https://www.google.com/search?q=apple+"iphone"+intext:iphone12&hl=en&rlz=&start=0']

    def parse(self, response):
        titles = response.xpath('//*[@id="main"]/div/div/div/a/h3/div//text()').extract()
        links = response.xpath('//*[@id="main"]/div/div/div/a/@href').extract()
        items = []
        for idx in range(len(titles)):
            item = GoogleScraperItem()
            item['title'] = titles[idx]
            # Parse the Google redirect URL and keep only the real target
            # stored in its "q" query parameter.
            parsed_url = urlparse(links[idx])
            query_params = parse_qs(parsed_url.query)
            item['link'] = query_params["q"][0]
            items.append(item)
        df = pd.DataFrame(items, columns=['title', 'link'])
        writer = pd.ExcelWriter('test1.xlsx', engine='xlsxwriter')
        df.to_excel(writer, sheet_name='test1.xlsx')
        writer.save()  # on recent pandas, use writer.close() instead
        return items
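The code above assumes GoogleScraperItem is a plain two-field item; a minimal sketch of what the project's items.py presumably contains (not shown in the question):

import scrapy

class GoogleScraperItem(scrapy.Item):
    # the two fields the spider populates
    title = scrapy.Field()
    link = scrapy.Field()

Run the spider as usual with scrapy crawl GoogleScrapyBot; the links written to test1.xlsx should then open directly.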