Small python 3 script to fetch urls from given website

I would like to get certain links from a given website with a script written in Python 3. I have tried to write it myself but failed (as a beginner).

I want the script to do the following:

  1. Ask me for the url (e.g. https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210).
  2. Ask me for the keyword(s) (case-insensitive, but space-sensitive!), e.g. "matrimonios 2000", to pick the corresponding links on the given website.
  3. Get all urls whose link names contain "matrimonios 2000" (in this example that would be 27 urls, named "Matrimonios 2000 vol 1" up to "Matrimonios 2000 vol 14").
  4. Save the corresponding urls line by line in a file named "urls.txt" in the same folder the script is run from.

This is my code so far:

#!/usr/bin/env python3

from selenium import webdriver

url = input('Please, enter url: ')
keyword = input('Type keyword(s): ').lower()  # lowercase for case-insensitive matching

driver = webdriver.Firefox()
driver.get(url)

# Collect every link on the page, then keep those whose text contains the keyword
links = driver.find_elements_by_tag_name('a')

with open('urls.txt', 'w') as f:
    for link in links:
        if keyword in link.text.lower():
            print(link.get_attribute('href'))
            f.write(link.get_attribute('href') + '\n')

driver.quit()

The general answer would be something like this:

import requests
from bs4 import BeautifulSoup

url = 'https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210'
keyword = 'matrimonios 2000'

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('a'):
    text = link.getText().lower()
    if keyword in text:
        print(link['href'])

This will list all the URLs from the links in a plain HTML file, matching case-insensitively but space-sensitively.
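As a minimal sketch of that matching rule (the titles here are made up for illustration):

```python
# "Case-insensitive but space-sensitive": lowercase the candidate text,
# then do a plain substring test against the (already lowercase) keyword.
keyword = 'matrimonios 2000'

titles = ['Matrimonios 2000 vol 1', 'Matrimonios  2000', 'Bautismos 1999']
matches = [t for t in titles if keyword in t.lower()]
# Only 'Matrimonios 2000 vol 1' matches: the second title has a double
# space, so the substring test rejects it even though the words agree.
```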

However, if you try to parse the site you listed, you will hit a problem: they use AJAX to load the actual content. The url you linked is not actually the data you are looking for. The page just makes a POST request to https://familysearch.org/search/filmdatainfo with the payload:

{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}

which returns a JSON document you can parse. They seem to be trying to prevent you from doing this, so it is easiest to use Chrome's "Copy as cURL" feature to get this:

curl 'https://familysearch.org/search/filmdatainfo' -H 'Origin: https://familysearch.org' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36' -H 'Content-Type: application/json' -H 'accept: application/json' -H 'Referer: https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210' -H 'Cookie: fssessionid=USYS45D0C1B6E2A42A66B9E4C9F1D0935D2F_idses-prod05.a.fsglobal.net; fs_experiments=u%3D-anon-%2Ca%3Dshared-ui%2Cs%3D23d64fb841c59b75c0737db6b5dd47d0%2Cv%3D11111011110000000000000000000000000000000000000000000000000001011000%2Cb%3D49%26a%3Dsearch%2Cs%3D47d3688c3fc1adc06dc151194bb6e298%2Cv%3D110000001011001110100%2Cb%3D50; fs-tf=1' -H 'Connection: keep-alive' --data-binary '{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}' --compressed
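For reference, the same POST can be sketched in Python with requests. The fssessionid cookie from the cURL command above is session-specific, so `fetch_film_data` (a hypothetical helper name) takes your own cookie value as a parameter:

```python
import requests

# Payload copied verbatim from the request the page makes.
payload = {
    "type": "browse-data",
    "args": {
        "waypointURL": "/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210",
        "state": {
            "owc": "Q69L-N6T:116559001,116559002,116559003?cc=1601210",
            "imageOrFilmUrl": "/search/image/index",
            "viewMode": "i",
            "selectedImageIndex": -1,
            "openWaypointContext": "/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210",
        },
        "locale": "en",
    },
}

def fetch_film_data(session_cookie):
    """POST the payload and return the parsed JSON response."""
    resp = requests.post(
        'https://familysearch.org/search/filmdatainfo',
        json=payload,  # sends Content-Type: application/json automatically
        headers={'accept': 'application/json'},
        cookies={'fssessionid': session_cookie},
    )
    resp.raise_for_status()
    return resp.json()
```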

You can pipe that to a file and then load it:

import json
with open('data.json') as f:
  x = json.load(f)

x will be a dict with a key containers, which is a list of dicts containing all the urls and titles, each looking like this:

{"url":"https://www.familysearch.org/recapi/waypoints/Q69G-SJC:116559001,116559002,116559003,122762601?cc=1601210","title":"Matrimonios 1879-1888"}

which you can loop over at your leisure.
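Tying this back to the original goal, the final filtering and writing step could look like this; `x` is stubbed here with two made-up entries standing in for the parsed JSON document:

```python
keyword = 'matrimonios 2000'

# Stand-in for the parsed JSON; the real x comes from json.load as above,
# and the waypoint ids here are invented for illustration.
x = {"containers": [
    {"url": "https://www.familysearch.org/recapi/waypoints/AAAA?cc=1601210",
     "title": "Matrimonios 2000 vol 1"},
    {"url": "https://www.familysearch.org/recapi/waypoints/BBBB?cc=1601210",
     "title": "Matrimonios 1879-1888"},
]}

# Same case-insensitive, space-sensitive match as before.
matching = [c["url"] for c in x["containers"] if keyword in c["title"].lower()]

# One url per line, as the question asked.
with open('urls.txt', 'w') as f:
    f.write('\n'.join(matching) + '\n')
```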