Small python 3 script to fetch urls from given website

I would like to get certain links from a given website with a script written in Python 3. I have tried to write it myself but failed (as a beginner).

I want the script to do the following:

  1. Ask me for the url (e.g. https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210).
  2. Ask me for the keyword(s) (case-insensitive, but space-sensitive!), e.g. "matrimonios 2000", to pick the corresponding links on the given website.
  3. Get all urls whose link names contain "matrimonios 2000" (in this example that would be 27 urls, named "Matrimonios 2000 vol 1" up to "Matrimonios 2000 vol 14").
  4. Save the corresponding urls line by line in a file named "urls.txt" in the same folder the script is run from.

This is my code so far:

#!/usr/bin/env python3

from selenium import webdriver

url = input('Please, enter url: ')
keyword = input('Type keyword(s): ').lower()  # lowercase for case-insensitive matching

driver = webdriver.Firefox()
driver.get(url)

# Collect every link on the page, then keep those whose text contains the keyword
links = driver.find_elements_by_tag_name('a')

with open('urls.txt', 'w') as f:
    for link in links:
        if keyword in link.text.lower():
            print(link.get_attribute('href'))
            f.write(link.get_attribute('href') + '\n')

driver.quit()

The general answer would be something like this:

import requests
from bs4 import BeautifulSoup

url = 'https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210'
keyword = 'matrimonios 2000'

html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
for link in soup.select('a'):
    text = link.getText().lower()
    if keyword in text:
        print(link['href'])

This will list all the URLs from the links in a plain HTML file, matching case-insensitively but space-sensitively.
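As a minimal sketch of that matching rule (the titles here are made up for illustration):

```python
# "Case-insensitive but space-sensitive": lowercase the candidate text,
# then do a plain substring test against the (already lowercase) keyword.
keyword = 'matrimonios 2000'

titles = ['Matrimonios 2000 vol 1', 'Matrimonios  2000', 'Bautismos 1999']
matches = [t for t in titles if keyword in t.lower()]
# Only 'Matrimonios 2000 vol 1' matches: the second title has a double
# space, so the substring test rejects it even though the words agree.
```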

However, if you try to parse the site you listed, you will hit a problem: they use AJAX to load the actual content. The url you linked is not actually the data you are looking for. The page just makes a POST request to https://familysearch.org/search/filmdatainfo with the payload:

{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}

which returns a JSON document you can parse. They seem to be trying to prevent you from doing this, so it is easiest to use Chrome's "Copy as cURL" feature to get this:

curl 'https://familysearch.org/search/filmdatainfo' -H 'Origin: https://familysearch.org' -H 'Accept-Encoding: gzip, deflate, br' -H 'Accept-Language: en-US,en;q=0.8' -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.96 Safari/537.36' -H 'Content-Type: application/json' -H 'accept: application/json' -H 'Referer: https://familysearch.org/search/image/index?owc=Q69L-N6T%3A116559001%2C116559002%2C116559003%3Fcc%3D1601210' -H 'Cookie: fssessionid=USYS45D0C1B6E2A42A66B9E4C9F1D0935D2F_idses-prod05.a.fsglobal.net; fs_experiments=u%3D-anon-%2Ca%3Dshared-ui%2Cs%3D23d64fb841c59b75c0737db6b5dd47d0%2Cv%3D11111011110000000000000000000000000000000000000000000000000001011000%2Cb%3D49%26a%3Dsearch%2Cs%3D47d3688c3fc1adc06dc151194bb6e298%2Cv%3D110000001011001110100%2Cb%3D50; fs-tf=1' -H 'Connection: keep-alive' --data-binary '{"type":"browse-data","args":{"waypointURL":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210","state":{"owc":"Q69L-N6T:116559001,116559002,116559003?cc=1601210","imageOrFilmUrl":"/search/image/index","viewMode":"i","selectedImageIndex":-1,"openWaypointContext":"/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210"},"locale":"en"}}' --compressed
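For reference, the same POST can be sketched in Python with requests. The fssessionid cookie from the cURL command above is session-specific, so `fetch_film_data` (a hypothetical helper name) takes your own cookie value as a parameter:

```python
import requests

# Payload copied verbatim from the request the page makes.
payload = {
    "type": "browse-data",
    "args": {
        "waypointURL": "/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210",
        "state": {
            "owc": "Q69L-N6T:116559001,116559002,116559003?cc=1601210",
            "imageOrFilmUrl": "/search/image/index",
            "viewMode": "i",
            "selectedImageIndex": -1,
            "openWaypointContext": "/recapi/waypoints/Q69L-N6T:116559001,116559002,116559003?cc=1601210",
        },
        "locale": "en",
    },
}

def fetch_film_data(session_cookie):
    """POST the payload and return the parsed JSON response."""
    resp = requests.post(
        'https://familysearch.org/search/filmdatainfo',
        json=payload,  # sends Content-Type: application/json automatically
        headers={'accept': 'application/json'},
        cookies={'fssessionid': session_cookie},
    )
    resp.raise_for_status()
    return resp.json()
```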

You can pipe that to a file and then load it:

import json
with open('data.json') as f:
  x = json.load(f)

x will be a dict with a key containers, which is a list of dicts containing all the urls and titles, each looking like this:

{"url":"https://www.familysearch.org/recapi/waypoints/Q69G-SJC:116559001,116559002,116559003,122762601?cc=1601210","title":"Matrimonios 1879-1888"}

which you can loop over at your leisure.
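Tying this back to the original goal, the final filtering and writing step could look like this; `x` is stubbed here with two made-up entries standing in for the parsed JSON document:

```python
keyword = 'matrimonios 2000'

# Stand-in for the parsed JSON; the real x comes from json.load as above,
# and the waypoint ids here are invented for illustration.
x = {"containers": [
    {"url": "https://www.familysearch.org/recapi/waypoints/AAAA?cc=1601210",
     "title": "Matrimonios 2000 vol 1"},
    {"url": "https://www.familysearch.org/recapi/waypoints/BBBB?cc=1601210",
     "title": "Matrimonios 1879-1888"},
]}

# Same case-insensitive, space-sensitive match as before.
matching = [c["url"] for c in x["containers"] if keyword in c["title"].lower()]

# One url per line, as the question asked.
with open('urls.txt', 'w') as f:
    f.write('\n'.join(matching) + '\n')
```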