How to scrape HTML using Python for NOWTV available movies
I'm building a research dataset that, over time, will give me the names of the movies available on NOWTV.
These will come from the URL https://www.nowtv.com/stream/all-movies.
The output should be one entry per available movie.
I'm not sure where to start, but I'd like to use Python and Beautiful Soup. Any help would be great. Thanks.
Code so far:
from bs4 import BeautifulSoup
import urllib.request  # urllib2 is Python 2 only; urllib.request is the Python 3 equivalent

url = "https://www.nowtv.com/stream/all-movies"
data = urllib.request.urlopen(url).read()
I'm not sure what your expected output is. Do you mean something like this?
from bs4 import BeautifulSoup
import requests

link = "https://www.nowtv.com/stream/all-movies"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")

# Each movie card container holds a title and an availability date
for dd in page.find_all("div", {"class": "ib-card-info-container"}):
    title = dd.find(class_="ib-card-title ib-colour-v1_white").text.strip()
    date = dd.find(class_="ib-card-availability-container ib-colour-20Grey").text.strip()
    print(title + " --> " + date)
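Since the goal is a dataset built up over time, one option is to append each scraped (title, date) pair to a CSV stamped with the scrape date. This is a minimal sketch; the file name `nowtv_movies.csv` is my own choice, and the sample rows stand in for real scraped output:

```python
import csv
from datetime import date

# Stand-in for the (title, availability) pairs produced by the scraping loop above
rows = [("Movie A", "Available until 1 Jan"), ("Movie B", "New this week")]

with open("nowtv_movies.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for title, availability in rows:
        # Stamp each row so snapshots taken on different days can be compared
        writer.writerow([date.today().isoformat(), title, availability])
```

Re-running the script on later days keeps appending, so the file becomes a longitudinal record of availability.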
You can mimic what the page itself does with its paginated results (https://www.nowtv.com/stream/all-movies/page/1) and extract the movies from a script tag on each page. Although the code below could use some refactoring, it shows how to get the total movie count, calculate the number of movies per page, and use a Session for efficiency when making the requests that fetch all the movies. The result is 1425 movies.
import requests
import re
import json
import math
import pandas as pd

titles = []
links = []
base = 'https://www.nowtv.com'
headers = {'User-Agent': 'Mozilla/5.0'}
# The movie data lives in a JSON blob assigned to `var propStore` in a script tag
prop_re = re.compile(r"var propStore = (.*);")

with requests.Session() as s:
    s.headers.update(headers)
    res = s.get('https://www.nowtv.com/stream/all-movies/page/1')
    data = json.loads(prop_re.findall(res.text)[0])
    first_section = data[next(iter(data))]
    movies_section = first_section['props']['data']['list']
    movies_per_page = len(movies_section)
    total_movies = int(first_section['props']['data']['count'])
    pages = math.ceil(total_movies / movies_per_page)
    for movie in movies_section:
        titles.append(movie['title'])
        links.append(base + movie['slug'])
    # The remaining pages follow the same /page/<n> pattern
    for page in range(2, pages + 1):
        res = s.get('https://www.nowtv.com/stream/all-movies/page/{}'.format(page))
        data = json.loads(prop_re.findall(res.text)[0])
        first_section = data[next(iter(data))]
        movies_section = first_section['props']['data']['list']
        for movie in movies_section:
            titles.append(movie['title'])
            links.append(base + movie['slug'])

df = pd.DataFrame(list(zip(titles, links)), columns=['Title', 'Link'])
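For the original research-dataset goal, the resulting DataFrame can be stamped and appended to a history file so repeated runs accumulate. A sketch under my own assumptions: the history file name and the `SnapshotDate` column are choices of mine, and `df` here is a stand-in for the scraped frame:

```python
import os
from datetime import date
import pandas as pd

# Stand-in for the DataFrame built by the scraping code above
df = pd.DataFrame({'Title': ['Movie A', 'Movie B'],
                   'Link': ['https://www.nowtv.com/a', 'https://www.nowtv.com/b']})

# Stamp each row with the date of this scrape
df = df.assign(SnapshotDate=date.today().isoformat())

# Append to a running CSV; write the header only when the file is first created
write_header = not os.path.exists('nowtv_history.csv')
df.to_csv('nowtv_history.csv', mode='a', header=write_header, index=False)
```

Comparing snapshots by `SnapshotDate` then shows which titles were added or removed between runs.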