如何使用 python 从网页中抓取视频 URL?
How to scrape video URL from Webpage using python?
我想从网站下载视频。
这是我的代码。
每次当我 运行 这段代码时,它都是 returns 空白文件。
这是实时代码:https://colab.research.google.com/drive/19NDLYHI2n9rG6KeBCiv9vKXdwb5JL9Nb?usp=sharing
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
#print(soup.prettify())
result = soup.find_all('video', class_="video-player")
print(result)
你总是得到一个空白 return 因为 soup.find_all()
没有找到任何东西。
也许您应该手动检查收到的 url.content,然后决定使用 find_all()
查找什么
编辑:经过一番挖掘后,我发现了如何获得 content_url_orig
:
from bs4 import BeautifulSoup
import requests
import json
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
result = str(soup.find_all('script')[1]) #looking for script tag inside the html-file
result = result.split('window._state = ')[1].split("</script>']")[0].split('\n')[0]
#separating the json from the whole script-string, digged around in the file to find out how to do it
result = json.loads(result)
#navigating in the json to get the video-url
entity = list(result['entities'].items())[0][1]
download_url = entity['content_url_orig']
print(download_url)
有趣的旁注:如果我阅读 JSON 正确,您可以找到所有包含 download-URLs 创作者上传的视频:)
使用正则表达式
import requests
import re
response = requests.get("....../video/20000kCCF0")
videoId = '20000kCCF0'
videos = re.findall(r'https://[^"]+' + videoId + '[^"]+mp4', response.text)
print(videos)
我想从网站下载视频。
这是我的代码。 每次当我 运行 这段代码时,它都是 returns 空白文件。 这是实时代码:https://colab.research.google.com/drive/19NDLYHI2n9rG6KeBCiv9vKXdwb5JL9Nb?usp=sharing
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
#print(soup.prettify())
result = soup.find_all('video', class_="video-player")
print(result)
你总是得到一个空白 return 因为 soup.find_all()
没有找到任何东西。
也许您应该手动检查收到的 url.content,然后决定使用 find_all()
编辑:经过一番挖掘后,我发现了如何获得 content_url_orig
:
from bs4 import BeautifulSoup
import requests
import json
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
result = str(soup.find_all('script')[1]) #looking for script tag inside the html-file
result = result.split('window._state = ')[1].split("</script>']")[0].split('\n')[0]
#separating the json from the whole script-string, digged around in the file to find out how to do it
result = json.loads(result)
#navigating in the json to get the video-url
entity = list(result['entities'].items())[0][1]
download_url = entity['content_url_orig']
print(download_url)
有趣的旁注:如果我阅读 JSON 正确,您可以找到所有包含 download-URLs 创作者上传的视频:)
使用正则表达式
import requests
import re
response = requests.get("....../video/20000kCCF0")
videoId = '20000kCCF0'
videos = re.findall(r'https://[^"]+' + videoId + '[^"]+mp4', response.text)
print(videos)