How can I use regex to help scrape web data?

Question

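I'm trying to get the URLs on a single YouTube video page. youtube-dl can do this, but I only need the URLs, so I want to learn how to do it myself.

Here is my code to get the page source: source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")

I'm looking at line 21 of the source, which I get with this code: source_line_21 = source.text.split("\n")[20]

I want all URLs that start with https://r[0-9], include googlevideo.com/videoplayback, and end with ","

I've tried a lot of code but always get 0 or 1 matches, even though there are 15-20 matches.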

re.match(r'https:\/\/.*googlevideo.com/videoplayback.*mimeType', source_line_21)

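I'm not good at regex and I can't get it right. Thanks, everyone.

The output of print(source_line_21[:32600]) is what I'm searching through. It is too long to include here, so I pasted it on an external page.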

Answer 1

Use this:

re.match(r'https:\/\/r[0-9][\w\-@%]*googlevideo.com/videoplayback","$', source_line_21)
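A note on the "0 or 1 matches" symptom from the question: re.match only attempts a match at the very start of the string and returns a single match object (or None), while re.findall scans the whole string and returns every non-overlapping match. The sketch below shows the difference on an invented sample string. The sample, its URLs, and the exact pattern are assumptions for illustration, not the real page content.

```python
import re

# Invented sample standing in for the real page source; the URLs are placeholders.
sample = (
    '{"url":"https://r1---sn-example.googlevideo.com/videoplayback?id=1","mimeType":"video/mp4"},'
    '{"url":"https://r2---sn-example.googlevideo.com/videoplayback?id=2","mimeType":"audio/mp4"}'
)

# Illustrative pattern: stop at the closing double quote of each URL.
pattern = r'https://r[0-9][^"]*googlevideo\.com/videoplayback[^"]*'

# re.match only tries to match at position 0, and the sample starts with '{'.
first = re.match(pattern, sample)

# re.search finds only the first occurrence anywhere in the string.
one = re.search(pattern, sample)

# re.findall returns every non-overlapping occurrence as a list of strings.
every = re.findall(pattern, sample)

print(first)        # None
print(one.group(0))
print(len(every))   # 2
```

If the real line contains 15-20 such URLs, re.findall should return a list of that length, provided each URL is terminated by a double quote as in this sample.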

Answer 2

I found a solution, but it isn't really what I wanted.

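import re

import requests
from urlextract import URLExtract

source = requests.get("https://www.youtube.com/watch?v=zXif_9RVadI")

# Take line 21 of the page source (index 20).
source_line_21 = source.text.split("\n")[20]

# Greedy match from the first https://r<digit> up to the last "mimeType".
sonuc = re.findall(r'https://r[0-9].*\SmimeType', source_line_21)

# Let URLExtract split the matched span into URLs and keep only the audio ones.
extractor = URLExtract()
aa = [x for x in extractor.find_urls(sonuc[0]) if "mime=audio" in x]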

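This code gives me all the URLs that contain mime=audio. I used the URLExtract module, which is a third-party package rather than part of the standard library, so I'm still looking for a better way to solve my problem.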
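One possible way to avoid the external dependency is to do the same filtering with the standard library only. The sketch below assumes that each URL sits inside a double-quoted JSON string and that ampersands may appear escaped as \u0026. The sample string and the helper name audio_urls are invented for illustration.

```python
import json
import re
from urllib.parse import parse_qs, urlsplit

# Invented sample; real page data is far longer and its URLs carry many more parameters.
sample = (
    r'{"url":"https://r1---sn-example.googlevideo.com/videoplayback?id=1\u0026mime=video%2Fmp4","mimeType":"video/mp4"},'
    r'{"url":"https://r2---sn-example.googlevideo.com/videoplayback?id=2\u0026mime=audio%2Fmp4","mimeType":"audio/mp4"}'
)


def audio_urls(text):
    """Return URLs in text whose mime query parameter starts with 'audio'."""
    found = []
    for raw in re.findall(r'https://r[0-9][^"]*', text):
        # The match is the inside of a JSON string, so let json undo escapes such as \u0026.
        url = json.loads('"' + raw + '"')
        # parse_qs percent-decodes values, so mime=audio%2Fmp4 becomes "audio/mp4".
        mime = parse_qs(urlsplit(url).query).get("mime", [""])[0]
        if mime.startswith("audio"):
            found.append(url)
    return found


print(audio_urls(sample))
```

Checking the parsed mime parameter rather than the substring "mime=audio" avoids false positives when that text appears elsewhere in a URL, at the cost of a little more code.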

Answer 3

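What you're trying to do is slightly more complex, but it can be simplified by using the few tools listed below.

I used urllib in the example because my requests call returned Google's "Before you continue to YouTube" cookie consent page, whereas urllib let me bypass it.

Tools:

urllib (or) requests
BeautifulSoup - via the bs4 library
Regex - via the re library
JSON - via the json library

Logic:

Scrape the site data
Parse it using BeautifulSoup
Extract the tags of interest
Iterate over the tags and use a regex to extract the content of the variable of interest
Iterate over the variable's content (using JSON) to get the URLs

Code: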

import json
import re
import urllib.request

from bs4 import BeautifulSoup as bs

# Use urllib to read the site content.
source = urllib.request.urlopen("https://www.youtube.com/watch?v=zXif_9RVadI").read().decode()
# Parse the HTML using BeautifulSoup.
soup = bs(source, features='html.parser')
# Extract all <script> tags.
scripts = soup.find_all('script')
# Build a regex pattern to extract the <script> tag's content.
exp = re.compile(r'^var\sytInitialPlayerResponse\s=\s(?P<content>.*\})')

# Iterate through all scripts to find the one with the video content.
data = None
for s in scripts:
    if s.string:
        m = exp.match(s.string)
        if m:
            data = m.group('content')
            break

# Fail clearly if the page layout has changed and the variable is missing.
if data is None:
    raise ValueError("ytInitialPlayerResponse was not found in the page")

# Parse the content of the <script> of interest as JSON.
content = json.loads(data)

# Collect all URIs into a list.
urls = []
for fmt in ['formats', 'adaptiveFormats']:
    for ele in content['streamingData'][fmt]:
        urls.append(ele['url'])

Confirm the URIs:

# Print the detected URIs:
for i, url in enumerate(urls, 1):
    print(i, url[:75])

1 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
2 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
3 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
4 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
5 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
6 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
7 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
8 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
9 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
10 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
11 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
12 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
13 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
14 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
15 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202
16 https://r2---sn-8pgbpohxqp5-cimd.googlevideo.com/videoplayback?expire=16202

Answer 4

You can use

re.findall(r'https://r[0-9][^"]*', text)
re.findall(r'https://r[0-9][^"]*', text, re.I)  # case insensitive

See the regex demo.

Details

https:// - the https:// string (to match http:// as well, add ? after the s: https?://)
r - an r character
[0-9] - a digit
[^"]* - zero or more characters other than a " character.
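To see how the pieces above behave, here is a small sketch run against an invented string. The text and every URL in it are placeholders, not real page content. It compares the plain pattern, the https? variant, and the re.I flag.

```python
import re

# Invented text for illustration; none of these URLs are real.
text = (
    '"https://r3---sn-example.googlevideo.com/videoplayback?x=1",'
    '"http://r4---sn-example.googlevideo.com/videoplayback?x=2",'
    '"HTTPS://R5---SN-EXAMPLE.GOOGLEVIDEO.COM/videoplayback?x=3",'
    '"https://www.example.com/other"'
)

# Plain pattern: lowercase https:// followed by r and a digit, up to the next double quote.
strict = re.findall(r'https://r[0-9][^"]*', text)

# Optional s: also accepts http://.
either_scheme = re.findall(r'https?://r[0-9][^"]*', text)

# Case-insensitive: also accepts HTTPS://R5...
any_case = re.findall(r'https://r[0-9][^"]*', text, re.I)

print(len(strict), len(either_scheme), len(any_case))  # 1 2 2
```

The last URL is never matched because the host does not start with r followed by a digit, and each match stops before the closing double quote because of [^"]*.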
