How to extract parameters from URL?
url = 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/'
url2 = 'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
new = url.split("/")[-4:]
new2 = url2.split("/")[-2:]
print(new)
print(new2)
Output : ['world-cuisine', 'asian', 'chinese', '']
['soups-stews-and-chili', '']
- The output I need is ['world-cuisine', 'asian', 'chinese'] & ['soups-stews-and-chili'].
- The URLs have different numbers of parameters, so I can't just go through every URL and pull out only the main parameters after the number.
- Also, the trailing '/' in the URL is required: in Scrapy, when I use a URL without the '/', it throws a 301 error, but as you can see from the output there is an extra '' because I can't leave out the trailing slash.
- What can I do to get the parameters for these various URLs?
Some other examples of the URLs are:
'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/'
'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/'
How do we write a rule for URLs like these to follow pagination such as 'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'? My attempt is below (a possible fix is sketched right after this question):
Rule(LinkExtractor(allow=(r'recipes/?page=\d+',)), follow=True)
I'm new to scrapy and regex, so any help with this is much appreciated.
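A possible way to write that pagination rule (a sketch, assuming the goal is only to follow the '?page=N' links): in a regular expression '?' is a quantifier, so it has to be escaped, and the category segments between the numeric id and the query string also need to be matched.
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# Sketch of a rule for a CrawlSpider's `rules` tuple; the original
# r'recipes/?page=\d+' would not match because '?' is read as a quantifier.
Rule(LinkExtractor(allow=(r'recipes/\d+/.*\?page=\d+',)), follow=True)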
You can combine the re module with str.split:
import re
urls = [
"https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/",
"https://www.allrecipes.com/recipes/94/soups-stews-and-chili/",
"https://www.allrecipes.com/recipes/416/seafood/fish/salmon/",
"https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/",
]
# capture everything between the numeric segment and the last '/'
r = re.compile(r"(?:\d+/)(.*)/")
for url in urls:
    print(r.search(url).group(1).split("/"))
Prints:
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
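The same pattern also copes with a pagination query string, since the greedy (.*) stops at the last '/' before '?page=2'; a quick check with the paginated URL from the question:
print(r.search("https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2").group(1).split("/"))
# ['world-cuisine', 'asian', 'chinese']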
I'm not 100% sure I understood your question correctly, but I think the following code does what you need.
EDIT
Code updated after the exchange in the comments.
urls = [
'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
'https://www.allrecipes.com/recipes/qqqq/94/soups-stews-and-chili/x/y/z/q'
]
for url in urls:
    for index, part in enumerate(url.split('/')):
        if part.isnumeric():
            start = index + 1
            break
    print(url.split('/')[start:-1])
Output
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
['soups-stews-and-chili', 'x', 'y', 'z']
Old answer
urls = [
'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/'
]
for url in urls:
    print(url.split("/")[5:-1])
Output
['seafood', 'fish', 'salmon']
['meat-and-poultry', 'pork']
['world-cuisine', 'asian', 'chinese']
['soups-stews-and-chili']
Something like this. The idea is to find the 'int' path element and take all the path elements to its right.
from collections import defaultdict
from typing import Dict, List
urls = ['https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/']
def is_int(param: str) -> bool:
    try:
        int(param)
        return True
    except ValueError:
        return False

data: Dict[str, List[str]] = defaultdict(list)
for url in urls:
    elements = url.split('/')
    elements.reverse()
    loop = True
    while loop:
        for element in elements:
            if len(element.strip()) < 1:
                continue
            if not is_int(element):
                data[url].append(element)
            else:
                loop = False
                break
print(data)
Output
defaultdict(<class 'list'>, {'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/': ['salmon', 'fish', 'seafood'], 'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/': ['pork', 'meat-and-poultry']})
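Note that because the elements are visited right to left, each list comes out reversed compared with the order asked for in the question. A small follow-up sketch, assuming the original left-to-right order is wanted:
# hypothetical post-processing step: restore left-to-right order
ordered = {url: parts[::-1] for url, parts in data.items()}
print(ordered)
# {'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/': ['seafood', 'fish', 'salmon'], ...}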
Try to avoid (or at least postpone) regex when dealing with URLs, and look first at urllib or similar, and/or split().
First, just one URL (the last one, with the ?page=2 query) in full detail:
from urllib.parse import urlparse
urlparse(urls[4])
ParseResult(scheme='https', netloc='www.allrecipes.com', path='/recipes/695/world-cuisine/asian/chinese/', params='', query='page=2', fragment='')
Then loop over the list, using just path and split():
# a list of urls
urls = ['https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/',
'https://www.allrecipes.com/recipes/94/soups-stews-and-chili/',
'https://www.allrecipes.com/recipes/416/seafood/fish/salmon/',
'https://www.allrecipes.com/recipes/205/meat-and-poultry/pork/',
'https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2']
for url in urls:
    # https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/
    l = urlparse(url).path.split('/')
    # ['', 'recipes', '695', 'world-cuisine', 'asian', 'chinese', '']
    print(l[3:])
    # ['world-cuisine', 'asian', 'chinese', '']
    print('/'.join(l[3:]), '\n')
    # world-cuisine/asian/chinese/
Full output of the above:
['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/
['soups-stews-and-chili', '']
soups-stews-and-chili/
['seafood', 'fish', 'salmon', '']
seafood/fish/salmon/
['meat-and-poultry', 'pork', '']
meat-and-poultry/pork/
['world-cuisine', 'asian', 'chinese', '']
world-cuisine/asian/chinese/
Another example (this time not just path):
for parts in urls:
    print(list(urlparse(parts)), '\n')
Output:
['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/94/soups-stews-and-chili/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/416/seafood/fish/salmon/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/205/meat-and-poultry/pork/', '', '', '']
['https', 'www.allrecipes.com', '/recipes/695/world-cuisine/asian/chinese/', '', 'page=2', '']
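Finally, a minimal combined sketch, assuming the goal is exactly the lists from the question (no trailing '' and no numeric id): the empty strings produced by the leading and trailing slashes are filtered out, and everything up to and including the first numeric segment is dropped. The helper name extract_params is only for illustration:
from urllib.parse import urlparse

def extract_params(url):
    # split the path and drop the '' entries left by the leading/trailing slashes
    parts = [p for p in urlparse(url).path.split('/') if p]
    # keep only the segments after the first numeric one (e.g. after '695')
    for i, p in enumerate(parts):
        if p.isdigit():
            return parts[i + 1:]
    return parts

print(extract_params('https://www.allrecipes.com/recipes/695/world-cuisine/asian/chinese/?page=2'))
# ['world-cuisine', 'asian', 'chinese']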