What happens if <base href...> is set with a double slash?
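I would like to understand how to handle the <base href="" /> value in my web crawler, so I tested several combinations in the major browsers and finally found something about double slashes that I don't understand.
If you don't want to read everything, jump to the results of tests D and E. Demonstration of all tests:
http://gutt.it/basehref.php
Step by step, my test results when calling http://example.com/images.html:
A - Multiple base href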
<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- Only the first <base> with an href counts
- A source starting with / targets the root
- ../ goes one folder up
B - Without trailing slash
<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- <base href> ignores everything after the last slash, so http://example.com/images becomes http://example.com/
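The result of test B can be reproduced outside a browser. The following is only a sketch: it assumes a runtime that provides the WHATWG URL class (current browsers, Node.js), and the helper name resolve is made up for illustration. A crawler using a different URL library may behave differently.

```javascript
// Resolve a relative reference against a base URL with the WHATWG URL parser.
// `resolve` is a hypothetical helper name, not part of any library.
function resolve(relative, base) {
  return new URL(relative, base).href;
}

// Without a trailing slash, the last path segment ("images") is dropped.
const withoutSlash = resolve("image.jpg", "http://example.com/images");
// With a trailing slash, "images/" is kept as the directory.
const withSlash = resolve("image.jpg", "http://example.com/images/");

console.log(withoutSlash); // http://example.com/image.jpg
console.log(withSlash);    // http://example.com/images/image.jpg
```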
C - How it should be
<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>
Conclusion
- Same result as in test B, as expected
D - Double slash
<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>
E - Double slash with whitespace
<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>
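Neither of them is a "valid" URL, but both are real results from my web crawler. Please explain what happens in D and E so that ../image.jpg can be found, and why the whitespace makes a difference.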
Just for your interest:
- <base href="http://example.com//" /> is the same as test C
- <base href="http://example.com/ /" /> is completely different: only ../image.jpg is found
- <base href="a/" /> finds only /images/image.jpg
The behavior of base is explained in the HTML spec:
The base element allows authors to specify the document base URL for the purposes of resolving relative URLs.
As your test A shows, if there are multiple base elements with an href, the document base URL is taken from the first one.
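For a crawler, that rule could be sketched roughly as follows. The function documentBaseURL and its parameters are hypothetical names, and the sketch assumes the href attributes of all base elements have already been extracted in document order (null where a base has no href). It relies on the WHATWG URL class and ignores anything else a browser might take into account.

```javascript
// Pick the document base URL: the first <base> that has an href attribute
// wins; later ones are ignored.
// `documentBaseURL`, `documentURL` and `baseHrefs` are hypothetical names.
// `baseHrefs` holds the href of each <base> in document order, with null
// for a <base> without href (e.g. <base target="_blank" />).
function documentBaseURL(documentURL, baseHrefs) {
  const first = baseHrefs.find((href) => href != null);
  if (first === undefined) return documentURL; // no <base> with an href
  try {
    // A relative href such as "a/" is itself resolved against the document URL.
    return new URL(first, documentURL).href;
  } catch (e) {
    return documentURL; // unparsable href: fall back to the document URL
  }
}

const baseA = documentBaseURL("http://example.com/images.html", [
  null,
  "http://example.com/images/",
  "http://example.com/",
]);
const baseRelative = documentBaseURL("http://example.com/images.html", ["a/"]);

console.log(baseA);        // http://example.com/images/
console.log(baseRelative); // http://example.com/a/
```

The second call would explain why <base href="a/" /> only finds /images/image.jpg: every relative source then resolves inside http://example.com/a/.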
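Resolving relative URLs is done this way:
Apply the URL parser to url, with base as the base URL, with encoding as the encoding.
The URL parsing algorithm is defined in the URL spec.
It is too complex to explain here in detail, but basically this is what happens:
- A relative URL starting with / is resolved with respect to the base URL's host.
- Otherwise, the relative URL is resolved with respect to the base URL's last directory.
- Be aware that if the base path doesn't end with /, its last part is treated as a file, not a directory.
- ./ is the current directory
- ../ goes one directory up
(Probably "directory" and "file" are not the right terminology for URLs.)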
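These rules can be checked against tests D and E. The following is a minimal sketch assuming a runtime with the WHATWG URL class (current browsers, Node.js); resolveAll is a made-up helper name. In D the base path ends in an empty "directory" (the part between the two slashes) and in E in a "directory" whose name is a single space. ../ steps out of that extra directory and lands in /images/, which is why ../image.jpg is found in both tests.

```javascript
// The six sources used in every test of the question.
const urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg",
];

// `resolveAll` is a hypothetical helper: it maps each source to its
// absolute URL for the given base.
function resolveAll(base) {
  const result = {};
  for (const url of urls) {
    result[url] = new URL(url, base).href;
  }
  return result;
}

const testD = resolveAll("http://example.com/images//");
const testE = resolveAll("http://example.com/images/ /");

console.log(testD["../image.jpg"]); // http://example.com/images/image.jpg
console.log(testE["../image.jpg"]); // http://example.com/images/image.jpg
```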
Some examples:
- http://example.com/images/a/./ is http://example.com/images/a/
- http://example.com/images/a/../ is http://example.com/images/
- http://example.com/images//./ is http://example.com/images//
- http://example.com/images//../ is http://example.com/images/
- http://example.com/images/./ is http://example.com/images/
- http://example.com/images/../ is http://example.com/
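The examples above can be verified mechanically. This sketch again assumes the WHATWG URL class is available; the names examples and mismatches are only illustrative.

```javascript
// Each pair is [input URL, expected normalized URL], copied from the
// examples above.
const examples = [
  ["http://example.com/images/a/./", "http://example.com/images/a/"],
  ["http://example.com/images/a/../", "http://example.com/images/"],
  ["http://example.com/images//./", "http://example.com/images//"],
  ["http://example.com/images//../", "http://example.com/images/"],
  ["http://example.com/images/./", "http://example.com/images/"],
  ["http://example.com/images/../", "http://example.com/"],
];

// Collect every example whose parsed form differs from the expectation.
const mismatches = examples.filter(
  ([input, expected]) => new URL(input).href !== expected
);

console.log(mismatches.length); // 0
```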
Note that, in most cases, // is like /. As @poncha said:
Unless you're using some kind of URL rewriting (in which case the
rewriting rules may be affected by the number of slashes), the uri
maps to a path on disk, but in (most?) modern operating systems
(Linux/Unix, Windows), multiple path separators in a row do not have
any special meaning, so /path/to/foo and /path//to////foo would
eventually map to the same file.
In general, however, / / won't become //.
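The difference shows up when the base path is split into its segments. In this sketch (assuming the WHATWG URL class; pathSegments is a hypothetical helper) the space is percent-encoded as %20 and stays a segment of its own, while // only produces an empty segment. A server that collapses repeated slashes will presumably serve /images//image.jpg like /images/image.jpg, but it has no reason to do the same for /images/%20/image.jpg, which would explain why image.jpg is found in test D and not in test E.

```javascript
// Split the path of a URL into its segments (the parts between slashes).
// `pathSegments` is a hypothetical helper name.
function pathSegments(url) {
  return new URL(url).pathname.split("/").slice(1);
}

const doubleSlash = pathSegments("http://example.com/images//");
const spaceSlash = pathSegments("http://example.com/images/ /");

console.log(doubleSlash); // [ 'images', '', '' ]
console.log(spaceSlash);  // [ 'images', '%20', '' ]
```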
You can use the following snippet to resolve your list of relative URLs to absolute ones:
var bases = [
  "http://example.com/images/",
  "http://example.com/images",
  "http://example.com/",
  "http://example.com/images//",
  "http://example.com/images/ /"
];
var urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];
function newEl(type, contents) {
  var el = document.createElement(type);
  if(!contents) return el;
  if(!(contents instanceof Array))
    contents = [contents];
  for(var i=0; i<contents.length; ++i)
    if(typeof contents[i] == 'string')
      el.appendChild(document.createTextNode(contents[i]));
    else if(typeof contents[i] == 'object') // contents[i] instanceof Node
      el.appendChild(contents[i]);
  return el;
}
function emoticon(str) {
  return {
    'http://example.com/images/image.jpg': 'good',
    'http://example.com/images//image.jpg': 'neutral'
  }[str] || 'bad';
}
var base = document.createElement('base'),
    a = document.createElement('a'),
    output = document.createElement('ul'),
    head = document.getElementsByTagName('head')[0];
head.insertBefore(base, head.firstChild);
for(var i=0; i<bases.length; ++i) {
  base.href = bases[i];
  var test = newEl('li', [
    'Test ' + (i+1) + ': ',
    newEl('span', bases[i])
  ]);
  test.className = 'test';
  var testItems = newEl('ul');
  testItems.className = 'test-items';
  for(var j=0; j<urls.length; ++j) {
    a.href = urls[j];
    var absURL = a.cloneNode(false).href; /* Stupid old IE requires cloning */
    var testItem = newEl('li', [
      newEl('span', urls[j]),
      ' → ',
      newEl('span', absURL)
    ]);
    testItem.className = 'test-item ' + emoticon(absURL);
    testItems.appendChild(testItem);
  }
  test.appendChild(testItems);
  output.appendChild(test);
}
document.body.appendChild(output);
span {
  background: #eef;
}
.test-items {
  display: table;
  border-spacing: .13em;
  padding-left: 1.1em;
  margin-bottom: .3em;
}
.test-item {
  display: table-row;
  position: relative;
  list-style: none;
}
.test-item > span {
  display: table-cell;
}
.test-item:before {
  display: inline-block;
  width: 1.1em;
  height: 1.1em;
  line-height: 1em;
  text-align: center;
  border-radius: 50%;
  margin-right: .4em;
  position: absolute;
  left: -1.1em;
  top: 0;
}
.good:before {
  content: ':)';
  background: #0f0;
}
.neutral:before {
  content: ':|';
  background: #ff0;
}
.bad:before {
  content: ':(';
  background: #f00;
}
You can also play with this snippet:
var resolveURL = (function() {
  var base = document.createElement('base'),
      a = document.createElement('a'),
      head = document.getElementsByTagName('head')[0];
  return function(url, baseurl) {
    if(base) {
      base.href = baseurl;
      head.insertBefore(base, head.firstChild);
    }
    a.href = url;
    var abs = a.cloneNode(false).href; /* Stupid old IE requires cloning */
    if(base)
      head.removeChild(base);
    return abs;
  };
})();
var base = document.getElementById('base'),
    url = document.getElementById('url'),
    abs = document.getElementById('absolute');
base.onpropertychange = url.onpropertychange = function() {
  if (event.propertyName == "value")
    update();
};
(base.oninput = url.oninput = update)();
function update() {
  abs.value = resolveURL(url.value, base.value);
}
label {
  display: block;
  margin: 1em 0;
}
input {
  width: 100%;
}
<label>
  Base url:
  <input id="base" value="http://example.com/images//foo////bar/baz"
         placeholder="Enter your base url here" />
</label>
<label>
  URL to be resolved:
  <input id="url" value="./a/b/../c"
         placeholder="Enter your URL here">
</label>
<label>
  Resulting url:
  <input id="absolute" readonly>
</label>