What happens if <base href...> is set with a double slash?

Question

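I wanted to understand how to use the <base href="" /> value for my web crawler, so I tested several combinations in the major browsers and ended up with something about double slashes that I don't understand.

If you don't want to read everything, jump to the test results of D and E. Demonstration of all tests: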
http://gutt.it/basehref.php

My test results, step by step, when calling http://example.com/images.html:

A - Multiple base href

<html>
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

Only the first <base> with an href counts
A source starting with / targets the root
../ goes one folder up

B - Without trailing slash

<html>
<head>
<base href="http://example.com/images" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

<base href> ignores everything after the last slash, so http://example.com/images becomes http://example.com/

C - How it should be
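The same effect can be reproduced outside the browser. The following is a minimal sketch, assuming an environment that provides the WHATWG `URL` class (modern browsers, Node.js); it is an illustration and not part of the original tests.

```javascript
// Resolve the same relative URL against a base with and without a trailing slash.
const withSlash = new URL("image.jpg", "http://example.com/images/").href;
const withoutSlash = new URL("image.jpg", "http://example.com/images").href;

console.log(withSlash);    // http://example.com/images/image.jpg
console.log(withoutSlash); // http://example.com/image.jpg
```

Without the trailing slash, "images" is treated as the last path segment and is dropped before the relative URL is appended.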

<html>
<head>
<base href="http://example.com/" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg">
<img src="/image.jpg"> not found
<img src="../image.jpg"> not found
</body>
</html>

Conclusion

Same result as in test B, as expected

D - Double slash

<html>
<head>
<base href="http://example.com/images//" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg">
<img src="./image.jpg">
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

E - Double slash with whitespace

<html>
<head>
<base href="http://example.com/images/ /" />
</head>
<body>
<img src="/images/image.jpg">
<img src="image.jpg"> not found
<img src="./image.jpg"> not found
<img src="images/image.jpg"> not found
<img src="/image.jpg"> not found
<img src="../image.jpg">
</body>
</html>

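Neither is a "valid" URL, but both are real results from my web crawler. Please explain what happens in D and E so that ../image.jpg can be found, and why the whitespace makes a difference.

Just for your interest:

<base href="http://example.com//" /> is the same as test C
<base href="http://example.com/ /" /> is completely different. Only ../image.jpg is found
<base href="a/" /> finds only /images/image.jpg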

Answer 1

The behavior of base is explained in the HTML spec:

The base element allows authors to specify the document base URL for the purposes of resolving relative URLs.

As your test A shows, if there are multiple base elements with an href, the document base URL will be taken from the first one.

Resolving relative URLs is done this way:
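For a crawler that works on raw HTML, the "first base with an href" rule could be sketched as follows. `firstBaseHref` is a hypothetical helper introduced only for illustration, and the regular expression is a simplification: a real crawler would more likely use a proper HTML parser.

```javascript
// Hypothetical helper: return the href of the first <base> element that has
// an href attribute, or null if there is none.
// Regex-based sketch only; it does not handle every HTML edge case.
function firstBaseHref(html) {
  const match = /<base\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/i.exec(html);
  return match ? match[2] : null;
}

// The <head> of test A: the first <base> has no href, so it is skipped.
const page = `
<head>
<base target="_blank" />
<base href="http://example.com/images/" />
<base href="http://example.com/" />
</head>`;

console.log(firstBaseHref(page)); // http://example.com/images/
```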

Apply the URL parser to url, with base as the base URL, with encoding as the encoding.

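The URL parsing algorithm is defined in the URL spec.

It's too complex to explain in detail here. But basically, this is what happens:

A relative URL starting with / is calculated with respect to the base URL's host.
Otherwise, the relative URL is calculated with respect to the last directory of the base URL.
Be aware that if the base path doesn't end with /, the last part will be a file, not a directory.
./ is the current directory
../ goes one directory up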

(Probably "directory" and "file" are not the right terminology for URLs.)

Some examples:
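The rules above can be checked with a short sketch. It assumes the WHATWG `URL` class is available (modern browsers, Node.js) and uses the base URL and image sources from test A; the variable names are made up for this example.

```javascript
// Resolve the six image sources from the question against the base of test A.
const rulesBase = "http://example.com/images/";
const sources = [
  "/images/image.jpg", // starts with /: resolved against the host
  "image.jpg",         // resolved against the last directory of the base
  "./image.jpg",       // ./ is the current directory
  "images/image.jpg",  // appended to the base directory
  "/image.jpg",        // starts with /: resolved against the host
  "../image.jpg"       // ../ goes one directory up
];

const resolved = {};
for (const src of sources) {
  resolved[src] = new URL(src, rulesBase).href;
}
console.log(resolved);
```

The first three sources all resolve to http://example.com/images/image.jpg, which matches the images found in test A.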

http://example.com/images/a/./ is http://example.com/images/a/
http://example.com/images/a/../ is http://example.com/images/
http://example.com/images//./ is http://example.com/images//
http://example.com/images//../ is http://example.com/images/
http://example.com/images/./ is http://example.com/images/
http://example.com/images/../ is http://example.com/

Note that in most cases // is treated like /. As said by @poncha,
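These normalizations can be reproduced with the WHATWG `URL` class, which removes the `.` and `..` segments while parsing. This is a small sketch assuming that class is available; note that the empty segment between two slashes counts as a segment of its own, which is why `//../` only removes the empty one.

```javascript
// Parse each example URL and print its normalized form.
const examples = [
  "http://example.com/images/a/./",
  "http://example.com/images/a/../",
  "http://example.com/images//./",
  "http://example.com/images//../",
  "http://example.com/images/./",
  "http://example.com/images/../"
];

const normalized = examples.map(function (u) {
  return new URL(u).href;
});

examples.forEach(function (u, i) {
  console.log(u + " is " + normalized[i]);
});
```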

Unless you're using some kind of URL rewriting (in which case the rewriting rules may be affected by the number of slashes), the uri maps to a path on disk, but in (most?) modern operating systems (Linux/Unix, Windows), multiple path separators in a row do not have any special meaning, so /path/to/foo and /path//to////foo would eventually map to the same file.

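In general, however, / / will not be treated like //, because the space forms a path segment of its own.

You can use the following snippet to resolve a list of relative URLs to absolute ones: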
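Applied to tests D and E, the difference looks like this. The sketch assumes the WHATWG `URL` class (modern browsers, Node.js); whether the resulting URLs are actually found still depends on how the server maps paths to files.

```javascript
// Resolve two of the image sources against the bases of tests D and E.
const testBases = {
  D: "http://example.com/images//",
  E: "http://example.com/images/ /"
};

const outcome = {};
for (const name of Object.keys(testBases)) {
  outcome[name] = {
    plain: new URL("image.jpg", testBases[name]).href,
    parent: new URL("../image.jpg", testBases[name]).href
  };
}

console.log(outcome.D.plain);  // http://example.com/images//image.jpg
console.log(outcome.D.parent); // http://example.com/images/image.jpg
console.log(outcome.E.plain);  // http://example.com/images/%20/image.jpg
console.log(outcome.E.parent); // http://example.com/images/image.jpg
```

In both cases ../ removes the last segment (the empty one in D, the space in E), so ../image.jpg ends up in /images/. For image.jpg, D keeps a double slash that most servers collapse, while E keeps a %20 segment that would have to exist as a real directory.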

var bases = [
  "http://example.com/images/",
  "http://example.com/images",
  "http://example.com/",
  "http://example.com/images//",
  "http://example.com/images/ /"
];
var urls = [
  "/images/image.jpg",
  "image.jpg",
  "./image.jpg",
  "images/image.jpg",
  "/image.jpg",
  "../image.jpg"
];
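// Create an element of the given type and append its contents,
// which may be a string, a node, or an array of strings and nodes.
function newEl(type, contents) {
  var el = document.createElement(type);
  if(!contents) return el;
  if(!(contents instanceof Array))
    contents = [contents];
  for(var i=0; i<contents.length; ++i)
    if(typeof contents[i] == 'string')
      el.appendChild(document.createTextNode(contents[i]));
    else if(typeof contents[i] == 'object') // contents[i] instanceof Node
      el.appendChild(contents[i]);
  return el;
}
// Classify a resolved URL: 'good' if it is the expected image URL,
// 'neutral' if it differs only by a double slash, 'bad' otherwise.
function emoticon(str) {
  return {
    'http://example.com/images/image.jpg': 'good',
    'http://example.com/images//image.jpg': 'neutral'
  }[str] || 'bad';
}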
var base = document.createElement('base'),
    a = document.createElement('a'),
    output = document.createElement('ul'),
    head = document.getElementsByTagName('head')[0];
head.insertBefore(base, head.firstChild);
for(var i=0; i<bases.length; ++i) {
  base.href = bases[i];
  var test = newEl('li', [
    'Test ' + (i+1) + ': ',
    newEl('span', bases[i])
  ]);
  test.className = 'test';
  var testItems = newEl('ul');
  testItems.className = 'test-items';
  for(var j=0; j<urls.length; ++j) {
    a.href = urls[j];
    var absURL = a.cloneNode(false).href;
      /* Stupid old IE requires cloning
          */
    var testItem = newEl('li', [
      newEl('span', urls[j]),
      ' → ',
      newEl('span', absURL)
    ]);
    testItem.className = 'test-item ' + emoticon(absURL);
    testItems.appendChild(testItem);
  }
  test.appendChild(testItems);
  output.appendChild(test);
}
document.body.appendChild(output);

span {
  background: #eef;
}
.test-items {
  display: table;
  border-spacing: .13em;
  padding-left: 1.1em;
  margin-bottom: .3em;
}
.test-item {
  display: table-row;
  position: relative;
  list-style: none;
}
.test-item > span {
  display: table-cell;
}
.test-item:before {
  display: inline-block;
  width: 1.1em;
  height: 1.1em;
  line-height: 1em;
  text-align: center;
  border-radius: 50%;
  margin-right: .4em;
  position: absolute;
  left: -1.1em;
  top: 0;
}
.good:before {
  content: ':)';
  background: #0f0;
}
.neutral:before {
  content: ':|';
  background: #ff0;
}
.bad:before {
  content: ':(';
  background: #f00;
}

You can also play with this snippet:

var resolveURL = (function() {
  var base = document.createElement('base'),
      a = document.createElement('a'),
      head = document.getElementsByTagName('head')[0];
  return function(url, baseurl) {
    if(base) {
      base.href = baseurl;
      head.insertBefore(base, head.firstChild);
    }
    a.href = url;
    var abs = a.cloneNode(false).href;
    /* Stupid old IE requires cloning
        */
    if(base)
      head.removeChild(base);
    return abs;
  };
})();
var base = document.getElementById('base'),
    url = document.getElementById('url'),
    abs = document.getElementById('absolute');
base.onpropertychange = url.onpropertychange = function() {
  if (event.propertyName == "value")
    update()
};
(base.oninput = url.oninput = update)();
function update() {
  abs.value = resolveURL(url.value, base.value);
}

label {
  display: block;
  margin: 1em 0;
}
input {
  width: 100%;
}

<label>
  Base url:
  <input id="base" value="http://example.com/images//foo////bar/baz"
         placeholder="Enter your base url here" />
</label>
<label>
  URL to be resolved:
  <input id="url" value="./a/b/../c"
         placeholder="Enter your URL here">
</label>
<label>
  Resulting url:
  <input id="absolute" readonly>
</label>
