How to always get the website title without downloading all the page source

Question

Yes, I agree that at first glance this looks exactly the same as the following:

How to get webpage title without downloading all the page source
How to get website title from c#

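To be honest, this question is very closely related to those two. However, I have noticed that the code in almost every link I found while researching this particular topic is flawed.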

Here are some other links whose content is similar to the links above:

In case it matters, I obtain the page's URL using the method described in the following link, although I don't think this is relevant:

Dragging URLs to Windows Forms controls in C#

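The code in the first link works well, apart from one big problem:

For example, if I take the URL from this site: http://www.dotnetperls.com/imagelist

and pass it to the code, of which I have the following modified version: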

private static string GetWebPageTitle(string url)
{
    HttpWebRequest request = (HttpWebRequest.Create(url) as HttpWebRequest);
    HttpWebResponse response = (request.GetResponse() as HttpWebResponse);
    using (Stream stream = response.GetResponseStream())
    {
        // compiled regex to check for <title></title> block
        Regex titleCheck = new Regex(@"<title>\s*(.+?)\s*</title>", RegexOptions.Compiled | RegexOptions.IgnoreCase);
        int bytesToRead = 8092;
        byte[] buffer = new byte[bytesToRead];
        string contents = "";
        int length = 0;
        while ((length = stream.Read(buffer, 0, bytesToRead)) > 0)
        {
            // convert the byte-array to a string and add it to the rest of the
            // contents that have been downloaded so far
            contents += Encoding.UTF8.GetString(buffer, 0, length);

            Match m = titleCheck.Match(contents);
            if (m.Success)
            {
                // we found a <title></title> match =]
                return m.Groups[1].Value;
            }
            else if (contents.Contains("</head>"))
            {
                // reached end of head-block; no title found =[
                return null;
            }
        }
        return null;
    }
}

It returns a blank result, or null. Yet when I look at the page's HTML, the title tag is definitely there.

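So my question is: how can I modify or correct the code, whether my modified version or the code from any of the other four links, so that it gets the title from every web page that has a title tag? One example is the last link in this question, the one from DotNetPerls.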

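I'm only guessing, but I wonder whether this site is served differently from other typical sites. Perhaps it returns no markup on the first load, and the browser then transparently reloads the site after that first load...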

If possible, I would prefer an answer with some working example code.

Answer 1

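It doesn't match the title because the stream is actually the raw stream, which in this case has been gzip-compressed. (Add a Console.WriteLine(contents) inside the loop to see this.)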
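To see the effect offline, here is a minimal sketch that compresses a small HTML document in memory and then looks for the title in both the raw and the decompressed bytes. The class name GzipTitleDemo and its helper methods are hypothetical, and the in-memory GZipStream only stands in for a server that sends a gzip-encoded body.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.RegularExpressions;

public static class GzipTitleDemo
{
    // Compresses text roughly the way a server sending a gzip-encoded body would.
    public static byte[] Gzip(string text)
    {
        using (var output = new MemoryStream())
        {
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                byte[] raw = Encoding.UTF8.GetBytes(text);
                gzip.Write(raw, 0, raw.Length);
            }
            // ToArray still works after the MemoryStream has been closed.
            return output.ToArray();
        }
    }

    // Reverses Gzip(): decompresses the bytes and decodes them as UTF-8.
    public static string Gunzip(byte[] compressed)
    {
        using (var input = new MemoryStream(compressed))
        using (var gzip = new GZipStream(input, CompressionMode.Decompress))
        using (var reader = new StreamReader(gzip, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    // Same pattern as in the question's code.
    public static bool HasTitle(string contents)
    {
        return Regex.IsMatch(contents, @"<title>\s*(.+?)\s*</title>", RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        byte[] body = Gzip("<html><head><title>Example</title></head><body></body></html>");

        // A gzip stream starts with the magic bytes 0x1F 0x8B, not with "<html>".
        Console.WriteLine("First bytes: {0:X2} {1:X2}", body[0], body[1]);

        // What the question's loop effectively does: decode the compressed bytes as text.
        Console.WriteLine("Title in raw bytes: " + HasTitle(Encoding.UTF8.GetString(body)));

        // After decompression the markup is readable again.
        Console.WriteLine("Title after decompression: " + HasTitle(Gunzip(body)));
    }
}
```

The point of the sketch is only that the bytes handed to Encoding.UTF8.GetString are compressed data rather than HTML, so the regex has nothing recognizable to match until they are decompressed.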

To have the stream decompressed automatically, set the following on the request before calling GetResponse():

request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;

(The automatic decompression solution was taken from here.)
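Putting the pieces together, the following is one possible sketch of the question's method with automatic decompression enabled. It is not the only way to do this. The TitleFetcher class and the ReadTitle helper are hypothetical names, the page is assumed to be UTF-8 as in the original code, and the regex is slightly loosened (attributes on the title tag, titles spanning several lines), which is a change beyond what the answer itself suggests.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public static class TitleFetcher
{
    // Singleline lets "." match newlines, so a title split over lines still matches.
    private static readonly Regex TitleCheck = new Regex(
        @"<title[^>]*>\s*(.+?)\s*</title>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Reads text in chunks and stops as soon as a title or the end of <head> is seen.
    public static string ReadTitle(TextReader reader)
    {
        var contents = new StringBuilder();
        var buffer = new char[4096];
        int length;
        while ((length = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            contents.Append(buffer, 0, length);
            string soFar = contents.ToString();

            Match m = TitleCheck.Match(soFar);
            if (m.Success)
                return m.Groups[1].Value;

            // End of the head block reached without a title.
            if (soFar.IndexOf("</head>", StringComparison.OrdinalIgnoreCase) >= 0)
                return null;
        }
        return null;
    }

    public static string GetWebPageTitle(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        // Let the framework undo gzip/deflate before the bytes reach our code.
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8)) // assumption: UTF-8 page
        {
            return ReadTitle(reader);
        }
    }

    public static void Main()
    {
        // Offline check of the parsing helper; no network access is needed here.
        string html = "<html><head><title>  Hello World </title></head></html>";
        Console.WriteLine(ReadTitle(new StringReader(html)));
    }
}
```

Reading through a StreamReader rather than calling Encoding.UTF8.GetString on each buffer also avoids corrupting a multi-byte character that happens to straddle two reads. Calling TitleFetcher.GetWebPageTitle with the DotNetPerls URL from the question would exercise the network path.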

c#

.net-4.0

winforms