Regex to find iframe tags and retrieve attributes

Question

I'm trying to retrieve iframe tags and their attributes from an HTML input.

Sample input

<div class="1"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/1" frameborder="0" allowfullscreen=""></iframe></div>
<div class="2"><iframe width="100%" height="427px" src="https://www.youtube.com/embed/2" frameborder="0" allowfullscreen=""></iframe></div>

I've been trying to collect them using the following regex:

<iframe.+?width=[\"'](?<width>.*?)[\"']?height=[\"'](?<height>.*?)[\"']?src=[\"'](?<src>.*?)[\"'].+?>

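This gives me one match per iframe, with the width, height, and src values in named groups, which is exactly the format I want.

The problem is that this regex stops working if the HTML attributes appear in a different order.

Is there any way to modify this regex so that it ignores attribute order and returns the iframes grouped in Matches, so that I can iterate through them?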

Answer 1

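Regular expressions match patterns, and the structure of your string defines which pattern to use, so if you want to use a regex, order matters.

You can approach this in two ways:

1. The recommended approach is not to parse HTML with regular expressions (mandatory link), but rather to use a parsing framework such as the HTML Agility Pack. This should let you process the HTML you need and extract whatever values you want.
2. The second, bad and not recommended approach is to split your matching into two parts. You first use something like <iframe(.+?)></iframe> to extract the entire iframe declaration, and then use several smaller regular expressions to seek out the settings you are after. The regex above will obviously fail if your iframe is structured like <iframe.../>. This should give you a hint as to why you should not parse HTML with regular expressions.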
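For illustration only, the two-part idea from the second option might look roughly like the sketch below. The class and method names (IframeTwoPass, GetAttribute, Extract) are made up for this example, and the first-pass pattern <iframe\b([^>]*)> is a substitute for the one quoted above, chosen so that a self-closing tag is not skipped. It assumes quoted attribute values that contain no quotes or > characters, so it inherits all the usual fragility of regex-based HTML handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IframeTwoPass
{
    // Pass 1: capture the raw attribute text of every <iframe ...> opening tag.
    private static readonly Regex TagRx =
        new Regex(@"<iframe\b([^>]*)>", RegexOptions.IgnoreCase);

    // Pass 2: look up one attribute by name; returns null when it is absent.
    public static string GetAttribute(string attributes, string name)
    {
        // (?<![\w-]) keeps "width" from matching inside e.g. "data-width".
        string pattern = @"(?<![\w-])" + Regex.Escape(name)
                       + @"\s*=\s*[""']([^""']*)[""']";
        Match m = Regex.Match(attributes, pattern, RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Returns one dictionary per iframe, regardless of attribute order.
    public static List<Dictionary<string, string>> Extract(string html)
    {
        var result = new List<Dictionary<string, string>>();
        foreach (Match tag in TagRx.Matches(html))
        {
            string attrs = tag.Groups[1].Value;
            result.Add(new Dictionary<string, string>
            {
                ["width"] = GetAttribute(attrs, "width"),
                ["height"] = GetAttribute(attrs, "height"),
                ["src"] = GetAttribute(attrs, "src"),
            });
        }
        return result;
    }

    public static void Main()
    {
        // Attributes deliberately listed in a different order than in the question.
        string html = "<div><iframe src=\"https://www.youtube.com/embed/1\" "
                    + "height=\"427px\" width=\"100%\"></iframe></div>";
        foreach (var iframe in Extract(html))
        {
            Console.WriteLine(iframe["width"]);
            Console.WriteLine(iframe["height"]);
            Console.WriteLine(iframe["src"]);
        }
    }
}
```

Because each attribute is searched for separately, the order in which width, height, and src appear no longer matters; the cost is one extra regex pass per attribute.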

As stated above, you should go with the first option.

Answer 2

You need to use the OR operator (|). See the change below:

<iframe.+?((width=[\"'](?<width>.*?)[\"']?)|(height=[\"'](?<height>.*?)[\"']?)|(src=[\"'](?<src>.*?)[\"']))*.+?>

Answer 3

You can use this regex:

<iframe[ ]+(([a-z]+) *= *['"]*([a-zA-Z0-9\/:\.%]*)['"]*[ ]*)*>

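It matches every 'name'='value' pair by repeating a group and stores the pairs in the match in the same order, so you can iterate through the captures to get the names and values sequentially. The value character class covers most characters, but you can add more if needed.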
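As a rough sketch of how the captures of that repeated group could be read in .NET: group 2 holds the attribute names and group 3 the values, and each group's Captures collection keeps one entry per repetition. The class name IframePairs and the Parse method are invented for this example; the pattern is the one above, written as a C# verbatim string.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class IframePairs
{
    // Same pattern as above; "" is an escaped double quote in a verbatim string.
    private static readonly Regex Rx = new Regex(
        @"<iframe[ ]+(([a-z]+) *= *['""]*([a-zA-Z0-9\/:\.%]*)['""]*[ ]*)*>");

    // Returns the name/value pairs of the first iframe tag found, in source order.
    public static List<KeyValuePair<string, string>> Parse(string html)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        Match m = Rx.Match(html);
        if (!m.Success)
        {
            return pairs;
        }

        // Each repetition of the outer group adds one capture to groups 2 and 3.
        CaptureCollection names = m.Groups[2].Captures;
        CaptureCollection values = m.Groups[3].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            pairs.Add(new KeyValuePair<string, string>(names[i].Value, values[i].Value));
        }
        return pairs;
    }

    public static void Main()
    {
        string html = "<iframe width=\"100%\" height=\"427px\" "
                    + "src=\"https://www.youtube.com/embed/1\" frameborder=\"0\"></iframe>";
        foreach (var pair in Parse(html))
        {
            Console.WriteLine(pair.Key + " = " + pair.Value);
        }
    }
}
```

To handle every iframe in a document, the same loop can be run over Rx.Matches(html) instead of a single Match.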

Answer 4

Using the Html Agility Pack (available via NuGet):

using System;
using HtmlAgilityPack;

namespace Demo
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();
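            doc.Load("HTMLPage1.html"); // or doc.LoadHtml(contentString);

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection iframes = doc.DocumentNode.SelectNodes("//iframe");
            if (iframes == null)
            {
                return;
            }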

            foreach (HtmlNode iframe in iframes)
            {
                Console.WriteLine(iframe.GetAttributeValue("width","null"));
                Console.WriteLine(iframe.GetAttributeValue("height", "null"));
                Console.WriteLine(iframe.GetAttributeValue("src","null"));
            }

        }
    }
}

Answer 5

Here is a regex that ignores the order of the width, height, and src attributes (it expects the three to sit next to each other in the tag):

(?<=<iframe[^>]*?)(?:\s*width=["'](?<width>[^"']+)["']|\s*height=["'](?<height>[^'"]+)["']|\s*src=["'](?<src>[^'"]+)["'])+[^>]*?>

RegexStorm demo

C# sample code:

using System.Linq;
using System.Text.RegularExpressions;

var rx = new Regex(@"(?<=<iframe[^>]*?)(?:\s*width=[""'](?<width>[^""']+)[""']|\s*height=[""'](?<height>[^'""]+)[""']|\s*src=[""'](?<src>[^'""]+)[""'])+[^>]*?>");
var input = @"YOUR INPUT STRING";
var matches = rx.Matches(input).Cast<Match>().ToList();

Output:

[Image: screenshot of the resulting matches]

.net

c#

regex