How to find all URLs in an HTML text with regex?

Question

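I have the source code of a web page stored in a text box. I want to search the text box and put every link that belongs to the site's own domain (www.test.com) into a list of strings.

Example:

The source code in the text box contains the following links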

a href="index.html
a href="www.test.com/about_us.html
a href="mailto:test@test.com
a href="www.google.com/partners.html

I want to extract index.html and about_us.html and put them into a list of strings.

I tried:

    For Each i As Match In Regex.Matches(TextBox2.Text, "\b" + url + "\b")
        list1.Add(i.Value)
    Next

But it doesn't seem to work. Any help would be appreciated.

Answer 1

Like this:

    Imports System.Text.RegularExpressions

    Public Class Form1

        Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
            ' Match href=" followed by a host such as www.test.com.
            ' The dots are escaped so that they match a literal "." only.
            Dim Pattern As String = "href=" & """" & "(w{3}\.\w+\.\w{3})"
            Dim MyString As New Collection
            Dim regex As New Regex(Pattern, RegexOptions.Multiline)
            For Each match As Match In regex.Matches(TextBox1.Text)
                ' Groups(1) holds the captured host name.
                MyString.Add(match.Groups(1).Value)
            Next
        End Sub
    End Class

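Prerequisites:

You have the following:

A form named Form1
A button named Button1
A TextBox named TextBox1

VB inserts these names by default when you create these objects.

The Private Sub handles the button's Click event and stores the matches in a collection of strings.

You can add the following code to echo each match in a message box: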

    For Each member In MyString
        MsgBox(member)
    Next

Answer 2

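Try this regex:

    <a.+?href\s*=\s*(["'])(?<href>.+?)\1[^>]*>

with the IgnoreCase flag. The \1 backreference requires the value to end with the same quote character that opened it, so the href group captures the whole attribute value.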
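
As a rough illustration of how a pattern of this kind could be applied from VB.NET, the sketch below collects the href value of every anchor tag. The module name `HrefExtractor` and the function name `ExtractHrefs` are hypothetical, and the pattern is a variant written for this example (it uses a `\1` backreference to close the quote), not necessarily the exact expression intended above.

```vbnet
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module HrefExtractor
    ' Hypothetical helper: returns the raw value of every href attribute
    ' found inside an <a ...> tag of the given HTML text.
    Function ExtractHrefs(html As String) As List(Of String)
        ' (["']) captures the opening quote as group 1; \1 requires the same
        ' quote to close the value. In .NET, unnamed groups are numbered
        ' before named ones, so \1 refers to the quote, not to "href".
        Dim pattern As String = "<a\b[^>]*?href\s*=\s*([""'])(?<href>.*?)\1[^>]*>"
        Dim hrefs As New List(Of String)
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            hrefs.Add(m.Groups("href").Value)
        Next
        Return hrefs
    End Function
End Module
```

Calling `ExtractHrefs(TextBox1.Text)` would then give a list that still contains every link, including external and mailto ones; narrowing it down to the site's own domain is a separate step. Regular expressions only approximate HTML parsing, so unusual markup (unquoted attributes, for instance) is not covered by this sketch.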


Answer 3

I ended up using Stephan's answer and got the code I needed from regexhero:

        Dim strRegex As String = "<?href\s*=\s*[""'].+?[""'][^>]*?"
        Dim myRegex As New Regex(strRegex, RegexOptions.None)
        For Each myMatch As Match In myRegex.Matches(TextBox1.Text)
            If myMatch.Success Then
                ' Add your code here
            End If
        Next
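
What goes in place of the `' Add your code here` comment depends on the goal. For the original question (keeping only links on www.test.com), one possible approach is sketched below. The names `SiteLinkFilter` and `ToSitePath` are hypothetical, and the rule that a scheme-less value starting with "www." names a host rather than a relative file is an assumption made to fit the sample links in the question.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SiteLinkFilter
    ' Hypothetical helper: returns the site-relative path for links that
    ' belong to siteHost, or Nothing for everything else (mailto:, other hosts).
    Function ToSitePath(href As String, siteHost As String) As String
        Dim value As String = href.Trim()

        ' A leading "scheme:" other than http/https (mailto:, javascript:, ...)
        ' is not a page link. For http/https, strip the scheme and the slashes.
        Dim scheme As Match = Regex.Match(value, "^([a-z][a-z0-9+\-]*):", RegexOptions.IgnoreCase)
        If scheme.Success Then
            Dim name As String = scheme.Groups(1).Value.ToLowerInvariant()
            If name <> "http" AndAlso name <> "https" Then Return Nothing
            value = value.Substring(scheme.Length).TrimStart("/"c)
        End If

        ' Links on the site's own host: keep only the part after the host name.
        If value.StartsWith(siteHost, StringComparison.OrdinalIgnoreCase) Then
            Dim rest As String = value.Substring(siteHost.Length)
            If rest = "" OrElse rest.StartsWith("/", StringComparison.Ordinal) Then
                Return rest.TrimStart("/"c)
            End If
        End If

        ' Assumption: anything else that had a scheme or starts with "www."
        ' points at a different host.
        If scheme.Success OrElse value.StartsWith("www.", StringComparison.OrdinalIgnoreCase) Then
            Return Nothing
        End If

        ' Everything else is treated as a relative link on the current site.
        Return value
    End Function
End Module
```

With the sample links from the question, `ToSitePath("index.html", "www.test.com")` and `ToSitePath("www.test.com/about_us.html", "www.test.com")` yield `index.html` and `about_us.html`, while the mailto and www.google.com links yield Nothing and can be skipped before adding to the list. Protocol-relative links (`//host/path`) and fragments are not handled in this sketch.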
