Regex to extract first link on page inside another tag

Question

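I've been trying to set up a simple PHP API that basically retrieves information from another site in two steps. If a person were doing this by hand, it would involve:

Searching the website
Clicking the first result
Finding the information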

The site is set up in a predictable way. I know the format of its search URLs, so I can create the search URL using PHP and the input to the API.
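A minimal sketch of that URL-building step. The helper name `buildSearchUrl` is hypothetical, and the base URL and `query` parameter name are simply taken from the example URL used later in this question, not from the real site:

```php
<?php
// Hypothetical helper: builds the search URL from the API's input.
// The base URL and the "query" parameter mirror the example URL in this
// question; the real site's format may differ.
function buildSearchUrl(string $term): string
{
    // http_build_query() URL-encodes the value (spaces become "+").
    return 'https://www.example.com/?' . http_build_query(['query' => $term]);
}

echo buildSearchUrl('some query'), PHP_EOL;
// prints https://www.example.com/?query=some+query
```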

The link needed to get from step 1 to step 2 is formatted like this:

<h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4>

I only want somelinkhere, the hyperlink itself. I know it is the first hyperlink on the page that is contained in an <h4>.

I have tried several combinations of regular expressions with preg_match, but they have all failed. For example, here is one approach that did not work:

$url = "https://www.example.com/?query=somequery";
$input = @file_get_contents($url) or die("Could not access file: $url");
preg_match_all('/<h4><a [^>]*\bhref\s*=\s*"\K[^"]*[^"]*/', $text, $results);
echo "$results";
echo "$results[0]";
echo "$results[0][0]";

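I added the last three echo statements because I'm not very familiar with the format preg_match_all returns. I also tried preg_match, with the same result. I only care about the first such link, so I don't need preg_match_all, but if I can get the first result out of it, that works too.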

What's the best way to parse the page and get the first hyperlink inside an h4 into a variable?

Answer 1

Maybe, if you only want to extract the first h4, you could modify the expression to,

(?i)<h4><a [^>]*\bhref\s*=\s*"\s*([^"]*)\s*".*

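with the i (case-insensitive) flag, written inline here as (?i). In the code below the pattern also carries the s modifier, so the trailing .* can run across line breaks and consume the rest of the input. That is why only the first <h4> link is captured, even with preg_match_all.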

$re = '/(?i)<h4><a [^>]*\bhref\s*=\s*"\s*([^"]*)\s*".*/s';
$str = '<h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4><h4><a href="somelinkhere" class="search_result_title" title="sometitle" data-followable="true">Some Text Here</a></h4>
';

preg_match_all($re, $str, $matches, PREG_SET_ORDER, 0);

foreach ($matches as $match) {
    print($match[1]);
}

Output

somelinkhere
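Since only the first link is needed, preg_match alone should also do. Two details in the question's snippet are worth noting: the page is read into $input but $text is passed to preg_match_all, and echoing an array only prints "Array". By default preg_match_all fills $results[0] with all full matches and $results[1] with all group-1 captures, whereas preg_match fills $m[0] and $m[1] for the first match only. A minimal sketch with those points addressed, where the function name `firstH4Link` is hypothetical and inline sample HTML stands in for the fetched page:

```php
<?php
// Hypothetical helper: returns the href of the first <a> that directly
// follows an <h4>, or null when nothing matches.
function firstH4Link(string $html): ?string
{
    // preg_match() stops at the first match.
    // $m[0] is the whole match, $m[1] is the captured href value.
    $pattern = '/<h4>\s*<a\s[^>]*\bhref\s*=\s*"\s*([^"]*?)\s*"/i';
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Sample markup modelled on the question; a real page would come from
// file_get_contents($url) and be passed in the same variable.
$html = '<h4><a href="first" class="search_result_title">A</a></h4>'
      . '<h4><a href="second" class="search_result_title">B</a></h4>';

var_dump(firstH4Link($html)); // string(5) "first"
```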

If you wish to simplify/modify/explore the expression, it is explained in the top-right panel of regex101.com. If you'd like, you can also watch there how it would match against some sample inputs.
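As an alternative to a regular expression, which is a different technique from the one above, PHP's bundled DOM extension can parse the page and an XPath query can pick the first link inside an h4. This is only a sketch: it assumes the dom extension is enabled, and the function name `firstH4LinkDom` and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: returns the href of the first <a> nested anywhere
// inside an <h4>, or null if there is none.
function firstH4LinkDom(string $html): ?string
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML; collect libxml warnings
    // internally instead of emitting them, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // (...)[1] selects the first matching node in document order.
    $nodes = $xpath->query('(//h4//a[@href])[1]');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $nodes->item(0)->getAttribute('href');
}

$sample = '<h4><a href="first" class="search_result_title">A</a></h4>'
        . '<h4><a href="second" class="search_result_title">B</a></h4>';

var_dump(firstH4LinkDom($sample)); // string(5) "first"
```

Unlike the pattern above, this does not depend on the exact spacing or attribute order inside the tags, at the cost of parsing the whole document.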

php

html-parsing
