PHP Regex to find a substring from a big string - Matching start and end

Question

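I want to pull page titles out of a huge haystack of HTML, but the elements have no class or unique ID, so I can't use a DOM parser here. I know I have to use a regular expression. Here is an example of what I'm looking for: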

<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>

The output should be an array:

[0] => Series Hell In Heaven information
[1] => Series What is going information

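All series titles start with Series and end with information. From one big string containing many other things, I only want to extract the titles. I'm currently trying a regular expression, but it isn't working. This is what I'm doing right now: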

$reg = "/^Series\..*information$/";
$str = $html;
preg_match_all($reg, $str, $matches);
echo "<pre>";
    print_r($matches);
echo "</pre>";

I don't know much about writing regular expressions. Any help would be appreciated. Thanks.

Answer 1

Try this:

$str = '<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>';
preg_match_all('/Series(.*?)information/', $str, $matches);
echo "<pre>";
    print_r($matches);
echo "</pre>";

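The captures will be in $matches[1]. Basically, your regex doesn't match because of the \. (it requires a literal dot right after Series). The ^ and $ anchors also get in the way: without the /m flag they only match at the start and end of the whole string.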

[Edit]

If you also need the words Series and information, you don't need the capture group; just use /Series.*?information/ and find the matches in $matches[0].
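To make the difference between $matches[0] and $matches[1] concrete, here is a minimal sketch using the sample markup from the question. The variable names ($original, $none, $full, $inner) are only for illustration and are not part of the answer above.

```php
<?php
// Sample markup from the question.
$str = '<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/123">
    Series What is going information
</a>';

// The asker's pattern: \. demands a literal dot after "Series", and without
// the /m flag ^ and $ only match at the start and end of the whole string.
$original = preg_match_all('/^Series\..*information$/', $str, $none);

// No anchors, no literal dot; the lazy .*? stops at the first "information".
preg_match_all('/Series(.*?)information/', $str, $matches);

$full  = $matches[0];                    // whole matches, including both keywords
$inner = array_map('trim', $matches[1]); // only the text between the keywords

print_r($full);
print_r($inner);
```

The array_map('trim', ...) call is there because the captured text keeps the spaces that surround it in the source.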

Answer 2

Try

 preg_match_all('/(Series.+?information)/', $str, $matches );

as demonstrated here:

https://regex101.com/r/oJ0jZ4/1

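As I said in the comments, remove the literal \. dot and the start and end anchors. I would also use a non-greedy quantifier that requires at least one character: .+?

Otherwise you could match this: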

Seriesinformation

If the case of Series or information might vary, for example

series....Information

add the /i flag, like

     preg_match_all('/(Series.+?information)/i', $str, $matches );

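The outer capture group isn't really needed, but I think it looks better there. If you only want the variable content, without Series or information, move the capture group ( ) to that part.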

 preg_match_all('/Series(.+?)information/i', $str, $matches );

Note that you will need to trim() the matches, because they may have a space at the start and end, or add the spaces to the regex like this:

 preg_match_all('/Series\s(.+?)\sinformation/i', $str, $matches );

But this will exclude a match for Series information, where only a single space separates the two words.
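The trade-off between the two options can be sketched as follows. The input string and the variable names ($loose, $trimmed, $strict) are made up for illustration; only the two patterns come from the answer.

```php
<?php
// Hypothetical input: one normal title and one with nothing between the keywords.
$str = "Series Hell In Heaven information\nSeries information\n";

// Option 1: capture loosely, then trim each capture afterwards.
preg_match_all('/Series(.+?)information/i', $str, $loose);
$trimmed = array_map('trim', $loose[1]);

// Option 2: require a whitespace character on both sides of the capture.
// "Series information" has only one space, so it is not matched at all.
preg_match_all('/Series\s(.+?)\sinformation/i', $str, $strict);

print_r($trimmed);
print_r($strict[1]);
```

With option 1 the degenerate title still shows up, as an empty string after trimming; with option 2 it is silently dropped, so which one fits depends on whether such entries should be reported.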

If you want to make sure you don't match all of something like

[Series Hell In Heaven information Series Hell In Heaven information]

as a single match, you can use a positive lookbehind:

preg_match_all('/(Series.+?(?<=information))/i', $str, $matches );

Conversely, if a title might contain the word information twice

   <a href="http://example.com/123">
        Series information is power information
   </a>

you can do this

    preg_match_all('/(Series[^<]+)</i', $str, $matches );

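This matches everything up to the next <, as in </a. The captured text includes the whitespace before that tag, so trim() it as well.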
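A short sketch of why the lazy pattern falls short in this case and how the [^<]+ variant behaves. The variable names ($lazy, $upToTag, $titles) are illustrative only.

```php
<?php
// Markup from the example above, where "information" appears twice.
$str = '<a href="http://example.com/123">
     Series information is power information
</a>';

// The lazy pattern stops at the first "information" it can reach.
preg_match_all('/Series.+?information/i', $str, $lazy);

// [^<]+ consumes everything up to the next "<", newlines included,
// so the capture ends with whitespace that has to be trimmed.
preg_match_all('/(Series[^<]+)</i', $str, $upToTag);
$titles = array_map('trim', $upToTag[1]);

print_r($lazy[0]);
print_r($titles);
```

This relies on the title being followed directly by a tag; a literal < inside the title text would cut the match short.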

As a side note, you could use the phpQuery library (it is a DOM parser) and look for a tags that contain those words.

https://github.com/punkave/phpQuery

and

https://code.google.com/archive/p/phpquery/wikis/Manual.wiki

using something like

  $tags = $doc->getElementsByTagName("a:contains('Series')")->text();

It is an excellent library for parsing HTML.
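For comparison, the same idea can be sketched without a third-party library. Note that this swaps phpQuery for PHP's built-in DOMDocument class (it assumes the dom extension is enabled), so it is an alternative illustration rather than the phpQuery API; the sample markup and the $titles variable are assumptions made for the example.

```php
<?php
// Sample markup: the two links from the question plus one unrelated link.
$html = '<a href="http://example.com/xyz">
    Series Hell In Heaven information
</a>
<a href="http://example.com/other">Unrelated link</a>
<a href="http://example.com/123">
    Series What is going information
</a>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$titles = [];
foreach ($doc->getElementsByTagName('a') as $link) {
    $text = trim($link->textContent);
    // Keep only link texts that start with "Series" and end with "information".
    if (preg_match('/^Series\b.*\binformation$/s', $text)) {
        $titles[] = $text;
    }
}

print_r($titles);
```

Because the filter runs on each link's text rather than on the raw HTML, no class or ID is needed, and the anchors ^ and $ are safe to use here.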

php

regex

simple-html-dom