编写多个正则表达式模式来解析 HTML
Writing multiple regex pattern to parse HTML
我正在使用 file_get_contents()
获取 HTML 网页,我得到如下 table,有超过 150 行:
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol">
<a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
asdxxx
</a>
</td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
我想通过 preg_match_all()
调用获得 FIRST DATA
、SECOND DATA
、THIRD DATA
和 FOURTH DATA
。我尝试编写多个模式,但我无法成功。这是我尝试过的:
preg_match_all('/(<td class="tabcol tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);
什么是真正的模式?
试试这个:
$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;
preg_match_all('/<td[^>]*>(.*?)<\/td>/im', $str, $td_matches);
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);
echo $td_matches[1][0] . "\n";
echo $title[1] . "\n";
echo $href[1] . "\n";
echo $td_matches[1][3];
它没有直接回答您的问题,但这是正确的方法。
您应该避免使用正则表达式解析 HTML/XML 内容。不知道为什么?
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.
Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.
—
改用 DOM parser。下面是它的概况:
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;
$crawler = new Crawler($html);
$first = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();
var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"
更简单、更干净,对吧?
使用此类解析器,您还可以使用 XPath 提取元素。
我正在使用 file_get_contents()
获取 HTML 网页,我得到如下 table,有超过 150 行:
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol">
<a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
asdxxx
</a>
</td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
我想通过 preg_match_all()
调用获得 FIRST DATA
、SECOND DATA
、THIRD DATA
和 FOURTH DATA
。我尝试编写多个模式,但我无法成功。这是我尝试过的:
preg_match_all('/(<td class="tabcol tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);
什么是真正的模式?
试试这个:
$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;
preg_match_all('/<td[^>]*>(.*?)<\/td>/im', $str, $td_matches);
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);
echo $td_matches[1][0] . "\n";
echo $title[1] . "\n";
echo $href[1] . "\n";
echo $td_matches[1][3];
它没有直接回答您的问题,但这是正确的方法。
您应该避免使用正则表达式解析 HTML/XML 内容。不知道为什么?
Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.
Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.
—
改用 DOM parser。下面是它的概况:
composer require symfony/dom-crawler symfony/css-selector
<?php
require 'vendor/autoload.php';
use Symfony\Component\DomCrawler\Crawler;
$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA" href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;
$crawler = new Crawler($html);
$first = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();
var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"
更简单、更干净,对吧?
使用此类解析器,您还可以使用 XPath 提取元素。