How to get <a> tags in <body> but exclude header and footer sections

Question

If I have a web page like this:

<body>
  <header>
    <a href='http://domain1.com'>link 1 text</a>
  </header>

  <a href='http://domain2.com'>link 2 text</a>

  <footer>
    <a href='http://domain3.com'>link 3 text</a>
  </footer>
</body>

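How can I extract the <a> tags from <body> while excluding the links inside <header> and <footer>?

On a real web page the <header> contains many <a> tags, so I would rather not loop over all of them.

I want to extract the URL and anchor text from every <a> tag that is not inside a <header> or <footer> tag.

Edit: this is how I find the links in the header: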

$header = $html->find('header', 0);
foreach ($header->find('a') as $a) {
  // do something
}

I would like to do something like this (note the use of "!"):

// Wished-for syntax only: this is what I would like to be able to write.
$foo = $html->find('!header,!footer');
foreach ($foo->find('a') as $a) {
  // do something
}

Answer 1

This is certainly not possible with simple-html-dom. For example, you can't do this with it:

$html->find('body > a');

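This CSS selector selects all <a> elements whose parent is the <body> element.
Instead, you need to iterate over the child nodes of body and pick out the <a> elements.

I suggest taking a look at "How do you parse and process HTML/XML in PHP?"

For my part, I use Symfony/DomCrawler and Symfony/CssSelector to do this.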
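As one possible alternative to simple-html-dom, here is a minimal sketch that uses PHP's bundled DOM extension (DOMDocument and DOMXPath) instead of the Symfony components mentioned above. It assumes the page uses literal <header> and <footer> elements as in the question; the variable names ($source, $links) are only illustrative.

```php
<?php
// Sample markup copied from the question.
$source = <<<'EOD'
<body>
  <header>
    <a href='http://domain1.com'>link 1 text</a>
  </header>

  <a href='http://domain2.com'>link 2 text</a>

  <footer>
    <a href='http://domain3.com'>link 3 text</a>
  </footer>
</body>
EOD;

// Older libxml versions warn about HTML5 tags such as <header>; collect
// those warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($source);
libxml_clear_errors();

// Select every <a> under <body> that has no <header> or <footer> ancestor.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//body//a[not(ancestor::header) and not(ancestor::footer)]');

// Collect the URL and anchor text of each remaining link.
$links = [];
foreach ($nodes as $a) {
    $links[] = [
        'url'  => $a->getAttribute('href'),
        'text' => trim($a->textContent),
    ];
}

foreach ($links as $link) {
    echo $link['url'], ' => ', $link['text'], PHP_EOL;
}
```

Because the filtering happens inside the XPath expression, the links in the header and footer are never visited by the PHP loop.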

Answer 2

Remove the header and footer from the DOM you are working with before looking for links.

<?php
    include("simple_html_dom.php");
    $source = <<<EOD
    <body>
        <header>
            <a href='http://domain1.com'>link 1 text</a>
        </header>

        <a href='http://domain2.com'>link 2 text</a>

        <a href='http://domain4.com'>link 4 text</a>

        <footer>
            <a href='http://domain3.com'>link 3 text</a>
        </footer>
    </body>
EOD;

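    $html = str_get_html($source);

    // Blank out every <header> and <footer> so the links inside them disappear.
    foreach ($html->find('header, footer') as $unwanted) {
        $unwanted->outertext = "";
    }

    // Re-parse the modified markup so the emptied nodes are really gone.
    $html->load($html->save());

    // Only the links outside <header> and <footer> remain.
    $links = $html->find("a");
    foreach ($links as $link) {
        print $link;
    }

?>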

Answer 3

Without destroying the body? You could do this:

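// Collect the links we want to skip.
$bad_as = $html->find('header a, footer a');
foreach ($html->find('a') as $a) {
  // Strict mode compares object identity instead of comparing properties.
  if (in_array($a, $bad_as, true)) continue;
  // do something
}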
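The same "skip the unwanted links" idea can also be sketched without building a list of bad links, by walking up from each <a> and checking its ancestors. The example below is only an illustration and uses PHP's bundled DOM extension rather than simple-html-dom; hasAncestor() is a hypothetical helper, not part of any library.

```php
<?php
// Hypothetical helper: true if $node has an ancestor element whose tag
// name is listed in $tags.
function hasAncestor(DOMNode $node, array $tags): bool
{
    for ($p = $node->parentNode; $p !== null; $p = $p->parentNode) {
        if ($p instanceof DOMElement && in_array($p->tagName, $tags, true)) {
            return true;
        }
    }
    return false;
}

// Markup from the question, condensed onto a few lines.
libxml_use_internal_errors(true); // keep HTML5-tag warnings quiet
$doc = new DOMDocument();
$doc->loadHTML(
    "<body><header><a href='http://domain1.com'>link 1 text</a></header>"
    . "<a href='http://domain2.com'>link 2 text</a>"
    . "<footer><a href='http://domain3.com'>link 3 text</a></footer></body>"
);
libxml_clear_errors();

// Keep only the links that are not nested in <header> or <footer>.
$kept = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    if (hasAncestor($a, ['header', 'footer'])) {
        continue;
    }
    $kept[] = $a->getAttribute('href');
}

print_r($kept);
```

This still visits every <a> once, but the document itself is left untouched.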

php

simple-html-dom