PHP DOMDocument: extract links with their anchor text or image alt

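I want to extract every link on a page together with its anchor text, or with the alt attribute of an image contained in the link if the image comes first.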

$html = '<a href="lien.fr">Anchor</a>';

must return "lien.fr;Anchor"

$html = '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>';

must return "lien.fr;Alt Anchor"

$html = '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>';

must return "lien.fr;Anchor"

Here is what I have so far:

$doc = new DOMDocument();
$doc->loadHTML($html);

$out = []; // must be an array: it is filled with $out[$n]['link'] below
$n = 0;
$links = $doc->getElementsByTagName('a');

foreach ($links as $element) {
    $href = $img_alt = $anchor = "";
    $href = $element->getAttribute('href');
    $n++;
    if (strpos($href, "panier?") === false) {

        if ($element->firstChild->nodeName == "img") {

            $imgs = $element->getElementsByTagName('img');

            foreach ($imgs as $img) {
                if ($anchor = $img->getAttribute('alt')) {
                    break;
                }
            }
        }

        if (($anchor == "") && ($element->nodeValue)) {
            $anchor = $element->nodeValue;
        }

        $out[$n]['link'] = $href;
        $out[$n]['anchor'] = $anchor;
    }
}

This seems to work, but it fails when there is whitespace or indentation inside the link, as in:

$html = '<a href="link.fr">
                    <img src="ceinture-gris" alt="alt anchor"/>
                </a>';

In that case $element->firstChild->nodeName is "#text" (the whitespace before the image), not "img".
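
To see the problem concretely, here is a small diagnostic sketch (not part of the original code) that collects the node names of the link's direct children. The variable names $link and $names are only illustrative. With the indented markup above, the first name collected is "#text", which is why the firstChild check never sees the image.

```php
<?php
// Same indented markup as in the failing example above.
$html = '<a href="link.fr">
                    <img src="ceinture-gris" alt="alt anchor"/>
                </a>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Take the first (and only) link in the document.
$link = $doc->getElementsByTagName('a')->item(0);

// Collect the node name of every direct child, text nodes included.
$names = [];
foreach ($link->childNodes as $child) {
    $names[] = $child->nodeName;
}

// Whitespace-only text nodes show up as "#text" next to "img".
echo implode(', ', $names), "\n";
```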

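You could do something like this, iterating over the child nodes and skipping whitespace-only text nodes: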

$doc = new DOMDocument();
$doc->loadHTML($html);

// Output texts that will later be joined with ';'
$out = [];
// Maximum number of items to add to $out
$max_out_items = 2;
// List of img tag attributes that will be parsed by the loop below
// (in the order specified in this array!)
$img_attributes = ['alt', 'src', 'title'];

$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
  if ($href = trim($element->getAttribute('href'))) {
    $out []= $href;
    if (count($out) >= $max_out_items)
      break;
  }

  foreach ($element->childNodes as $child) {
    if ($child->nodeType === XML_TEXT_NODE &&
      ($text = trim($child->nodeValue)))
    {
      $out []= $text;
      // Leave both loops: a plain break would only exit the inner one
      if (count($out) >= $max_out_items)
        break 2;
    } elseif ($child->nodeName == 'img') {
      foreach ($img_attributes as $attr_name) {
        if ($attr_value = trim($child->getAttribute($attr_name))) {
          $out []= $attr_value;
          if (count($out) >= $max_out_items)
            goto Result;
        }
      }
    }
  }
}

Result:
echo $out = implode(';', $out);
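
The code above stops after the first link. If every link on the page is needed, as in the question, the same idea can be wrapped in a function. The following is only a sketch under some assumptions: extractLinks is a hypothetical helper name, links without an href are skipped, only direct children of the link are inspected (text nested in a span, for example, is ignored), and only the alt attribute of an image is used.

```php
<?php
/**
 * Hypothetical helper (not from the original post): returns one
 * "href;anchor" string per link found in $html.
 */
function extractLinks(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of printing them,
    // since real-world HTML is rarely perfectly valid.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = trim($link->getAttribute('href'));
        if ($href === '') {
            continue; // assumption: links without an href are ignored
        }

        // Take whichever comes first among the direct children:
        // non-blank text, or an image with a non-empty alt.
        $anchor = '';
        foreach ($link->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $anchor = trim($child->nodeValue);
            } elseif ($child->nodeName === 'img') {
                $anchor = trim($child->getAttribute('alt'));
            }
            if ($anchor !== '') {
                break;
            }
        }

        $results[] = $href . ';' . $anchor;
    }

    return $results;
}

print_r(extractLinks('<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>'));
```

Because whitespace-only text nodes trim to an empty string, they are skipped and the indented markup from the question is handled the same way as the compact one.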