PHP DOMDocument: extract links with anchor text or alt
I want to extract every link on a page together with its anchor text, or with the alt attribute of an image inside the link if the image comes first.
$html = '<a href="lien.fr">Anchor</a>';
must return "lien.fr;Anchor"
$html = '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>';
must return "lien.fr;Alt Anchor"
$html = '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>';
must return "lien.fr;Anchor"
Here is what I did:
$doc = new DOMDocument();
$doc->loadHTML($html);
$out = []; // one ['link' => ..., 'anchor' => ...] entry per kept link
$n = 0;
$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
    $anchor = "";
    $href = $element->getAttribute('href');
    $n++;
    // Skip links containing "panier?"; strpos() may return 0, so compare with false
    if (strpos($href, "panier?") === false) {
        // Use the first non-empty alt only when the link starts with an image
        if ($element->firstChild && $element->firstChild->nodeName == "img") {
            $imgs = $element->getElementsByTagName('img');
            foreach ($imgs as $img) {
                if ($anchor = $img->getAttribute('alt')) {
                    break;
                }
            }
        }
        // Otherwise fall back to the text content of the link
        if (($anchor == "") && ($element->nodeValue)) {
            $anchor = $element->nodeValue;
        }
        $out[$n]['link'] = $href;
        $out[$n]['anchor'] = $anchor;
    }
}
This seems to work, but it fails as soon as there is whitespace or indentation inside the link, for example:
$html = '<a href="link.fr">
<img src="ceinture-gris" alt="alt anchor"/>
</a>';
Here $element->firstChild->nodeName will be "#text" (the whitespace before the image) instead of "img".
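To make the problem visible, here is a minimal sketch that lists the direct children of the link from the example above. The variable names ($a, $names) are only illustrative, and the exact number of whitespace nodes may depend on the libxml version in use.

```php
<?php
// Same markup as above: a newline sits between <a ...> and <img>.
$html = '<a href="link.fr">
<img src="ceinture-gris" alt="alt anchor"/>
</a>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// First (and only) <a> element of the parsed document
$a = $doc->getElementsByTagName('a')->item(0);

// Collect the node names of the direct children, in document order
$names = [];
foreach ($a->childNodes as $child) {
    $names[] = $child->nodeName;
}

// The first entry is the whitespace text node, so the "img" test on firstChild fails
echo implode(', ', $names), "\n";
```

Any check that relies on firstChild therefore has to skip text nodes that are empty after trim().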
Something like this should work:
$doc = new DOMDocument();
$doc->loadHTML($html);

// Output texts that will later be joined with ';'
$out = [];

// Maximum number of items to add to $out
$max_out_items = 2;

// List of img tag attributes that will be parsed by the loop below
// (in the order specified in this array!)
$img_attributes = ['alt', 'src', 'title'];

$links = $doc->getElementsByTagName('a');
foreach ($links as $element) {
    if ($href = trim($element->getAttribute('href'))) {
        $out[] = $href;
        if (count($out) >= $max_out_items) {
            break;
        }
    }

    foreach ($element->childNodes as $child) {
        // Whitespace-only text nodes are empty after trim() and are skipped
        if ($child->nodeType === XML_TEXT_NODE &&
            ($text = trim($child->nodeValue)) !== '')
        {
            $out[] = $text;
            if (count($out) >= $max_out_items) {
                // A plain break would only leave the inner loop
                goto Result;
            }
        } elseif ($child->nodeName == 'img') {
            foreach ($img_attributes as $attr_name) {
                if ($attr_value = trim($child->getAttribute($attr_name))) {
                    $out[] = $attr_value;
                    if (count($out) >= $max_out_items) {
                        goto Result;
                    }
                }
            }
        }
    }
}

Result:
echo $out = implode(';', $out);
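The snippet above stops after the first link. If one "href;anchor" string per link is wanted, the same child-node walk could be wrapped in a function. This is only a sketch under a few assumptions: extractLinks() is a hypothetical helper name, only the alt attribute is considered (as in the question), links without an href are skipped, and the "panier?" filter from the question is left out.

```php
<?php
/**
 * Hypothetical helper (not part of DOMDocument): returns one
 * "href;anchor" string per <a> element found in $html.
 */
function extractLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $results = [];
    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = trim($link->getAttribute('href'));
        if ($href === '') {
            continue; // assumption: links without an href are not wanted
        }

        // Take the first direct child that yields non-empty text or alt
        $anchor = '';
        foreach ($link->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $anchor = trim($child->nodeValue);
            } elseif ($child->nodeType === XML_ELEMENT_NODE
                && $child->nodeName === 'img') {
                $anchor = trim($child->getAttribute('alt'));
            }
            if ($anchor !== '') {
                break;
            }
        }

        $results[] = $href . ';' . $anchor;
    }

    return $results;
}

// Usage with the three cases from the question
echo implode("\n", extractLinks(
    '<a href="lien.fr">Anchor</a>'
    . '<a href="lien.fr"><img alt="Alt Anchor">Anchor</a>'
    . '<a href="lien.fr">Anchor<img alt="Alt Anchor"></a>'
)), "\n";
```

Because whitespace-only text nodes are empty after trim(), the indented example from the question falls through to the image and yields "link.fr;alt anchor". Text nested inside other elements (for example a span inside the link) is not picked up by this sketch.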