Trouble parsing just img src from RSS feed?
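I am trying to create an RSS reader based on the following example:
http://www.w3schools.com/php/php_ajax_rss_reader.asp
Specifically, I am trying to modify this example so that the reader can access and display all of the available comic images (and nothing else) from any given webcomic RSS feed. I realize it may be necessary to make the code at least somewhat site-specific, but I am trying to keep it as general as possible. So far, I have modified the initial example to produce a reader that displays all of the comics from a given list of RSS feeds. However, it also displays other unwanted text that I am trying to get rid of. Here is my code so far, along with a few feeds that are giving me particular trouble:
index.php file: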
<html>
<head>
<script>
function showRSS()
{
    if (window.XMLHttpRequest)
    {
        // code for IE7+, Firefox, Chrome, Opera, Safari
        xmlhttp=new XMLHttpRequest();
    } else
    { // code for IE6, IE5
        xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
    }
    xmlhttp.onreadystatechange=function()
    {
        if (xmlhttp.readyState==4 && xmlhttp.status==200)
        {
            document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
        }
    }
    xmlhttp.open("GET","logger.php",true);
    xmlhttp.send();
}
</script>
</head>
<body onload="showRSS()">
<div id="rssOutput"></div>
</body>
</html>
(I'm fairly sure this file is fine; I think the problem is in the next file, but I've included this one for completeness.)
logger.php:
<?php
//function to get all comics from an rss feed
function getComics($xml)
{
    $xmlDoc = new DOMDocument();
    $xmlDoc->load($xml);
    $x=$xmlDoc->getElementsByTagName('item');
    foreach ($x as $x)
    {
        $comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
        //output the comic
        echo ($comic_image . "</p>");
        echo ("<br>");
    }
}
//create array of all RSS feed URLs
$URLs =
[
    "SMBC" => "http://www.smbc-comics.com/rss.php",
    "garfieldMinusGarfield" => "http://garfieldminusgarfield.net/rss",
    "babyBlues" => "http://www.comicsyndicate.org/Feed/Baby%20Blues",
];
//Loop through all RSS feeds
foreach ($URLs as $xml)
{
    getComics($xml);
}
?>
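Because this method includes extra text between the comic images (SMBC has a lot of random content, gMg just some ad links, Baby Blues a copyright link), I looked at the RSS feeds and concluded that the problem is that the description tag contains the image source but also contains other content. Next, I tried modifying the getComics function to scan directly for image tags instead of first looking for description tags. I replaced the part between the DOMDocument creation/loading and the URL list with: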
$images=$xmlDoc->getElementsByTagName('img');
print_r($images);
foreach ($images as $image)
{
    //echo $image->item(0)->getAttribute('src');
    echo $image->item(0)->nodeValue;
    echo ("<br>");
}
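But apparently getElementsByTagName does not pick up the image tags embedded inside the description tags, because no comic images were output, and the print_r statement printed the following: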
DOMNodeList Object ( [length] => 0 ) DOMNodeList Object ( [length] => 0 )
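Finally, I tried a combination of the two approaches, using getElementsByTagName('img') inside the code that parses out the contents of the description tag. I replaced the line: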
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
with:
$comic_image=$x->getElementsByTagName('description')->item(0)->getElementsByTagName('img');
print_r($comic_image);
But this also found nothing, producing the output:
DOMNodeList Object ( [length] => 0 )
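Sorry for the long background, but I'm wondering whether there is a way to parse just the img src out of a given RSS feed, without the other text and links I don't want?
Any help would be appreciated.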
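Inside the feed, the description content is escaped, so the XML parser treats it as plain text rather than as child elements, which is why getElementsByTagName('img') finds nothing. Re-parsing that text as HTML should work: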
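foreach ($x as $y) {
    $description = $y->getElementsByTagName('description')->item(0);
    // decode any entities that are still escaped in the description text
    $decoded_description = htmlspecialchars_decode($description->nodeValue);
    // re-parse the embedded markup as HTML so the img tag becomes a real element
    $description_xml = new DOMDocument();
    $description_xml->loadHTML($decoded_description);
    $img = $description_xml->getElementsByTagName('img')->item(0);
    if ($img === null) {
        continue; // this item has no image in its description
    }
    $comic_image = $img->getAttribute('src');
    //output the comic
    echo ($comic_image);
    echo ("<br>");
}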
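To see why the direct lookup comes back empty, here is a minimal sketch using a made-up one-item feed (the feed content and the example.com URL are hypothetical, not taken from any of the feeds in the question). It shows that the parser exposes the description as text with no img elements, and that re-parsing that text as HTML makes the img tag reachable.

```php
<?php
// Hypothetical one-item feed; the markup inside <description> is
// entity-escaped, as is common in RSS feeds.
$rss = <<<XML
<rss version="2.0"><channel><item>
<title>Example</title>
<description>&lt;p&gt;Ad text&lt;/p&gt;&lt;img src="http://example.com/comic.png"&gt;</description>
</item></channel></rss>
XML;

$feed = new DOMDocument();
$feed->loadXML($rss);

// Searching the feed itself finds no img elements: the escaped markup is text.
$direct = $feed->getElementsByTagName('img')->length;

// The description's text is a string that contains the markup.
$description = $feed->getElementsByTagName('description')->item(0);
$markup = $description->nodeValue;

// Re-parse that string as HTML; now the img tag is a real element.
$html = new DOMDocument();
$html->loadHTML($markup);
$src = $html->getElementsByTagName('img')->item(0)->getAttribute('src');

echo "img elements found directly: $direct\n";
echo "src after re-parsing: $src\n";
```

In this sketch nodeValue already holds the unescaped markup, so no extra decoding step is used. A feed that escapes its content twice may still need htmlspecialchars_decode as in the code above.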
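For the benefit of anyone reading this thread later, here is the code I ended up with. I replaced everything inside the foreach loop with a call to a getImageSrc function, which in turn calls a getImageTag function: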
//function to find an image tag within a specific section if there is one
function getImageTag ($item,$tagName)
{
    //pull desired section from given item
    $section = $item->getElementsByTagName($tagName)->item(0);
    if (is_null($section)) //the item has no such section, so there is nothing to search
    {
        return null;
    }
    //reparse the section's text as HTML, because the markup is stored as escaped text and getElementsByTagName can't reach the img tag directly
    $decoded_section = htmlspecialchars_decode($section->nodeValue);
    $section_xml = new DOMDocument();
    @$section_xml->loadHTML($decoded_section); //the @ is to suppress a bunch of warnings about characters this parser doesn't like
    //pull image tag from section if there
    $image_tag = $section_xml->getElementsByTagName('img')->item(0);
    return $image_tag;
}
//function to get the image source URL from a given item
function getImageSrc ($item)
{
    $image_tag = getImageTag($item,'description');
    if (is_null($image_tag)) //if there was nothing with the tag name of image in the description section
    {
        //check in content:encoded section, because that's the next most likely place
        $image_tag = getImageTag($item,'encoded');
        if (is_null($image_tag)) //if there was nothing with the tag name of image in the encoded content section
        {
            //if the program gets here, it's probably because the feed is crap and doesn't include images,
            //or it's because this particular item doesn't have a comic image in it
            $image_src = '';
            //THIS EXCEPTION WILL PROBABLY NEED TO BE HANDLED LATER TO AVOID POTENTIAL ERRORS
        } else
        {
            $image_src = $image_tag->getAttribute('src');
        }
    } else
    {
        $image_src = $image_tag->getAttribute('src');
    }
    return $image_src;
}
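A possible variation on the same idea, as a rough sketch rather than a drop-in replacement: it collects parser warnings with libxml_use_internal_errors instead of silencing them with @, and looks up content:encoded by its namespace. The names firstImageSrc, comicSources, and CONTENT_NS, along with the sample feed and its example.com URLs, are made up for illustration.

```php
<?php
// Namespace URI of the RSS content module that defines <content:encoded>.
const CONTENT_NS = 'http://purl.org/rss/1.0/modules/content/';

// Return the src of the first img found in an HTML fragment, or null.
function firstImageSrc(string $markup): ?string
{
    if (trim($markup) === '') {
        return null; // loadHTML rejects empty input on PHP 8
    }
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $html = new DOMDocument();
    $html->loadHTML($markup);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $img = $html->getElementsByTagName('img')->item(0);
    return $img === null ? null : $img->getAttribute('src');
}

// Return one image URL per feed item that has one, checking <description>
// first and <content:encoded> second.
function comicSources(string $rssXml): array
{
    $feed = new DOMDocument();
    $feed->loadXML($rssXml);
    $sources = [];
    foreach ($feed->getElementsByTagName('item') as $item) {
        $candidates = [
            $item->getElementsByTagName('description')->item(0),
            $item->getElementsByTagNameNS(CONTENT_NS, 'encoded')->item(0),
        ];
        foreach ($candidates as $section) {
            $src = $section === null ? null : firstImageSrc($section->textContent);
            if ($src !== null) {
                $sources[] = $src;
                break; // stop at the first section that contains an image
            }
        }
    }
    return $sources;
}

// Hypothetical two-item feed used only to exercise the functions.
$rss = <<<XML
<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel>
<item><description>&lt;img src="http://example.com/a.png"&gt; some text</description></item>
<item><description>No image here</description>
<content:encoded><![CDATA[<p><img src="http://example.com/b.png"></p>]]></content:encoded></item>
</channel></rss>
XML;

print_r(comicSources($rss));
```

With the sample feed, this should yield the two example.com URLs in feed order. Items with no image in either section are simply skipped, which sidesteps the empty-string case noted in the code above.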