How do I extract all URL links from an RSS feed?
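I need to periodically extract the links to all news articles from the NY Times RSS feed into a MySQL database. How should I go about this? Can I use a regular expression (in PHP) to match the links, or is there a better alternative? Thanks in advance.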
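Update 2: I tested the code below and had to change
$links = $dom->getElementsByTagName('a');
to:
$links = $dom->getElementsByTagName('link');
The links were then output successfully. Good luck.
Update: There appears to be a complete answer here: How do you parse and process HTML/XML in PHP.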
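I developed a solution that lets me recurse through all the links on my website. I removed the code that checks that each recursion stays on the same domain (the question did not ask for it), but you can easily add it back if you need it.
Using PHP's DOMDocument, you can parse an HTML or XML document to read the links. This is more robust than using regular expressions. Try something like this: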
<?php
//300 seconds = 5 minutes - or however long you need so php won't time out
ini_set('max_execution_time', 300);
// $alinks collects every link found. It is passed to linksearch() by reference,
// so recursive calls all append to the same array.
$alinks = array();
// set the link to whatever you are reading
$link = "http://rss.nytimes.com/services/xml/rss/nyt/HomePage.xml";
// do the search
linksearch($link, $alinks);
// show results
var_dump($alinks);
function linksearch($url, & $alinks) {
// use $queue if you want this fn to be recursive
$queue = array();
echo "<br>Searching: $url";
$href = array();
//Load the HTML page
$html = file_get_contents($url);
//Create a new DOM document
$dom = new DOMDocument;
//Parse the HTML. The @ is used to suppress any parsing errors
//that will be thrown if the $html string isn't valid XHTML.
@$dom->loadHTML($html);
//Get all links. You could also use any other tag name here,
//like 'img' or 'table', to extract other tags.
$links = $dom->getElementsByTagName('link');
//Iterate over the extracted links and display their URLs
foreach ($links as $link){
//Extract and show the "href" attribute.
$href[] = $link->getAttribute('href');
}
foreach (array_unique($href) as $link) {
// add to list of links found
$queue[] = $link;
}
// remove duplicates
$queue = array_unique($queue);
// get links that haven't yet been processed
$queue = array_diff($queue, $alinks);
// update array passed by reference with new links found
$alinks = array_merge($alinks, $queue);
if (count($queue) > 0) {
foreach ($queue as $link) {
// recursive search - uncomment out if you use this
// remember to check that the domain is the same as the one starting from
// linksearch($link, $alinks);
}
}
}
DOM + XPath lets you fetch nodes using expressions.
RSS item links
Fetch the RSS link elements (one link per item):
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$expression = '//channel/item/link';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->textContent);
}
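As a minimal sketch of the same idea wrapped in a reusable function: the helper name `extractItemLinks` and the sample feed are made up for illustration and are not part of the answer above. It parses the feed from a string, so it runs without network access, and it de-duplicates the URLs, which may be useful before writing them to a database.

```php
<?php
// Hypothetical helper (not from the original answer): returns the unique
// URLs found in //channel/item/link, in document order.
function extractItemLinks(string $xml): array
{
    $document = new DOMDocument();
    // loadXML() returns false on malformed input; suppress the warning
    // and report "no links" instead.
    if (!@$document->loadXML($xml)) {
        return [];
    }
    $xpath = new DOMXPath($document);
    $links = [];
    foreach ($xpath->evaluate('//channel/item/link') as $link) {
        $url = trim($link->textContent);
        if ($url !== '') {
            $links[$url] = true; // array keys de-duplicate the URLs
        }
    }
    return array_keys($links);
}

// Made-up sample feed; a real feed would come from file_get_contents($url).
$sampleFeed = <<<'XML'
<rss version="2.0">
  <channel>
    <title>Example</title>
    <link>https://example.com/</link>
    <item><title>A</title><link>https://example.com/a</link></item>
    <item><title>B</title><link>https://example.com/b</link></item>
    <item><title>A again</title><link>https://example.com/a</link></item>
  </channel>
</rss>
XML;

print_r(extractItemLinks($sampleFeed));
```

Note that the channel's own `link` element is not matched, because the expression only selects `link` children of `item`.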
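Atom links
The atom:link elements have different semantics. They are part of the Atom namespace and are used to describe relations. The NYT uses the standout relation to mark featured stories. To fetch the Atom links you need to register a prefix for the namespace. Attributes are nodes too, so you can fetch them directly: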
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/a:link[@rel="standout"]/@href';
foreach ($xpath->evaluate($expression) as $link) {
var_dump($link->value);
}
There are other relations as well, such as prev and next.
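A small sketch of how the relation could be made a parameter: the function name `extractAtomLinks` and the sample feed are assumptions for illustration only. The relation is compared in PHP rather than interpolated into the XPath expression, which avoids quoting problems. The sample document uses the prefix `atom` while the XPath uses `a`; only the namespace URI has to match.

```php
<?php
// Hypothetical helper (not from the original answer): href values of
// atom:link elements inside RSS items that carry the given rel value.
function extractAtomLinks(string $xml, string $rel): array
{
    $document = new DOMDocument();
    if (!@$document->loadXML($xml)) {
        return []; // malformed XML: nothing to extract
    }
    $xpath = new DOMXPath($document);
    // The prefix "a" is local to this XPath object; it does not need to
    // match the prefix used in the feed itself.
    $xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
    $hrefs = [];
    foreach ($xpath->evaluate('//channel/item/a:link[@href]') as $link) {
        if ($link->getAttribute('rel') === $rel) {
            $hrefs[] = $link->getAttribute('href');
        }
    }
    return $hrefs;
}

// Made-up sample feed for illustration.
$atomFeed = <<<'XML'
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <item>
      <link>https://example.com/a</link>
      <atom:link rel="standout" href="https://example.com/a"/>
    </item>
    <item>
      <link>https://example.com/b</link>
      <atom:link rel="alternate" href="https://example.com/b-alt"/>
    </item>
  </channel>
</rss>
XML;

print_r(extractAtomLinks($atomFeed, 'standout'));
```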
HTML links (a elements)
The description elements contain HTML fragments. To extract the links from them, you have to load the HTML into a separate DOM document.
$xml = file_get_contents($url);
$document = new DOMDocument();
$document->loadXml($xml);
$xpath = new DOMXPath($document);
$xpath->registerNamespace('a', 'http://www.w3.org/2005/Atom');
$expression = '//channel/item/description';
foreach ($xpath->evaluate($expression) as $description) {
$fragment = new DOMDocument();
$fragment->loadHtml($description->textContent);
$fragmentXpath = new DOMXpath($fragment);
foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $link) {
var_dump($link->value);
}
}
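Real-world description fragments are often empty or not well formed, so a slightly more defensive variant may be worth considering. The sketch below is an assumption-laden illustration (the name `extractDescriptionLinks` and the sample feed are invented): it skips empty descriptions, since `loadHTML()` rejects an empty string on PHP 8, and it uses `libxml_use_internal_errors()` so that sloppy markup does not produce PHP warnings.

```php
<?php
// Hypothetical helper (not from the original answer): href values of all
// <a> elements found in the HTML fragments of item descriptions.
function extractDescriptionLinks(string $xml): array
{
    $document = new DOMDocument();
    if (!@$document->loadXML($xml)) {
        return []; // malformed XML: nothing to extract
    }
    $xpath = new DOMXPath($document);
    $hrefs = [];
    // Collect libxml parse errors internally instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    foreach ($xpath->evaluate('//channel/item/description') as $description) {
        $html = trim($description->textContent);
        if ($html === '') {
            continue; // nothing to parse for an empty description
        }
        $fragment = new DOMDocument();
        $fragment->loadHTML($html);
        $fragmentXpath = new DOMXPath($fragment);
        foreach ($fragmentXpath->evaluate('//a[@href]/@href') as $href) {
            $hrefs[] = $href->value;
        }
    }
    // Restore the previous error-handling mode.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $hrefs;
}

// Made-up sample feed for illustration.
$htmlFeed = <<<'XML'
<rss version="2.0">
  <channel>
    <item><description><![CDATA[<p>Read <a href="https://example.com/x">more</a> or <a href="https://example.com/y">this</a>.</p>]]></description></item>
    <item><description>Plain text, no links</description></item>
    <item><description></description></item>
  </channel>
</rss>
XML;

print_r(extractDescriptionLinks($htmlFeed));
```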