preg_match_all: how to get all links?
I am trying to use preg_match_all to get all image links from a page I am scraping. The links start with http://i.ebayimg.com/ and end with .jpg. I can't get it right... :( I tried this, but it is not what I need:
preg_match_all('/(http|https|ftp|ftps)\:\/\/[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(\/\S*)?/', $contentas, $img_link);
I have the same problem with normal links... I don't know how to write the preg_match_all for a tag like this:
<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&daysAfterCreation=7&isSearchRequest=true&withImage=true&scopeId=C&categories=Limousine&damageUnrepaired=NO_DAMAGE_UNREPAIRED&zipcode=&fuels=DIESEL&ambitCountry=DE&maxPrice=11000&minFirstRegistrationDate=2006-01-01&makeModelVariant1.makeId=3500&makeModelVariant1.modelId=20&pageNumber=1" data-touch="hover" data-touch-wrapper=".cBox-body--resultitem">
Thanks a lot!!!
UPDATE
I am scraping this page:
http://suchen.mobile.de/fahrzeuge/search.html?isSearchRequest=true&scopeId=C&makeModelVariant1.makeId=1900&makeModelVariant1.modelId=10&makeModelVariant1.modelDescription=&makeModelVariantExclusions%5B0%5D.makeId=&categories=Limousine&minSeats=&maxSeats=&doorCount=&minFirstRegistrationDate=2006-01-01&maxFirstRegistrationDate=&minMileage=&maxMileage=&minPrice=&maxPrice=11000&minPowerAsArray=&maxPowerAsArray=&maxPowerAsArray=PS&minPowerAsArray=PS&fuels=DIESEL&minCubicCapacity=&maxCubicCapacity=&ambitCountry=DE&zipcode=&q=&climatisation=&airbag=&daysAfterCreation=7&withImage=true&adLimitation=&export=&vatable=&maxConsumptionCombined=&emissionClass=&emissionsSticker=&damageUnrepaired=NO_DAMAGE_UNREPAIRED&numberOfPreviousOwners=&minHu=&usedCarSeals=
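I want to get the car links, the image links and all the other information. Extracting the information works and my script runs fine, but I have trouble grabbing the images and the links. Here is my script: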
<?php
$id= $_GET['id'];
$user= $_GET['user'];
$login=$_COOKIE['login'];
$query = mysql_query("SELECT pavadinimas,nuoroda,kuras,data,data_new from mobile where vartotojas='$user' and id='$id'");
$rezultatas=mysql_fetch_row($query);
$url = "$rezultatas[1]";
$info = file_get_contents($url);
function scrape_between($data, $start, $end){
$data = stristr($data, $start);
$data = substr($data, strlen($start));
$stop = stripos($data, $end);
$data = substr($data, 0, $stop);
return $data;
}
//turinio iskirpimas
$turinys = scrape_between($info, '<div class="g-col-9">', '<footer class="footer">');
//filtravimas naikinami mokami top skelbimai
$contentas = preg_replace('/<div class="cBox-body cBox-body--topResultitem".*?>(.*?)<\/div>/', '' ,$turinys);
//filtravimas baigtas
preg_match_all('/<span class="h3".*?>(.*?)<\/span>/',$contentas,$pavadinimas);
preg_match_all('/<span class="u-block u-pad-top-9 rbt-onlineSince".*?>(.*?)<\/span>/',$contentas,$data);
preg_match_all('/<span class="u-block u-pad-top-9".*?>(.*?)<\/span>/',$contentas,$miestas);
preg_match_all('/<span class="h3 u-block".*?>(.*?)<\/span>/', $contentas, $kaina);
preg_match_all('/<a[A-z0-9-_:="\.\/ ]+href="(http:\/\/suchen.mobile.de\/fahrzeuge\/[^"]*)"[A-z0-9-_:="\.\/ ]\s*>\s*<div/s', $contentas, $matches);
print_r($pavadinimas);
print_r($data);
print_r($miestas);
print_r($kaina);
print_r($result);
print_r($matches);
?>
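1. Capture the src attribute of all img tags when it starts with http://i.ebayimg.com/ and ends with .jpg:
Regex: /src="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i
Here is an example:
$re = '/src="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i';
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
// $matches[1] holds the captured URLs
The dots are escaped so that they match a literal "." only, and [^"]+ keeps a match from running past the closing quote of the attribute.
If you want to make sure the URL is captured on an img tag, use this regex instead (keep in mind that performance will degrade if the page is very long):
$re = '/<img[^>]*?src="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i';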
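To make the layout of $matches concrete, here is a minimal sketch run against a made-up HTML fragment (the image paths are invented for illustration, and $imageLinks is just a name chosen here). It uses a pattern of the same shape, anchored on the img tag, with the dots escaped and [^"]+ for the URL body.

```php
<?php
// Hypothetical HTML fragment; the image URLs are invented for this example.
$html = <<<'HTML'
<img class="thumb" src="http://i.ebayimg.com/00/s/abc/$_2.jpg" alt="">
<img src="https://i.ebayimg.com/00/s/def/$_2.JPG">
<img src="http://example.com/logo.jpg">
<img src="http://i.ebayimg.com/00/s/ghi/$_2.png">
HTML;

// Group 1 captures the URL: host i.ebayimg.com, value ending in .jpg.
// [^>]* stays inside the tag, [^"]+ stays inside the attribute value.
$re = '/<img\b[^>]*\bsrc="(https?:\/\/i\.ebayimg\.com\/[^"]+\.jpg)"/i';

preg_match_all($re, $html, $matches);

// $matches[0] holds the full matches, $matches[1] the captured URLs.
$imageLinks = $matches[1];
print_r($imageLinks);
```

Only the first two tags qualify: the third has a different host and the fourth is not a .jpg. The /i flag is what lets the upper-case .JPG through.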
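2. Capture the href attribute of all a tags when it starts with http://suchen.mobile.de/fahrzeuge/:
Regex: /href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i
Here is an example:
$re = '/href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i';
$str = "codeOfHTMLPage";
preg_match_all($re, $str, $matches);
// $matches[1] holds the captured URLs
Unlike the image pattern, this one does not require a .jpg ending, because the car links end in a query string.
If you want to make sure the URL is captured on an a tag, use this regex instead (keep in mind that performance will degrade if the page is very long):
$re = '/<a\b[^>]*?href="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i';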
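As a worked example for the link case, the sketch below runs a pattern of this shape over a shortened, made-up pair of anchor tags modelled on the one in the question. One assumption worth flagging: in raw HTML source the & characters inside an href are often written as &amp;amp;, so the captured values are passed through html_entity_decode before use. The variable $carLinks is a name chosen for this example.

```php
<?php
// Shortened, hypothetical markup modelled on the anchor tag from the question.
$html = '<a class="link--muted" '
      . 'href="http://suchen.mobile.de/fahrzeuge/details.html?id=218930381&amp;pageNumber=1" '
      . 'data-touch="hover">Car</a>'
      . '<a href="http://suchen.mobile.de/service/help.html">Help</a>';

// Group 1 captures every href that points below /fahrzeuge/.
$re = '/<a\b[^>]*\bhref="(https?:\/\/suchen\.mobile\.de\/fahrzeuge\/[^"]+)"/i';

preg_match_all($re, $html, $matches);

// Decode HTML entities such as &amp; so the URLs can be requested directly.
$carLinks = array_map('html_entity_decode', $matches[1]);
print_r($carLinks);
```

The second anchor is skipped because its path does not start with /fahrzeuge/.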
Using DOMDocument is more convenient:
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile($yourURL);
$imgNodes = $dom->getElementsByTagName('img');
$result = [];
foreach ($imgNodes as $imgNode) {
    $src = $imgNode->getAttribute('src');
    $host = parse_url($src, PHP_URL_HOST);
    $path = (string) parse_url($src, PHP_URL_PATH);
    // pathinfo() avoids passing the result of explode() to array_pop() by reference
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if ($ext === 'jpg' && $host === 'i.ebayimg.com') {
        $result[] = $src;
    }
}
print_r($result);
To get the "normal" links, use the same approach (DOMDocument + parse_url) on the a tags and their href attributes.
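A minimal sketch of that approach for the car links follows. It parses an inline, made-up HTML string with loadHTML so that it runs on its own; in the real script the input would be the fetched page. The /fahrzeuge/ path prefix is taken from the link in the question, and $links is a name chosen here.

```php
<?php
// Hypothetical markup; in practice this would be the fetched page source.
$html = '<html><body>'
      . '<a class="link--muted" href="http://suchen.mobile.de/fahrzeuge/details.html?id=1&amp;pageNumber=1">Car</a>'
      . '<a href="http://suchen.mobile.de/service/help.html">Help</a>'
      . '<a href="/relative/link">Relative</a>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);

$links = [];
foreach ($dom->getElementsByTagName('a') as $aNode) {
    $href = $aNode->getAttribute('href');   // entities such as &amp; are already decoded
    $host = parse_url($href, PHP_URL_HOST);
    $path = (string) parse_url($href, PHP_URL_PATH);
    // Keep only links on suchen.mobile.de whose path starts with /fahrzeuge/.
    if ($host === 'suchen.mobile.de' && strpos($path, '/fahrzeuge/') === 0) {
        $links[] = $href;
    }
}
print_r($links);
```

Because the parser decodes attribute values, no html_entity_decode step is needed here, and relative links drop out naturally since parse_url reports no host for them.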