Simple data scraping using a PHP loop/foreach
I have some code that grabs a string sitting between two other strings (a sandwich). It works, but I need to loop over several kinds of "sandwich" strings.
// needle in haystack
$result = 'sandwich: Today is a nice day.
sandwich: Today is a cloudy day.
sandwich: Today is a rainy day.
sandwich type 2: Yesterday I had an awesome time.
sandwich type 2: Yesterday I had an great time.';
$beginString = 'today is a';
$endString = 'day';
function extract_unit($haystack, $keyword1, $keyword2) {
    $return = array();
    $a = 0; // search offset, starting at the beginning of the haystack
    // stripos() ignores case, so 'today is a' also matches 'Today is a'
    while (($a = stripos($haystack, $keyword1, $a)) !== false) { // loop until no further match
        $a += strlen($keyword1); // move the offset past $keyword1
        if (($b = stripos($haystack, $keyword2, $a)) !== false) { // position of $keyword2, if found
            $return[] = trim(substr($haystack, $a, $b - $a)); // store the text in between
        }
    }
    return $return;
}
$text = $result;
$unit = extract_unit($text, $beginString, $endString);
print_r($unit);
// $unit contains: nice, cloudy, rainy
I need to loop over the different types of sentences/sandwiches and capture all of the adjectives (nice, cloudy, rainy, awesome, great):
//needle in haystack
$result = 'sandwich: Today is a nice day.
sandwich: Today is a cloudy day.
sandwich: Today is a rainy day.
sandwich type 2: Yesterday I had an awesome time.
sandwich type 2: Yesterday I had an great time.';
$beginString1 = 'today is a';
$endString1 = 'day';
$beginString2 = 'Yesterday I had an';
$endString2 = 'time';
[scraping code with loop...]
print_r($unit);
The goal is to end up with this array:
Array ( [0] => nice [1] => cloudy [2] => rainy [3] => awesome [4] => great )
Any ideas? Much appreciated.
You can use a regular expression to grab the strings. It is no problem if you use arrays instead of separate strings. Here is some sample code:
$starts = array('Today is a', 'Yesterday I had an');
$ends = array('day', 'time');
$haystack = array(
'Today is a nice day.',
'Today is a cloudy day.',
'Today is a rainy day.',
'Yesterday I had an awesome time.',
'Yesterday I had an great time.'
);
function extract_unit($haystack, $starts, $ends){
    $return = array();
    $reg = '/.*?(?:' . implode('|', $starts) . ')(.*?)(?:' . implode('|', $ends) . ').*/';
    foreach($haystack as $str){
        // replace the whole line with the captured group, then trim the surrounding spaces
        if(preg_match($reg, $str)) $return[] = trim(preg_replace($reg, '$1', $str));
    }
    return $return;
}
print_r (extract_unit($haystack, $starts, $ends));
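A related sketch, assuming the input stays as the single multi-line string from the question rather than an array of lines: preg_match_all can collect every capture in one pass. The function name extract_all is hypothetical. preg_quote is added so that delimiters containing regex metacharacters are matched literally, and the i flag is an assumption made because the question searches for 'today is a' in lowercase.

```php
<?php
// Hypothetical helper, not part of the original answer.
function extract_all(string $haystack, array $starts, array $ends): array
{
    // Escape each delimiter so characters such as "." or "?" are matched literally.
    $quote = function ($s) {
        return preg_quote($s, '/');
    };
    $startAlt = implode('|', array_map($quote, $starts));
    $endAlt = implode('|', array_map($quote, $ends));

    // "i" makes the match case-insensitive (an assumption, see above).
    // The lazy (.*?) stops at the first end delimiter; \s* keeps spaces out of the capture.
    $pattern = '/(?:' . $startAlt . ')\s*(.*?)\s*(?:' . $endAlt . ')/i';

    preg_match_all($pattern, $haystack, $matches);
    return $matches[1]; // contents of the first capture group for every match
}

$result = "sandwich: Today is a nice day.\n"
    . "sandwich: Today is a cloudy day.\n"
    . "sandwich: Today is a rainy day.\n"
    . "sandwich type 2: Yesterday I had an awesome time.\n"
    . "sandwich type 2: Yesterday I had an great time.";

print_r(extract_all($result, ['today is a', 'Yesterday I had an'], ['day', 'time']));
```

With the question's data this should give nice, cloudy, rainy, awesome, great. Note that the alternation lets any start delimiter pair with any end delimiter, which is fine here but may be too loose for other inputs.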
Edit
Following @ven's comment, I made some changes to the code; it is now more precise:
//---Array with all sandwiches
$between = array(
array('hay1=', 'hay=Gold'),
array('hay2=', 'hay=Silver')
);
$haystack = 'Data set 1: hay2= this is a bunch of hay hay1= Gold_Needle hay=Gold
Data Set 2: hay2=Silver_Needle hay=Silver';
function extract_unit($haystack, $between){
    $return = array();
    foreach($between as $item){
        $reg = '/.*?' . $item[0] . '\s*(.*?)\s*' . $item[1] . '.*?/';
        preg_match_all($reg, $haystack, $matches);
        $return = array_merge($return, $matches[1]);
    }
    return $return;
}
print_r (extract_unit($haystack, $between));
The result will be:
Array
(
[0] => Gold_Needle
[1] => Silver_Needle
)
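If a regular expression is not wanted, the loop the question asks for can also be sketched with plain string functions, keeping each start delimiter tied to its own end delimiter. This is a minimal sketch, not the answer's method: the names extract_between and extract_pairs are hypothetical, and stripos is used on the assumption that matching should ignore case.

```php
<?php
// Hypothetical helper: every substring found between $start and $end.
function extract_between(string $haystack, string $start, string $end): array
{
    $found = [];
    $offset = 0;
    // Compare with false explicitly, because 0 is a valid match position.
    while (($a = stripos($haystack, $start, $offset)) !== false) {
        $a += strlen($start); // first character after the start delimiter
        $b = stripos($haystack, $end, $a);
        if ($b === false) {
            break; // no closing delimiter left, so stop searching
        }
        $found[] = trim(substr($haystack, $a, $b - $a));
        $offset = $b + strlen($end); // continue after this sandwich
    }
    return $found;
}

// Hypothetical helper: run extract_between() for each [start, end] pair in turn.
function extract_pairs(string $haystack, array $pairs): array
{
    $all = [];
    foreach ($pairs as [$start, $end]) {
        $all = array_merge($all, extract_between($haystack, $start, $end));
    }
    return $all;
}

$result = "sandwich: Today is a nice day.\n"
    . "sandwich: Today is a cloudy day.\n"
    . "sandwich: Today is a rainy day.\n"
    . "sandwich type 2: Yesterday I had an awesome time.\n"
    . "sandwich type 2: Yesterday I had an great time.";

$pairs = [
    ['today is a', 'day'],
    ['Yesterday I had an', 'time'],
];

print_r(extract_pairs($result, $pairs));
```

For the question's data this should yield nice, cloudy, rainy, awesome, great, grouped by pair in the order the pairs are listed.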