PHP tag system with preg_match and foreach

Question

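I'm trying to build a tag system for my website. It checks a written article (perhaps 400-1000 words) for specific words taken from an array and builds a string containing every keyword it finds.

What I have works, but there are a few things I'd like to fix.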

$a = "This is my article and it's about apples and pears. I like strawberries as well though.";

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
foreach($targets as $t)
{
   if (preg_match("/\b" . $t . "\b/i", $a)) {
    $b[] = $t;
   }
}
echo $b[0].",".$b[1].",".$b[2].",".$b[3];
$tags = $b[0].",".$b[1].",".$b[2].",".$b[3];

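First, I'd like to know whether there is any way to make this more efficient. I have a database of about 5,000 keywords, and it grows every day.

As you can see, I don't know how to get ALL the matches. I'm writing out $b[0], $b[1], and so on.

I'd like it to produce a single string containing all the matches, but with each match listed only once. If apples is mentioned 5 times, it should appear in the string only once.

As I said, this works. But I don't feel it's the best solution.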

EDIT:

I'm now trying this, but I can't get it to work at all.

$a = "This is my article and it's about apples and pears. I like strawberries as well though.";

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
$targets = implode('|', $targets);
$b = [];
preg_match("/\b(" . $targets . ")\b/i", $a, $b);

echo $b;

Answer 1

preg_match already stores the matches. Its signature is:

int preg_match ( string $pattern , string $subject [, array &$matches [, int $flags = 0 [, int $offset = 0 ]]] )

The third parameter already receives the matches, so change this:

if (preg_match("/\b" . $t . "\b/i", $a)) {
    $b[] = $t;
}

to this:

// $b must be initialised to [] once, before the foreach loop
$matches = [];
preg_match("/\b" . $t . "\b/i", $a, $matches);
$b = array_merge($b, $matches);

However, if you are only comparing words directly, the documentation recommends strpos() instead.

Tip
Do not use preg_match() if you only want to check if one string is contained in another string. Use strpos() instead as it will be faster.
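
One caveat with that tip is worth illustrating: strpos() and stripos() look for substrings, not whole words, so they can report a tag that only occurs inside a longer word. The sketch below is a minimal illustration of the difference; the sample sentence and variable names are made up for the example.

```php
<?php
// Illustrative input: "apples" only occurs inside "crabapples".
$text = "My brother's favorites are crabapples and grapes.";

// Substring check: stripos() is case-insensitive and ignores word boundaries.
$substringHit = stripos($text, 'apples') !== false;

// Whole-word check: \b requires a word boundary on both sides of the tag.
$wholeWordHit = preg_match('/\bapples\b/i', $text) === 1;

var_dump($substringHit); // bool(true)
var_dump($wholeWordHit); // bool(false)
```

If false positives such as "crabapples" are acceptable, the substring check is the cheaper option; otherwise the word-boundary pattern is the safer one.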

EDIT

If you still want to use preg_match, you can improve the performance of your code by replacing this:

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
foreach($targets as $t)
{
   if (preg_match("/\b" . $t . "\b/i", $a)) {
    $b[] = $t;
   }
}

with this:

$targets = array('apple', 'apples','pear','pears','strawberry','strawberries','grape','grapes');
$targets = implode('|', $targets);

// use the imploded $targets string here, not the loop variable $t
preg_match("/\b(" . $targets . ")\b/i", $a, $matches);

Here you join all the $targets with | (pipe), so your regex looks like (target1|target2|target3|targetN). That way you run a single search instead of one search per foreach iteration.
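
Keep in mind that preg_match() stops at the first match, so even with a piped pattern it reports only one tag. The sketch below compares it with preg_match_all() on the article text from the question. It assumes tags may contain regex metacharacters, which is why each one goes through preg_quote(); the variable names $quoted, $first and $all are illustrative.

```php
<?php
$a = "This is my article and it's about apples and pears. I like strawberries as well though.";
$targets = ['apple', 'apples', 'pear', 'pears', 'strawberry', 'strawberries', 'grape', 'grapes'];

// Escape each tag so characters such as "." or "+" are matched literally.
$quoted = array_map(fn($t) => preg_quote($t, '/'), $targets);
$pattern = '/\b(?:' . implode('|', $quoted) . ')\b/i';

// preg_match() stops after the first hit.
preg_match($pattern, $a, $first);

// preg_match_all() collects every hit; $all[0] holds the full matches.
preg_match_all($pattern, $a, $all);

echo $first[0], "\n";              // apples
echo implode(',', $all[0]), "\n";  // apples,pears,strawberries
```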

Answer 2

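First I'd like to offer a non-regex approach, then I'll go into some lengthy regex considerations.

Because your search "needles" are whole words, you can leverage the magic of str_word_count() like this:

Code: (Demo)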

$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];  // all lowercase
$input="Apples, pears, and strawberries are delicious. I probably favor the flavor of strawberries most. My brother's favorites are crabapples and grapes.";
$lowercase_input=strtolower($input);                      // eliminate case-sensitive issue
$words=str_word_count($lowercase_input,1);                // split into array of words, permitting: ' and -
$unique_words=array_flip(array_flip($words));             // faster than array_unique()
$targeted_words=array_intersect($targets,$unique_words);  // retain matches
$tags=implode(',',$targeted_words);                       // glue together with commas
echo $tags;

echo "\n\n";
// or as a one-liner
echo implode(',',array_intersect($targets,array_flip(array_flip(str_word_count(strtolower($input),1)))));

Output:

apples,pears,strawberries,grapes

apples,pears,strawberries,grapes

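Now, about regex...

While matiaslauriti's answer may get you the correct results, it makes little attempt to deliver any real efficiency gain.

I'll make two points:

Do NOT use preg_match() in a loop, because preg_match_all() is specifically designed to capture multiple occurrences in a single call. (Code is provided later in this answer.)
Condense your pattern logic as much as possible...

Suppose you have input like this: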

$input="Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";

If you use this array of tags:

$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];

to generate a simple piped regex pattern like:

/\b(?:apple|apples|pear|pears|strawberry|strawberries|grape|grapes)\b/i

the regex engine needs 677 steps to match all of the fruits in $input. (Demo)

By contrast, if you condense the tag elements with the ? quantifier, like this:

\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b

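your pattern becomes both shorter and more efficient, giving the same expected result in just 501 steps. (Demo)

This condensed pattern can be generated programmatically for simple associations (including pluralization and verb conjugation).

Here is one way to handle singular/plural relationships: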

foreach($targets as $v){
    if(substr($v,-1)=='s'){                       // if tag ends in 's'
        if(in_array(substr($v,0,-1),$targets)){   // if same words without trailing 's' exists in tag list
            $condensed_targets[]=$v.'?';          // add '?' quantifier to end of tag
        }else{
            $condensed_targets[]=$v;              // add tag that is not plural (e.g. 'dress')
        }
    }elseif(!in_array($v.'s',$targets)){          // if tag doesn't end in 's' and no regular plural form
            $condensed_targets[]=$v;              // add tag with irregular pluralization (e.g. 'strawberry')
    }
}
echo '/\b(?:',implode('|',$condensed_targets),")\b/i\n";
// /\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i

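This technique only handles the simplest cases. You can really ramp up performance by scrutinizing the tag list, identifying related tags, and condensing them.

Running my method above to condense the piped pattern on every page load will cost your users load time. My very strong recommendation is to keep a database table of your ever-growing tags, stored as regex-ready tags. When new tags are encountered/generated, add them individually to the table automatically. You should periodically review the ~5000 keywords and look for tags that can be merged without losing accuracy.

It may even help you maintain the table logic if you have one column for the regex pattern and another column holding a csv of the tags that the row's pattern covers: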

---------------------------------------------------------------
|  Pattern               |   Tags                             |
---------------------------------------------------------------
|  apples?               |  apple,apples                      |
---------------------------------------------------------------
|  walk(?:s|er|ed|ing)?  |  walk,walks,walker,walked,walking  |
---------------------------------------------------------------
|  strawberry            |  strawberry                        |
---------------------------------------------------------------
|  strawberries          |  strawberries                      |
---------------------------------------------------------------

To improve efficiency, you can update the table data by merging the strawberry and strawberries rows like this:

---------------------------------------------------------------
|  strawberr(?:y|ies)    |  strawberry,strawberries           |
---------------------------------------------------------------

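With such a simple improvement, if you only check $input for these two tags, the number of steps required drops from 59 to 40.

Because you are dealing with more than 5000 tags, the performance gain will be significant. This kind of refinement is best handled at a human level, but you could use some programmatic techniques to identify tags that share an inner substring.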
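
One possible starting point for that programmatic pass is to bucket tags by a shared prefix and review each bucket by hand. The sketch below is an assumption-laden simplification: the helper name groupByPrefix() and the prefix length of 4 are hypothetical, and it looks only at leading characters rather than arbitrary inner substrings.

```php
<?php
// Hypothetical helper: bucket tags by their first $length characters so that
// related tags (e.g. strawberry / strawberries) surface for manual review.
function groupByPrefix(array $tags, int $length): array
{
    $groups = [];
    foreach ($tags as $tag) {
        $groups[substr($tag, 0, $length)][] = $tag;
    }
    // Keep only prefixes shared by more than one tag.
    return array_filter($groups, fn($g) => count($g) > 1);
}

// 'walk' is added only to show a tag with no merge candidate.
$targets = ['apple', 'apples', 'pear', 'pears', 'strawberry', 'strawberries', 'grape', 'grapes', 'walk'];
$candidates = groupByPrefix($targets, 4);
print_r($candidates);
```

Each bucket is only a candidate for merging; whether two tags can share one pattern without losing accuracy still needs a human decision.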

When you want to use your pattern column values, just pull them from your database, pipe them together, and place them inside preg_match_all().
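
A minimal sketch of that assembly step follows. The $rows array is a stand-in for the result of a database query, and the column names pattern and tags are assumptions based on the example table; the input sentence is shortened for the illustration.

```php
<?php
// Stand-in for rows fetched from the hypothetical pattern table
// (in practice these would come from a database query).
$rows = [
    ['pattern' => 'apples?',            'tags' => 'apple,apples'],
    ['pattern' => 'pears?',             'tags' => 'pear,pears'],
    ['pattern' => 'strawberr(?:y|ies)', 'tags' => 'strawberry,strawberries'],
    ['pattern' => 'grapes?',            'tags' => 'grape,grapes'],
];

$input = "Today I ate an apple, then a pear, then a strawberry. I like strawberries as well though.";

// Pipe the stored patterns together inside one non-capturing group.
$pattern = '/\b(?:' . implode('|', array_column($rows, 'pattern')) . ')\b/i';

// One call collects every match; fall back to an empty array when nothing is found.
$found = preg_match_all($pattern, $input, $out) ? $out[0] : [];

echo $pattern, "\n";             // /\b(?:apples?|pears?|strawberr(?:y|ies)|grapes?)\b/i
echo implode(',', $found), "\n"; // apple,pear,strawberry,strawberries
```

Because the stored patterns are inserted into the regex unescaped, this only makes sense when the table is maintained by trusted editors.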

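*Remember to use non-capturing groups when condensing tags into a single pattern, because the code that follows reduces memory usage by avoiding capture groups.

Code (Demo Link):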

$input="Today I ate an apple, then a pear, then a strawberry. This is my article and it's about apples and pears. I like strawberries as well though.";
$targets=['apple','apples','pear','pears','strawberry','strawberries','grape','grapes'];
//echo '/\b(?:',implode('|',$targets),")\b/i\n";

// condense singulars & plurals forms using ? quantifier
foreach($targets as $v){
    if(substr($v,-1)=='s'){                       // if tag ends in 's'
        if(in_array(substr($v,0,-1),$targets)){   // if same words without trailing 's' exists in tag list
            $condensed_targets[]=$v.'?';          // add '?' quantifier to end of tag
        }else{
            $condensed_targets[]=$v;              // add tag that is not plural (e.g. 'dress')
        }
    }elseif(!in_array($v.'s',$targets)){          // if tag doesn't end in 's' and no regular plural form
            $condensed_targets[]=$v;              // add tag with irregular pluralization (e.g. 'strawberry')
    }
}
echo '/\b(?:',implode('|',$condensed_targets),")\b/i\n\n";

// use preg_match_all and call it just once without looping!
$tags=preg_match_all("/\b(?:".implode('|',$condensed_targets).")\b/i",$input,$out)?$out[0]:null;
echo "Found tags: ";
var_export($tags);

Output:

/\b(?:apples?|pears?|strawberry|strawberries|grapes?)\b/i

Found tags: array ( 0 => 'apple', 1 => 'pear', 2 => 'strawberry', 3 => 'apples', 4 => 'pears', 5 => 'strawberries', )
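
The question also asks for a single string in which each tag appears only once. preg_match_all() returns every occurrence, so repeated words would show up repeatedly. The sketch below is one way to collapse the result; the sample array copies the output above and adds two repeats ('Apples', 'apple') purely for illustration.

```php
<?php
// Matches as returned by preg_match_all(), plus two illustrative repeats.
$tags = ['apple', 'pear', 'strawberry', 'apples', 'pears', 'strawberries', 'Apples', 'apple'];

// Lowercase first so "Apples" and "apples" count as one tag, then drop repeats.
$unique = array_unique(array_map('strtolower', $tags));

// array_unique() preserves the original keys, so re-index before further use.
$unique = array_values($unique);

$tagString = implode(',', $unique);
echo $tagString, "\n"; // apple,pear,strawberry,apples,pears,strawberries
```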

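...If you've made it all the way through my post, you have probably run into a problem like the OP's and want to move forward without regrets/mistakes. Please see my related Code Review post for more on edge-case considerations and method logic.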

Tags: php, regex, word-boundary, preg-match-all, keyword-search