Is it possible to interpolate Array values in token?

Question

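I'm working on a homoglyphs module, and I have to build a regular expression that can find homoglyphed text corresponding to an ASCII equivalent.

For example, I have a character with no homoglyph alternatives: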

my $f = 'f';

and a character that can be confused:

my @o = 'o', 'о', 'ο'; # ASCII o, Cyrillic o, Greek omicron

I can easily build a regex to detect the homoglyphed phrase 'foo':

say 'Suspicious!' if $text ~~ / $f @o @o /;

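But how should I compose such a regex if I do not know the value to detect at compile time? Say I want to detect phishing that contains the homoglyphed word 'cash' in messages. I can build a sequence with all the alternatives: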

my @lookup = ['c', 'с', 'ϲ', 'ς'], ['a', 'а', 'α'], 's', 'h'; # arbitrary runtime length

Now obviously the following solution cannot "unpack" the array elements into the regex:

/ @lookup / # doing LTM, not searching elements in sequence

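I can work around this by manually quoting each element and composing a textual representation of the alternatives, which gives a string that can be evaluated as a regex. I can then build a token from it using string interpolation: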

my $regexp-ish = textualize( @lookup ); # string "[ 'c' | 'с' | 'ϲ' | 'ς' ] [ 'a' | 'а' | 'α' ] 's' 'h'"
my $token = token { <$regexp-ish> }

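But that is quite error-prone. Is there a cleaner way to compose a regex dynamically from an arbitrary number of elements that are not known at compile time?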

Answer 1

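I'm not sure this is the best approach to use.

I haven't implemented confusables¹ in the Intl:: modules yet, though I do plan on getting around to it eventually. Here are two different ways I could imagine a token looking.²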

my token confusable($source) {
  :my $i = 0;                                    # create a counter var
  [
    <?{                                          # succeed only if
      my $a = self.orig.substr: self.pos+$i, 1;  #   the test character A
      my $b = $source.substr: $i++, 1;           #   the source character B and

      so $a eq $b                                #   are the same or
      || $a eq %*confusables{$b}.any;            #   the A is one of B's confusables
    }> 
    .                                            # because we succeeded, consume a char
  ] ** {$source.chars}                           # repeat for each grapheme in the source
}

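Here I've used the dynamic hash %*confusables, which would be populated in some way; how depends on your module, and it need not even be dynamic (for example, the token could have the signature :($source, %confusables), or it could reference a module variable).

You can then have your code work as follows: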

say $foo ~~ /<confusable: 'foo'>/

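This is probably the best way to go about it, as it gives you a lot more control. I took a look at your module, and it's clear that you want to enable 2-to-1 glyph relationships, so eventually you'll probably want to run code directly over the characters.

If you are okay with only 1-to-1 relationships, you can use a much simpler token: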

my token confusable($source) {
  :my @chars = $source.comb;            # split the source 
  @(                                    # match the array based on 
     |(                                 #   a slip of
        %confusables{@chars.head}       #     the confusables 
        // Empty                        #     (or nothing, if none)
     ),                                 #
     @chars.shift                       #   and the char itself
   )                                    #
   ** {$source.chars}                   # repeating for each source char
}

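The @(…) construct lets you effectively create an ad hoc array to be interpolated. In this case, we just slip in the confusables alongside the original character, and that's that. You do have to be careful, though, because a non-existent hash item returns the type object (Any), and that messes things up (hence the // Empty).

In either case, you'll want to use arguments with your token, as constructing regexes on the fly is fraught with potential gotchas and interpolation errors.

¹ Unicode calls homoglyphs both "visually similar characters" and "confusables".

² The hash %confusables here could be populated in any number of ways, and it does not necessarily need to be dynamic, as it could be populated via the arguments (using a signature like :($source, %confusables)) or by referencing a module variable.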
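As a minimal sketch of what the slip and the // Empty do outside of a regex, assuming a hypothetical one-entry %confusables table (the variable names @known and @unknown are made up for illustration):

```raku
# Hypothetical one-entry table, for illustration only.
my %confusables = 'o' => ['о', 'ο'];   # Cyrillic o, Greek omicron

# A key that exists: its confusables are slipped in before the char itself.
my @known   = |(%confusables<o> // Empty), 'o';

# A key that does not exist: the lookup yields Any, so // Empty
# replaces it with an empty slip and only the char itself remains.
my @unknown = |(%confusables<f> // Empty), 'f';

say @known.elems;     # 3
say @unknown.elems;   # 1
```

Without the // Empty, the second list would also carry the undefined value from the failed lookup, which is not something you want interpolated into a regex.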

Answer 2

The Unicode::Security module implements confusables by using the Unicode consortium tables. It doesn't actually use regular expressions; it just looks up the different characters in those tables.
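The table-lookup idea can be sketched without any regex at all. The following is not the Unicode::Security API; it is only an illustration under assumptions: the table %ascii-for is a tiny hand-written stand-in for the Unicode data, and the subs skeleton and suspicious are hypothetical names.

```raku
# Hypothetical, hand-written table mapping look-alikes to ASCII;
# a real module would load the Unicode confusables data instead.
my %ascii-for =
    'с' => 'c', 'ϲ' => 'c', 'ς' => 'c',
    'а' => 'a', 'α' => 'a',
    'о' => 'o', 'ο' => 'o';

# Replace every character that has an entry in the table,
# and leave all other characters untouched.
sub skeleton(Str $text --> Str) {
    $text.comb.map(-> $ch { %ascii-for{$ch} // $ch }).join
}

# The word only shows up after normalisation, so the text
# must contain a look-alike spelling of it.
sub suspicious(Str $text, Str $word --> Bool) {
    skeleton($text).contains($word) && !$text.contains($word)
}

say skeleton('fоο');                        # foo
say suspicious('send сαsh now', 'cash');    # True
say suspicious('send cash now', 'cash');    # False
```

Because the word to look for is an ordinary argument, nothing has to be known at compile time and no regex source has to be assembled from strings.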
