A stable regular expression or simple library for multi-lingual tokenization?

Question

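Our product needs search, and so far it has been used mostly in English. Because of that, tokenizing on whitespace has worked reasonably well (even though it is not always the best idea).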

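We recently started expanding into the Japanese market and ran into some complications. Japanese poses two key problems: 1) wordsCanBeStrungTogetherWithoutSpaces 2) Japanese uses different punctuation symbols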

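We have a workaround for (1), but a "word" that is several hundred characters long causes some complications, so solving (2) would be ideal. Strictly speaking I am trying to solve the Japanese problem, but what I really want is a way to split sentences that at least works regardless of alphabet. Is there a regular expression suited to splitting based on Unicode ranges? Or does it need to be custom and cover every language individually?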

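A quick search turned up https://unicodelookup.com/#full%20stop/1 and, as far as I can tell, there is no pattern to the various "full stop" characters. There are not many of them, though, so I could build something that matches just those. What worries me is the edge cases I don't know that I don't know about.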
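To make the "build something that matches just those" idea concrete, here is a minimal sketch that asks Python's standard `unicodedata` module for every punctuation character whose official name contains "FULL STOP", then splits on them. The function names `find_punctuation_by_name` and `split_sentences` are made up for illustration, and the result depends on the Unicode version bundled with the interpreter. It only covers full stops, so question marks, exclamation marks and decimal points such as "3.14" would need separate handling.

```python
import re
import sys
import unicodedata


def find_punctuation_by_name(fragment):
    """Return {char: name} for punctuation whose Unicode name contains fragment."""
    found = {}
    for code_point in range(sys.maxunicode + 1):
        ch = chr(code_point)
        name = unicodedata.name(ch, "")  # unnamed code points yield ""
        # Keep only general categories P* so that enclosed numbers such as
        # DIGIT ONE FULL STOP (category No) are not treated as terminators.
        if fragment in name and unicodedata.category(ch).startswith("P"):
            found[ch] = name
    return found


# Every "full stop" style terminator known to this Python's Unicode tables.
FULL_STOPS = find_punctuation_by_name("FULL STOP")

# A character class built from the characters found above.
_SPLIT_RE = re.compile("[" + "".join(re.escape(c) for c in FULL_STOPS) + "]+")


def split_sentences(text):
    """Split text on any full-stop character, dropping empty pieces."""
    return [part.strip() for part in _SPLIT_RE.split(text) if part.strip()]


if __name__ == "__main__":
    for ch, name in FULL_STOPS.items():
        print(f"U+{ord(ch):04X} {name}")
    print(split_sentences("今日は晴れです。明日は雨です。"))
```

Printing `FULL_STOPS` is a quick way to see which terminators exist before deciding whether a hand-written list is good enough.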

Answer 1

Try something like this to start.
The word is in group 1.

[^\pL\pN]*([\pL\pN](?:[\pL\pN_-]|(?![?.!])\pP(?=[\pL\pN\pP]))*)(?<!\pP)

https://regex101.com/r/YEgUQ3/1

Explained

 # Unicode

 [^\pL\pN]*                    # Strip non-letters/numbers               
 (                             # (1 start)
      [\pL\pN]                      # First letter/number
      (?:                           # Word body
            [\pL\pN_-]                    # Letter/number, '_' or '-'
        |                              # or,
           (?! [?.!] )                   # ( Not Special word ending punctuation, Add more here )
           \pP                           # Punctuation
           (?= [\pL\pN\pP] )             #   if followed by punctuation/letter/number
      )*                            # Do many times
 )                             # (1 end)
 (?<! \pP )                    # Don't end on a punctuation

Answer 2

It looks like Unicode categories were actually designed for this. The following regular expression seems to work fine:

[\p{L}\p{Nd}]+

https://regex101.com/r/YEgUQ3/2

And a brief explanation:

\p{L} matches any kind of letter from any language
\p{Nd} matches a digit zero through nine in any script except ideographic scripts

Apparently "letter" here means anything that is strictly not punctuation. Ideographic digits seem to be treated as just letters.
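For anyone who wants to try this outside regex101: Python's built-in `re` module does not understand `\p{...}`, so the sketch below approximates `[\p{L}\p{Nd}]+` with the standard `unicodedata` module instead of a regex. The names `is_token_char` and `tokenize` are made up for illustration. Note that a run of Japanese text with no punctuation still comes back as one long token, and scripts that rely on combining marks (category M, which `\p{L}` does not include) would be split in the middle of words.

```python
import unicodedata
from itertools import groupby
from typing import List


def is_token_char(ch: str) -> bool:
    """True for letters (L*) and decimal digits (Nd), like [\\p{L}\\p{Nd}]."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"


def tokenize(text: str) -> List[str]:
    """Return maximal runs of letter/digit characters, like [\\p{L}\\p{Nd}]+."""
    return [
        "".join(run)
        for is_token, run in groupby(text, key=is_token_char)
        if is_token
    ]


if __name__ == "__main__":
    print(tokenize("Hello, world! 42"))
    # ['Hello', 'world', '42']
    print(tokenize("これはペンです。東京に行く！"))
    # ['これはペンです', '東京に行く']
```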

regex

nlp

tokenize