简化解析的正则表达式

Simplify parsed regex

我必须简化解析为特定数据类型的自定义正则表达式。 “简化”是指以下(强调我的):

Given the rules:

  • lowercase letters match themselves, eg.: a matches a and nothing else
  • parens enclosing only letters match their full sequence, eg.: (abc) matches abc and nothing else
  • square brackets enclosing only letters match every letters inside, eg.: [abc] matches a and b and c and nothing else

The following are all valid:

  • (a[bc]) matches ab and ac and nothing else
  • [a(bc)] matches a and bc and nothing else
  • (a(bc)) is the same as (abc) and matches abc and nothing else
  • [a[bc]] is the same as [abc] and matches a and b and c and nothing else

Regexes can be simplified. For example [a[[bb]b[[b]]](c)(d)] is really just the same as [abcd] which matches a, b, c and d.

我在 Haskell 中使用 attoparsec 和以下目标数据类型实现了一个简单的解析器组合器:

data Regex
  = Symbol Char
  | Concat [Regex] -- ()
  | Union [Regex] -- []
  deriving (Eq)

但是,我真的很难处理简化部分。我尝试通过展开它们的组合来减少 Concats 和 Unions,nubbing and concatMapping 无济于事。我认为我定义的数据类型可能不是最合适的,但我有 运行 的想法(深夜在这里)。你能帮我看看正确的方向吗?谢谢!

simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs

对于初学者,您缺少一些简单的改进。 simplify (Concat [x]) = x 对于 Union 也是如此:包装的正则表达式不需要专门是一个符号。

然后您需要开始查看包含其他 ConcatConcat,对于 Union 也是如此。当然,您首先会简化包装列表的元素,但在将结果塞回包装器之前,您会使用相同的包装器提取任何元素。类似于:

simplify (Concat xs) = 
  case concatMap liftConcats (map simplify xs) of
    [x] -> x
    xs -> Concat xs
  where liftConcats :: Regex -> [Regex]
        liftConcats r = _exerciseForTheReader

然后你可以为 Union 做类似的事情,同时加入 nub