简化解析的正则表达式

Question

我必须简化解析为特定数据类型的自定义正则表达式。 “简化”是指以下（强调我的）：

Given the rules:

lowercase letters match themselves, eg.: a matches a and nothing else

parens enclosing only letters match their full sequence, eg.: (abc) matches abc and nothing else

square brackets enclosing only letters match every letters inside, eg.: [abc] matches a and b and c and nothing else

The following are all valid:

(a[bc]) matches ab and ac and nothing else

[a(bc)] matches a and bc and nothing else

(a(bc)) is the same as (abc) and matches abc and nothing else

[a[bc]] is the same as [abc] and matches a and b and c and nothing else

Regexes can be simplified. For example [a[[bb]b[[b]]](c)(d)] is really just the same as [abcd] which matches a, b, c and d.

我在 Haskell 中使用 attoparsec 和以下目标数据类型实现了一个简单的解析器组合器：

data Regex
  = Symbol Char
  | Concat [Regex] -- ()
  | Union [Regex] -- []
  deriving (Eq)

但是，我真的很难处理简化部分。我尝试通过展开它们的组合来减少 Concats 和 Unions，nubbing and concatMapping 无济于事。我认为我定义的数据类型可能不是最合适的，但我有运行的想法（深夜在这里）。你能帮我看看正确的方向吗？谢谢！

simplify :: Regex -> Regex
simplify (Symbol s) = Symbol s
simplify (Concat [Symbol c]) = Symbol c
simplify (Concat rs) = Concat $ simplify <$> rs
simplify (Union [Symbol c]) = Symbol c
simplify (Union rs) = Union $ nub $ simplify <$> rs

Answer 1

对于初学者，您缺少一些简单的改进。 simplify (Concat [x]) = x 对于 Union 也是如此：包装的正则表达式不需要专门是一个符号。

然后您需要开始查看包含其他 Concat 的 Concat，对于 Union 也是如此。当然，您首先会简化包装列表的元素，但在将结果塞回包装器之前，您会使用相同的包装器提取任何元素。类似于：

simplify (Concat xs) = 
  case concatMap liftConcats (map simplify xs) of
    [x] -> x
    xs -> Concat xs
  where liftConcats :: Regex -> [Regex]
        liftConcats r = _exerciseForTheReader

然后你可以为 Union 做类似的事情，同时加入 nub。

简化解析的正则表达式

Simplify parsed regex

regex

haskell

graph-theory

abstract-syntax-tree