AST 的最佳 ADT 表示

Best ADT representation of AST

对于我试图表示为 Haskell ADT 的表达式,我有以下语法:

Expr = SimpleExpr [OPrelation SimpleExpr]  
SimpleExpr = [OPunary] Term {OPadd Term}  
Term = Factor {OPmult Factor}  

其中:

{} means 0 or more
[] means 0 or 1
OPmult, OPadd, OPrelation, OPunary are classes of operators

请注意,此语法确实具有优先权。

这是我尝试过的东西:

data Expr  = Expr SimpleExpr (Maybe OPrelation) (Maybe SimpleExpr)
data SimpleExpr = SimpleExpr (Maybe OPunary) Term [OPadd] [Term]
data Term = Term Factor [OPmult] [Factor]

事后看来,我认为这很糟糕,尤其是 [OPadd] [Term][OPmult] [Factor] 部分。因为,例如,在 1+2+3 的解析树中,它会将 [+, +] 放在一个分支中,而 [2, 3] 在另一个中,这意味着它们是解耦的。

什么是可以在下一阶段的编译中很好地发挥作用的良好表现形式?

最后,我假设在解析之后我必须传递解析树并将其简化为 AST?还是应该修改整个语法以使其不那么复杂?或者它是否足够抽象?

AST 不需要与语法那么接近。语法被构造成多个级别以编码优先级并使用重复来避免左递归,同时仍然能够正确处理左结合运算符。 AST 不需要担心这些事情。

相反,我会这样定义 AST:

data Expr = BinaryOperation BinaryOperator Expr Expr
          | UnaryOperation UnaryOperator Expr
          | Literal LiteralValue
          | Variable Id
data BinaryOperator = Add | Sub | Mul | Div
data UnaryOperator = Not | Negate

这是一个可能对您有帮助的补充答案。我不想破坏你的乐趣,所以这里有一个非常简单的示例语法:

-- Expr = Term ['+' Term]
-- Term = Factor ['*' Factor]
-- Factor = number | '(' Expr ')'
-- number = one or more digits

使用 CST

作为一种方法,我们可以将此语法表示为具体语法树 (CST):

data Expr = TermE Term | PlusE Term Term            deriving (Show)
data Term = FactorT Factor | TimesT Factor Factor   deriving (Show)
data Factor = NumberF Int | ParenF Expr             deriving (Show)

将具体语法转换为 CST 的基于秒差距的解析器可能如下所示:

expr :: Parser Expr
expr = do
  t1 <- term
  (PlusE t1 <$ symbol "+" <*> term)
    <|> pure (TermE t1)

term :: Parser Term
term = do
  f1 <- factor
  (TimesT f1 <$ symbol "*" <*> factor)
    <|> pure (FactorT f1)

factor :: Parser Factor
factor = NumberF . read <$> lexeme (many1 (satisfy isDigit))
    <|> ParenF <$> between (symbol "(") (symbol ")") expr

具有用于空白处理的辅助函数:

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string

和主要入口点:

parseExpr :: String -> Expr
parseExpr pgm = case parse (spaces *> expr) "(string)" pgm of
  Right e -> e
  Left err -> error $ show err

之后我们可以 运行:

> parseExpr "1+1*(3+4)"
PlusE (FactorT (Number 1)) (TimesT (Number 1) (ParenF (PlusE
(FactorT (Number 3)) (FactorT (Number 4)))))
>

将其转换为以下 AST:

data AExpr -- Abstract Expression
  = NumberA Int
  | PlusA AExpr AExpr
  | TimesA AExpr AExpr

我们可以这样写:

aexpr :: Expr -> AExpr
aexpr (TermE t) = aterm t
aexpr (PlusE t1 t2) = PlusA (aterm t1) (aterm t2)

aterm :: Term -> AExpr
aterm (FactorT f) = afactor f
aterm (TimesT f1 f2) = TimesA (afactor f1) (afactor f2)

afactor :: Factor -> AExpr
afactor (NumberF n) = NumberA n
afactor (ParenF e) = aexpr e

要解释 AST,我们可以使用:

interp :: AExpr -> Int
interp (NumberA n) = n
interp (PlusA e1 e2) = interp e1 + interp e2
interp (TimesA e1 e2) = interp e1 * interp e2

然后写:

calc :: String -> Int
calc = interp . aexpr . parseExpr

之后我们就有了一个简陋的小计算器:

> calc "1 + 2 * (6 + 3)"
19
>

跳过 CST

作为另一种方法,我们可以将解析器替换为将 直接 解析为 AExpr:

类型的 AST 的解析器
expr :: Parser AExpr
expr = do
  t1 <- term
  (PlusA t1 <$ symbol "+" <*> term)
    <|> pure t1

term :: Parser AExpr
term = do
  f1 <- factor
  (TimesA f1 <$ symbol "*" <*> factor)
    <|> pure f1

factor :: Parser AExpr
factor = NumberA . read <$> lexeme (many1 (satisfy isDigit))
    <|> between (symbol "(") (symbol ")") expr

您可以看到这些解析器的结构变化有多大。消失的只是 type 级别的表达式、项和因子之间的区别,以及像 TermEFactorTParenF 这样的构造函数唯一的目的是允许这些类型相互嵌入。

在更复杂的情况下,CST 和 AST 解析器可能会表现出更大的差异。 (例如,在允许 1 + 2 + 3 的语法中,这可能在 CST 中表示为单个构造函数 data Expr = ... | PlusE [Term] | ...,但在相同的 [=25] 中具有一系列嵌套的二进制 PlusA 构造函数=] AST 类型如上。)

parseExpr 重新定义为 return 和 AExpr 并从 calc 中删除 aexpr 步骤后,其他一切都保持不变,我们仍然有:

> calc "1 + 2 * (6 + 3)"
19
>

参考程序

这是使用中间 CST 的完整程序:

-- Calc1.hs, using a CST

{-# OPTIONS_GHC -Wall #-}

module Calc1 where

import Data.Char
import Text.Parsec
import Text.Parsec.String

data Expr = TermE Term | PlusE Term Term            deriving (Show)
data Term = FactorT Factor | TimesT Factor Factor   deriving (Show)
data Factor = NumberF Int | ParenF Expr             deriving (Show)

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string

expr :: Parser Expr
expr = do
  t1 <- term
  (PlusE t1 <$ symbol "+" <*> term)
    <|> pure (TermE t1)

term :: Parser Term
term = do
  f1 <- factor
  (TimesT f1 <$ symbol "*" <*> factor)
    <|> pure (FactorT f1)

factor :: Parser Factor
factor = NumberF . read <$> lexeme (many1 (satisfy isDigit))
    <|> ParenF <$> between (symbol "(") (symbol ")") expr

parseExpr :: String -> Expr
parseExpr pgm = case parse (spaces *> expr) "(string)" pgm of
  Right e -> e
  Left err -> error $ show err

data AExpr -- Abstract Expression
  = NumberA Int
  | PlusA AExpr AExpr
  | TimesA AExpr AExpr

aexpr :: Expr -> AExpr
aexpr (TermE t) = aterm t
aexpr (PlusE t1 t2) = PlusA (aterm t1) (aterm t2)

aterm :: Term -> AExpr
aterm (FactorT f) = afactor f
aterm (TimesT f1 f2) = TimesA (afactor f1) (afactor f2)

afactor :: Factor -> AExpr
afactor (NumberF n) = NumberA n
afactor (ParenF e) = aexpr e

interp :: AExpr -> Int
interp (NumberA n) = n
interp (PlusA e1 e2) = interp e1 + interp e2
interp (TimesA e1 e2) = interp e1 * interp e2

calc :: String -> Int
calc = interp . aexpr . parseExpr

下面是跳过显式 CST 表示的更传统解决方案的完整程序:

-- Calc2.hs, with direct parsing to AST

{-# OPTIONS_GHC -Wall #-}

module Calc where

import Data.Char
import Text.Parsec
import Text.Parsec.String

lexeme :: Parser a -> Parser a
lexeme p = p <* spaces

symbol :: String -> Parser String
symbol = lexeme . string

expr :: Parser AExpr
expr = do
  t1 <- term
  (PlusA t1 <$ symbol "+" <*> term)
    <|> pure t1

term :: Parser AExpr
term = do
  f1 <- factor
  (TimesA f1 <$ symbol "*" <*> factor)
    <|> pure f1

factor :: Parser AExpr
factor = NumberA . read <$> lexeme (many1 (satisfy isDigit))
    <|> between (symbol "(") (symbol ")") expr

parseExpr :: String -> AExpr
parseExpr pgm = case parse (spaces *> expr) "(string)" pgm of
  Right e -> e
  Left err -> error $ show err

data AExpr -- Abstract Expression
  = NumberA Int
  | PlusA AExpr AExpr
  | TimesA AExpr AExpr

interp :: AExpr -> Int
interp (NumberA n) = n
interp (PlusA e1 e2) = interp e1 + interp e2
interp (TimesA e1 e2) = interp e1 * interp e2

calc :: String -> Int
calc = interp . parseExpr

好的 非常好。受 sepp2k 的回复启发,我是这样做的(没有 CST):

AST:

data OP = OPplus | OPminus | OPstar | OPdiv 
    | OPidiv | OPmod |  OPand | OPeq | OPneq 
    | OPless | OPgreater | OPle | OPge 
    | OPin | OPor

data Expr = 
      Relation Expr OP Expr -- > < == >= etc..
    | Unary OP Expr  -- + -
    | Mult Expr OP Expr  -- * / div mod and
    | Add Expr OP Expr  -- + - or
    | FactorInt Int | FactorReal Double 
    | FactorStr String 
    | FactorTrue | FactorFalse 
    | FactorNil 
    | FactorDesig Designator  -- identifiers
    | FactorNot Expr
    | FactorFuncCall FuncCall deriving (Show)

解析器:

parseExpr :: Parser Expr
parseExpr = (try $ Relation <$> 
        parseSimpleExpr <*> parseOPrelation <*> parseSimpleExpr)
    <|> parseSimpleExpr

parseSimpleExpr :: Parser Expr
parseSimpleExpr = (try simpleAdd)
    <|> (try $ Unary <$> parseOPunary <*> simpleAdd)
    <|> (try $ Unary <$> parseOPunary <*> parseSimpleExpr)
    <|> parseTerm
    where simpleAdd = Add <$> parseTerm <*> parseOPadd <*> parseSimpleExpr

parseTerm :: Parser Expr
parseTerm = (try $ Mult <$> 
        parseFactor <*> parseOPmult <*> parseTerm)
    <|> parseFactor

parseFactor :: Parser Expr
parseFactor = 
        (parseKWnot >> FactorNot <$> parseFactor)
    <|> (exactTok "true"  >> return FactorTrue)
    <|> (exactTok "false" >> return FactorFalse) 
    <|> (parseNumber)
    <|> (FactorStr <$> parseString)
    <|> (betweenCharTok '(' ')' parseExpr)
    <|> (FactorDesig <$> parseDesignator)
    <|> (FactorFuncCall <$> parseFuncCall)

我没有包括像 parseOPadd 这样的基本解析器,因为它们是您所期望的并且很容易构建。

我仍然根据语法进行解析,但稍微调整了一下以匹配我的 AST。

您可以查看完整的源代码,它是 Pascal 的编译器 here