This post is more or less a literate Haskell program: I wrote the code before the prose and don't want to throw anything away, so here is some preamble:

```
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE DeriveFunctor #-}
module Bidi where
import Control.Monad (ap)
import Data.Char
import Text.Read (readMaybe)
```

Consider the problem of parsing and pretty-printing arithmetic expressions. More specifically, given the following datatype:

```
data Exp
  = Lit Integer
  | Mul Exp Exp
  | Add Exp Exp
  | Sub Exp Exp
  deriving (Show)
```

We would like to parse `2 * 3 + 4` as `Add (Mul (Lit 2) (Lit 3)) (Lit 4)`, and pretty-print `Mul (Add (Lit 2) (Lit 3)) (Lit 4)` as `(2 + 3) * 4`, but never emit extra parentheses.

While this is not a trivial problem, by itself it's not a particularly interesting one either. The generalization I've been thinking about recently is whether it's possible to do these transformations for an arbitrary context-free grammar. The question requires a bit more elaboration, though.

Let's briefly consider how one would write a parser for such a language using a parser generator like Happy or using parser combinators: one would define some recursive datatypes (a single datatype in the case of arithmetic expressions) and then write either a grammar or a combinator-based parser that uses the constructors of those datatypes.

However, more often than not the grammar rules closely correspond to the structure of the defined datatypes, and one might wonder whether we could eliminate the repetitive work and define both simultaneously.

After a couple of evenings of work, it turns out that the answer is a resounding "yes" for LL(1) grammars. To generate datatypes for the parse trees of such a grammar, one can:

- Define one datatype per non-terminal
- Define one constructor per grammar rule

Adding names to the grammar rules makes the generated types more or less pleasant to use.

To illustrate this, consider an LL(1) grammar for arithmetic expressions (treating integer literals as terminals in the grammar for simplicity; rule names are omitted):

```
Term ::= Factor TermCont
TermCont ::=
TermCont ::= + Factor TermCont
TermCont ::= - Factor TermCont
Factor ::= Primary FactorCont
FactorCont ::=
FactorCont ::= * Primary FactorCont
Primary ::= int
Primary ::= ( Term )
```
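
To make the grammar concrete, here is the leftmost derivation of `2 * 3 + 4` (at every step the leftmost non-terminal is expanded; the empty rules erase `FactorCont` and `TermCont` when no operator follows):

```
Term => Factor TermCont
     => Primary FactorCont TermCont
     => 2 FactorCont TermCont
     => 2 * Primary FactorCont TermCont
     => 2 * 3 FactorCont TermCont
     => 2 * 3 TermCont
     => 2 * 3 + Factor TermCont
     => 2 * 3 + Primary FactorCont TermCont
     => 2 * 3 + 4 FactorCont TermCont
     => 2 * 3 + 4 TermCont
     => 2 * 3 + 4
```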

The generated datatypes for this grammar would look like this:

Do you remember when I said that this post is "more or less" literate Haskell? The following block is legal Haskell code (or at least it should be; I haven't actually tried to load it, so don't quote me on that), but it is not considered part of the program.

```
data Term = Term Factor TermCont
data TermCont
  = TC1
  | TC2 Factor TermCont
  | TC3 Factor TermCont
data Factor = Factor Primary FactorCont
data FactorCont
  = FC1
  | FC2 Primary FactorCont
data Primary
  = P1 Integer
  | P2 Term
```

When parsing by recursive descent, it's obvious which constructor to choose (the one that corresponds to the chosen rule). Note that the generated datatypes are also trivial to pretty-print, since the bracket structure is encoded in the datatypes themselves, so the printer never has to decide whether to add brackets in any given situation.

However, the problem with such a tree is that it's hard to work with: conceptually, arithmetic expressions are most easily expressed by the single recursive datatype presented earlier, but here one has to work with several mutually recursive datatypes at once, which complicates every function significantly.
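
For example, an evaluator over `Exp` is a single short function (a minimal sketch; the `Exp` definition from the introduction is repeated here so the snippet stands alone):

```haskell
{-# LANGUAGE LambdaCase #-}

-- repeated from the introduction so this sketch is self-contained
data Exp
  = Lit Integer
  | Mul Exp Exp
  | Add Exp Exp
  | Sub Exp Exp

-- one recursive function over one recursive datatype
eval :: Exp -> Integer
eval = \case
  Lit n   -> n
  Mul a b -> eval a * eval b
  Add a b -> eval a + eval b
  Sub a b -> eval a - eval b

-- eval (Add (Mul (Lit 2) (Lit 3)) (Lit 4)) == 10
```

The same evaluator over the `Term`/`TermCont`/`Factor`/`FactorCont`/`Primary` types would need one function per datatype, all mutually recursive.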

So the problem I would like to solve at some point is: from a single definition of a grammar, generate the datatype definitions, the parser, and the pretty-printer. The datatypes should be convenient to use, and the pretty-printer should emit the minimal number of brackets.

I have no idea how to solve that problem. To generate concise datatypes, plain BNF doesn't quite cut it: possible solutions might be to add direct support for describing binary operators, their priorities, and their associativities to the grammar, or to enrich the grammar definition with tree rewrite rules that produce more compact trees.

To get some inspiration about the general problem, I've decided to handroll the code that I would eventually like to see generated; writing the desired output by hand seems like a useful first step for any code generation problem. So, yeah, we're getting back to the problem stated in the introduction, which I said was not interesting by itself, but which becomes more interesting as a special case of this generalization.

With the AST datatype that is pleasant to work with (`Exp`) in hand, let's define datatypes that more closely correspond to something that can be generated automatically:

```
data Term = Term Factor [(Tag, Factor)]
  deriving (Show)
data Factor = Factor Primary [Primary]
  deriving (Show)
data Primary
  = PrimLit Integer
  | Brackets Term
  deriving (Show)
data Tag
  = AddTag
  | SubTag
  deriving (Show)
```

Note that these were not actually generated: the code generator for LL(1) grammars I mentioned earlier is written in Kotlin (and generates Kotlin code), and I'm not particularly interested in porting throwaway experimental code. Also, these datatypes are a bit more pleasant to work with than the ones I got generated (i.e. `TermCont` and `FactorCont` are replaced with lists), but they are not far off and *could* probably be generated if I weren't that lazy.
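
To make the shape of these trees concrete, here is `2 * 3 + 4` written out as a value of these types (the datatype definitions are repeated so the sketch stands alone):

```haskell
data Term = Term Factor [(Tag, Factor)] deriving (Show)
data Factor = Factor Primary [Primary] deriving (Show)
data Primary = PrimLit Integer | Brackets Term deriving (Show)
data Tag = AddTag | SubTag deriving (Show)

-- "2 * 3 + 4": the multiplication lives inside the first Factor,
-- and the addition is the single (AddTag, ...) continuation
example :: Term
example = Term (Factor (PrimLit 2) [PrimLit 3])
               [(AddTag, Factor (PrimLit 4) [])]
```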

Most importantly, they retain the property that pretty-printing can be directed by the data structure itself, without keeping extra state in the recursive functions, and that only the parentheses that are actually needed get written down:

```
printTerm :: Term -> [String]
printTerm (Term f ts) = concat [printFactor f, concatMap printPair ts]
  where printPair (tag, f') = case tag of
          AddTag -> "+" : printFactor f'
          SubTag -> "-" : printFactor f'

printFactor :: Factor -> [String]
printFactor (Factor p ps) = concat [printPrimary p, concatMap printMul ps]
  where printMul p' = "*" : printPrimary p'

printPrimary :: Primary -> [String]
printPrimary = \case
  PrimLit lit -> [show lit]
  Brackets term -> ["("] ++ printTerm term ++ [")"]

printT :: Term -> String
printT = concat . printTerm
```

I'm not sure whether having the functions return lists of strings is justified, but that's how I wrote it first, and I'm too lazy to rewrite it now, so whatever.
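
As a sanity check, the printers really do emit exactly one pair of brackets for `(2 + 3) * 4` (a standalone sketch; the datatypes and printers above are repeated so the block can be loaded on its own):

```haskell
{-# LANGUAGE LambdaCase #-}

data Term = Term Factor [(Tag, Factor)]
data Factor = Factor Primary [Primary]
data Primary = PrimLit Integer | Brackets Term
data Tag = AddTag | SubTag

printTerm :: Term -> [String]
printTerm (Term f ts) = concat [printFactor f, concatMap printPair ts]
  where printPair (tag, f') = case tag of
          AddTag -> "+" : printFactor f'
          SubTag -> "-" : printFactor f'

printFactor :: Factor -> [String]
printFactor (Factor p ps) = concat [printPrimary p, concatMap printMul ps]
  where printMul p' = "*" : printPrimary p'

printPrimary :: Primary -> [String]
printPrimary = \case
  PrimLit lit -> [show lit]
  Brackets term -> ["("] ++ printTerm term ++ [")"]

printT :: Term -> String
printT = concat . printTerm

-- the parse tree of (2 + 3) * 4: only the Brackets node produces parentheses
example :: Term
example = Term (Factor (Brackets inner) [PrimLit 4]) []
  where inner = Term (Factor (PrimLit 2) []) [(AddTag, Factor (PrimLit 3) [])]

-- printT example == "(2+3)*4"
```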

Conversion functions from parse trees to ASTs are relatively trivial:

```
termToExp :: Term -> Exp
termToExp (Term first rest) = foldl build (factorToExp first) rest
  where
    build e (tag, fact) = case tag of
      AddTag -> Add e (factorToExp fact)
      SubTag -> Sub e (factorToExp fact)

factorToExp :: Factor -> Exp
factorToExp (Factor first rest) = foldl Mul (primaryToExp first) (primaryToExp <$> rest)

primaryToExp :: Primary -> Exp
primaryToExp = \case
  PrimLit p -> Lit p
  Brackets t -> termToExp t
```

The reverse conversion is not so trivial. The main problem is that we need to produce one of three different datatypes depending on the constructor. To solve it, I have an auxiliary definition of a generic `Either`-like sum of three types, plus a couple of helpers that "associate" it into nested `Either`s in two different ways:

```
data U3 a b c
  = O1 a
  | O2 b
  | O3 c
  deriving (Show)

h1 :: U3 a b c -> Either a (Either b c)
h1 = \case
  O1 a -> Left a
  O2 b -> Right (Left b)
  O3 c -> Right (Right c)

h2 :: U3 a b c -> Either b (Either a c)
h2 = \case
  O1 a -> Right (Left a)
  O2 b -> Left b
  O3 c -> Right (Right c)
```

And here is the main course, the actual conversion from ASTs to parse trees:

```
expToParseTree :: Exp -> U3 Term Factor Primary
expToParseTree = \case
  Lit p -> O3 (PrimLit p)
  Add e1 e2 ->
    let combineTerms (Term f1 r1) (Term f2 r2) = O1 (Term f1 rest)
          where rest = r1 ++ [(AddTag, f2)] ++ r2
        toTerm = either id liftToTerm . h1 . expToParseTree
    in combineTerms (toTerm e1) (toTerm e2)
  Sub e1 e2 ->
    let combineTerms (Term f1 r1) (Term f2 r2) = O1 (Term f1 rest)
          where rest = r1 ++ [(SubTag, f2)] ++ r2
        toTerm = either id liftToTerm . h1 . expToParseTree
    in combineTerms (toTerm e1) (toTerm e2)
  Mul e1 e2 ->
    let combineFactors (Factor p1 r1) (Factor p2 r2) = O2 (Factor p1 rest)
          where rest = r1 ++ [p2] ++ r2
        toFactor = either id liftToFactor . h2 . expToParseTree
    in combineFactors (toFactor e1) (toFactor e2)
```

`liftToTerm` and `liftToFactor` are auxiliary functions that "lift" one non-terminal type into another; the latter also ensures that extra brackets are not introduced unnecessarily, by avoiding wrapping a `Term` that consists of a single `Factor`:

```
liftToTerm :: Either Factor Primary -> Term
liftToTerm = \case
  Left f -> Term f []
  Right p -> Term (Factor p []) []

liftToFactor :: Either Term Primary -> Factor
liftToFactor = \case
  Left (Term f []) -> f
  Left t -> Factor (Brackets t) []
  Right p -> Factor p []
```

And another "runner" function converts the result to a `Term`, no matter which of the three types was actually returned:

```
expToTerm :: Exp -> Term
expToTerm = either id liftToTerm . h1 . expToParseTree
```

To test the whole thing end-to-end, I've also implemented a parser from `String`s to parse trees using handrolled parser combinators.

I always feel bad about writing my own parser combinators again and again, but with the frequency I've been writing Haskell lately (once in a blue moon), it's much easier than figuring out how to use, say, Megaparsec with custom tokens.

```
newtype Parser t a = Parser { runParser :: [t] -> Either String ([t], a) }
  deriving (Functor)

instance Applicative (Parser t) where
  pure x = Parser (\input -> Right (input, x))
  (<*>) = ap

instance Monad (Parser t) where
  px >>= f = Parser $ \input ->
    case runParser px input of
      Left err -> Left err
      Right (next, x) -> runParser (f x) next

token :: String -> (t -> Either String r) -> Parser t r
token name f = Parser $ \case
  [] -> Left $ concat ["expected ", name, " got end of input"]
  (x : xs) -> do
    r <- f x
    pure (xs, r)

eat :: (Eq t, Show t) => t -> Parser t ()
eat t = token (show t) $ \got ->
  if t == got
    then Right ()
    else Left $ concat ["expected ", show t, " got ", show got]

integer :: Parser String Integer
integer = token "integer" $ \tok ->
  case readMaybe tok of
    Just x -> Right x
    Nothing -> Left $ concat ["expected integer, got ", tok]

parseOr :: Parser t a -> Parser t a -> Parser t a
parseOr p1 p2 = Parser $ \input ->
  case runParser p1 input of
    Right (rest, x) -> Right (rest, x)
    Left _ -> runParser p2 input

parseSequence :: Parser t a -> Parser t [a]
parseSequence p = go
  where go = ((:) <$> p <*> go) `parseOr` pure []

parseTerm :: Parser String Term
parseTerm = Term <$> parseFactor <*> parseSequence ((,) <$> parseTag <*> parseFactor)
  where parseTag = (eat "+" *> pure AddTag) `parseOr` (eat "-" *> pure SubTag)

parseFactor :: Parser String Factor
parseFactor = Factor <$> parsePrimary <*> parseSequence (eat "*" *> parsePrimary)

parsePrimary :: Parser String Primary
parsePrimary = (eat "(" *> (Brackets <$> parseTerm) <* eat ")") `parseOr` (PrimLit <$> integer)

parse :: String -> Either String Term
parse input = snd <$> runParser parseTerm (tokenize input)
```

A brief aside about the tokenizer below: I don't like the idea of using raw character streams as parser input, so the parser combinators above assume the input has already been tokenized. However, for experimental programs such as this one, writing a tokenizer that does much more than chunk multi-character tokens is probably overkill, so the tokenizer is just a (total) function that returns a list of `String`s:

```
tokenize :: String -> [String]
-- e.g. tokenize "12 * (3+4)" == ["12","*","(","3","+","4",")"]
tokenize = \case
  [] -> []
  input@(f : _)
    | isDigit f ->
        let (digits, rest) = span isDigit input in digits : tokenize rest
  (f : rest)
    | isSpace f -> tokenize rest
    | otherwise -> [f] : tokenize rest
```

In the end, I still have no idea how to generate the code for converting ASTs to parse trees, because the code I ended up with looks rather ad hoc and tied to this specific datatype. However, it might become possible if, say, the BNF notation used to describe the grammar included a primitive for declaring binary operators on non-terminals.

This might actually be the more pragmatic solution anyway, since it feels more natural than splitting the definition of a syntactic category into several non-terminals just to appease the parsing algorithm.
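
As a sketch of what such a primitive might look like (this notation is entirely hypothetical; nothing in this post implements it):

```
Exp ::= int
Exp ::= ( Exp )
Exp ::= Exp + Exp   [infixl 6]
Exp ::= Exp - Exp   [infixl 6]
Exp ::= Exp * Exp   [infixl 7]
```

From the priorities and associativities, a generator could in principle derive the `Term`/`Factor`/`Primary` stratification, the flat `Exp` datatype, and both conversions, so the user would never have to see the auxiliary non-terminals.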