Joachim Breitner's Homepage
Left recursive parser combinators via sharing
At this year’s ICFP in Seattle I gave a talk about my rec-def Haskell library, which I have blogged about before here. While my functional pearl paper focuses on a concrete use-case and the tricks of the implementation, in my talk I put the emphasis on the high-level idea: it behooves a declarative lazy functional language like Haskell that recursive equations just work whenever they describe a (unique) solution. Like in the paper, I used equations between sets as the running example, and only conjectured that it should also work for other domains, in particular parser combinators.
Naturally, someone called my bluff and asked if I had actually tried it. I had not, but I should have, because it works nicely and is actually more straightforward than with sets. I wrote up a prototype and showed it off a few days later as a lightning talk at the Haskell Symposium; here is the write-up that goes along with that.
Parser combinators
Parser combinator libraries provide small functions (combinators) that you compose to define your parser directly in your programming language, as opposed to using an external tool that reads a grammar description and generates parser code. They are quite popular in Haskell (e.g. parsec, attoparsec, megaparsec).
Let us define a little parser that recognizes sequences of a’s:
ghci> let aaa = tok 'a' *> aaa <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing
Left-recursion
This works nicely, but just because we were lucky: We wrote the parser to recurse on the right (of the *>), and this happens to work. If we put the recursive call first, it doesn’t anymore, because the parser now calls itself before consuming any input and so never makes progress:
ghci> let aaa = aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
^CInterrupted.
This is a well-known problem (see for example Nicolas Wu’s overview paper); all the common parser combinator libraries cannot handle it, and the usual advice is to refactor your grammar to avoid left recursion.
But there are some libraries that can handle left recursion, at least with a little help from the programmer. I found two variations:
The library provides an explicit fix point combinator, and as long as that is used, left-recursion works. This is for example described by Frost, Hafiz and Callaghan, and (of course) Oleg Kiselyov has an implementation of this too.
The library expects explicit labels on recursive productions, so that it can recognize left-recursion. I found an implementation of this idea in the Agda.Utils.Parser.MemoisedCPS module in the Agda code; the gll library seems to follow this style, and Jaro discusses it as well.
I took the module from the Agda source and simplified it a bit for the purposes of this demonstration (Parser.hs). Indeed, I can make the left-recursive grammar work:
ghci> let aaa = memoise ":-)" $ aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing
It does not matter what I pass to memoise, as long as I do not pass the same key when memoising a different parser.
For reference, an excerpt of the API of Parser:
data Parser k tok a -- k is type of keys, tok type of tokens (e.g. Char)
instance Functor (Parser k tok)
instance Applicative (Parser k tok)
instance Alternative (Parser k tok)
instance Monad (Parser k tok)
parse :: Parser k tok a -> [tok] -> Maybe a
sat :: (tok -> Bool) -> Parser k tok tok
tok :: Eq tok => tok -> Parser k tok tok
memoise :: Ord k => k -> Parser k tok a -> Parser k tok a
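To illustrate the point about keys, here is a small sketch of my own (not from the original module): two different left-recursive parsers, each memoised under its own key, following the rule above that the same key must not be reused for a different parser.

-- hypothetical example, with String as the key type
as, bs :: Parser String Char ()
as = memoise "as" $ as <* tok 'a' <|> pure ()  -- this key marks this recursive production
bs = memoise "bs" $ bs <* tok 'b' <|> pure ()  -- a different parser, hence a different key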
Left-recursion through sharing
To follow the agenda set out in my talk, I now want to wrap that parser in a way that relieves me from having to insert the calls to memoise. To start, I import that parser qualified, define a newtype around it, and start lifting some of the functions:
import qualified Parser as P

newtype Parser tok a = MkP { unP :: P.Parser Unique tok a }

parse :: Parser tok a -> [tok] -> Maybe a
parse (MkP p) = P.parse p

sat :: Typeable tok => (tok -> Bool) -> Parser tok tok
sat p = MkP (P.sat p)

tok :: Eq tok => tok -> Parser tok tok
tok t = MkP (P.tok t)
So far, nothing interesting had to happen, because so far I cannot build recursive parsers. The first interesting combinator that allows me to do that is <*> from the Applicative class, so I should use memoise there. The question is: Where does the unique key come from?
Proprioception
As with the rec-def library, pure code won’t do, and I have to get my hands dirty: I really want a fresh unique label out of thin air. To that end, I define the following combinator, with naming aided by Richard Eisenberg:
propriocept :: (Unique -> a) -> a
propriocept f = unsafePerformIO $ f <$> newUnique
A thunk defined with propriocept will know about its own identity, and will be able to tell itself apart from other such thunks. This gives us a form of observable sharing, precisely what we need. But before we return to our parser combinators, let us briefly explore this combinator.
Using propriocept I can define an operation cons :: [Int] -> [Int] that records (the hash of) this Unique in the list:
ghci> let cons xs = propriocept (\x -> hashUnique x : xs)
ghci> :t cons
cons :: [Int] -> [Int]
This lets us see the identity of a list cell, that is, of the concrete object in memory.
Naturally, if we construct a finite list, each list cell is different:
ghci> cons (cons (cons []))
[1,2,3]
And if we do that again, we see that fresh list cells are allocated:
ghci> cons (cons (cons []))
[4,5,6]
We can create an infinite list; if we do it without sharing, every cell is separate:
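(The definition of acyclic does not appear in this write-up; presumably it is something along these lines, a guess on my part, with the Int argument keeping the recursive calls from being shared:)

-- hypothetical reconstruction, not the original definition
acyclic :: Int -> [Int]
acyclic n = cons (acyclic (n + 1))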
ghci> take 20 (acyclic 0)
[7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
but if we tie the knot using sharing, all the cells in the list are actually the same:
ghci> let cyclic = cons cyclic
ghci> take 20 cyclic
[27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
We can achieve the same using fix from Data.Function:
ghci> import Data.Function
ghci> take 20 (fix cons)
[28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28]
I explore these heap structures more visually in a series of screencasts.
So with propriocept we can distinguish different heap objects, and also recognize when we come across the same heap object again.
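To make the second half of that claim concrete, here is a small sketch of my own (the names Cell, cell and revisits are made up for this example): by storing the Unique in a data structure, a traversal can notice when it reaches a heap object it has already passed through, instead of looping.

import Data.Unique (Unique)
import qualified Data.Set as Set

-- a cons-cell-like structure that remembers its own identity
data Cell = Cell Unique [Cell]

cell :: [Cell] -> Cell
cell xs = propriocept (\u -> Cell u xs)

-- walk the structure, stopping as soon as we see a cell we visited before
revisits :: Cell -> Bool
revisits = go Set.empty
  where
    go seen (Cell u cs)
      | u `Set.member` seen = True
      | otherwise           = any (go (Set.insert u seen)) cs

With this, revisits (let c = cell [c] in c) should return True, while revisits (cell [cell []]) returns False.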
Left-recursion through sharing (cont.)
With that we return to our parser. We define a smart constructor for the new Parser that passes the unique from propriocept to the underlying parser’s memoise function:
withMemo :: P.Parser Unique tok a -> Parser tok a
withMemo p = propriocept $ \u -> MkP $ P.memoise u p
If we now use this in the definition of all possibly recursive parsers, then the necessary calls to memoise will be in place:
instance Functor (Parser tok) where
  fmap f p = withMemo (fmap f (unP p))

instance Applicative (Parser tok) where
  pure x = MkP (pure x)
  p1 <*> p2 = withMemo (unP p1 <*> unP p2)

instance Alternative (Parser tok) where
  empty = MkP empty
  p1 <|> p2 = withMemo (unP p1 <|> unP p2)

instance Monad (Parser tok) where
  return = pure
  p1 >>= f = withMemo $ unP p1 >>= unP . f
And indeed, it works (see RParser.hs for the full code):
ghci> let aaa = aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing
A larger example
Let us try this on a larger example, and parse (simple) BNF grammars. Here is a data type describing them:
type Ident = String
type RuleRhs = [Seq]
type Seq = [Atom]
data Atom = Lit String | NonTerm Ident deriving Show
type Rule = (Ident, RuleRhs)
type BNF = [Rule]
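As a small illustration of my own (not from the post): the rule digit := '0' | pdigit; of the example grammar below would be represented as

digitRule :: Rule
digitRule = ("digit", [ [Lit "0"], [NonTerm "pdigit"] ])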
For the concrete syntax, I’d like to be able to parse something like
numExp :: String
numExp = unlines
  [ "term := sum;"
  , "pdigit := '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
  , "digit := '0' | pdigit;"
  , "pnum := pdigit | pnum digit;"
  , "num := '0' | pnum;"
  , "prod := atom | atom '*' prod;"
  , "sum := prod | prod '+' sum;"
  , "atom := num | '(' term ')';"
  ]
so here is a possible parser; mostly a straightforward use of parser combinators:
type P = Parser Char

snoc :: [a] -> a -> [a]
snoc xs x = xs ++ [x]

l :: P a -> P a
l p = p <|> l p <* sat isSpace

quote :: P Char
quote = tok '\''

quoted :: P a -> P a
quoted p = quote *> p <* quote

str :: P String
str = some (sat (not . (== '\'')))

ident :: P Ident
ident = some (sat (\c -> isAlphaNum c && isAscii c))

atom :: P Atom
atom = Lit <$> l (quoted str)
   <|> NonTerm <$> l ident

eps :: P ()
eps = void $ l (tok 'ε')

sep :: P ()
sep = void $ some (sat isSpace)

sq :: P Seq
sq = [] <$ eps
 <|> snoc <$> sq <* sep <*> atom
 <|> pure <$> atom

ruleRhs :: P RuleRhs
ruleRhs = pure <$> sq
      <|> snoc <$> ruleRhs <* l (tok '|') <*> sq

rule :: P Rule
rule = (,) <$> l ident <* l (tok ':' *> tok '=') <*> ruleRhs <* l (tok ';')

bnf :: P BNF
bnf = pure <$> rule
  <|> snoc <$> bnf <*> rule
I somewhat sillily used snoc rather than (:) to build my lists, just so that I can show off all the left-recursion in this grammar.
Sharing is tricky
Let’s try it:
ghci> parse bnf numExp
^CInterrupted.
What a pity, it does not work! What went wrong?
The underlying library can handle left-recursion if it can recognize it by seeing a memoise label passed again. This works fine in all the places where we re-use a parser definition (e.g. in bnf), but it really requires that values are shared!
If we look carefully at our definition of l (which parses a lexeme, i.e. something possibly followed by whitespace), then it recurses via a fresh function call, and the program will keep expanding the definition – just like the acyclic above:
l :: P a -> P a
l p = p <|> l p <* sat isSpace
The fix (sic!) is to make sure that the recursive call is using the parser we are currently defining, which we can easily do with a local definition:
l :: P a -> P a
l p = p'
  where p' = p <|> p' <* sat isSpace
With this little fix, the parser can parse the example grammar:
ghci> parse bnf numExp
Just [("term",[[NonTerm "sum"]]),("pdigit",[[Lit "1"],…
Going meta
The main demonstration is over, but since we now already have a parser for grammar descriptions at hand, let’s go a bit further and dynamically construct a parser from such a description. The parser should only accept strings according to that grammar, and return a parse tree annotated with the non-terminals used:
interp :: BNF -> P String
interp bnf = parsers M.! start
  where
    start :: Ident
    start = fst (head bnf)

    parsers :: M.Map Ident (P String)
    parsers = M.fromList [ (i, parseRule i rhs) | (i, rhs) <- bnf ]

    parseRule :: Ident -> RuleRhs -> P String
    parseRule ident rhs = trace <$> asum (map parseSeq rhs)
      where trace s = ident ++ "(" ++ s ++ ")"

    parseSeq :: Seq -> P String
    parseSeq = fmap concat . traverse parseAtom

    parseAtom :: Atom -> P String
    parseAtom (Lit s)     = traverse tok s
    parseAtom (NonTerm i) = parsers M.! i
Let’s see it in action (full code in BNFEx.hs):
ghci> Just bnfp = parse bnf numExp
ghci> :t bnfp
bnfp :: BNF
ghci> parse (inter
interact   interp
ghci> parse (interp bnfp) "12+3*4"
Just "term(sum(prod(atom(num(pnum(pnum(pdigit(1))digit(pdigit(2))))))+sum(prod(atom(num(pnum(pdigit(3))))*prod(atom(num(pnum(pdigit(4)))))))))"
ghci> parse (interp bnfp) "12+3*4+"
Nothing
It is worth noting that the numExp grammar is also left-recursive, so implementing interp with a conventional parser combinator library would not have worked. But thanks to our propriocept trick, it does! Again, the sharing is important; in the code above it is the map parsers that is defined in terms of itself, and this ensures that the left-recursive productions work.
Closing thoughts
I am using unsafePerformIO, so I need to justify its use. Clearly, propriocept is not a pure function, and its type is a lie. In general, using it will break the nice equational properties of Haskell, as we have seen in our experiments with cons.
In the case of our parser library, however, we use it in specific ways, namely to feed a fresh name to memoise. Assuming the underlying parser library’s behavior does not observably depend on where and with which key memoise is used, this results in a properly pure interface, and all is well again. (NB: I did not investigate if this assumption actually holds for the parser library used here, or if it may for example affect the order of parse trees returned.)
I also expect that this implementation, which will memoise every parser involved, will be rather slow. It seems plausible to analyze the graph structure and figure out which memoise calls are actually needed to break left-recursion (at least if we drop the Monad instance or always memoise >>=).
If you liked this post, you might enjoy reading the paper about rec-def, watching one of my talks about it (MuniHac, BOBKonf, ICFP23; the presentation evolved over time), or, if you just want to see more about how things are laid out on Haskell’s heap, my screencasts exploring the Haskell heap.
Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.