Joachim Breitner's Homepage
Left recursive parser combinators via sharing
At this year’s ICFP in Seattle I gave a talk about my rec-def Haskell library, which I have blogged about before here. While my functional pearl paper focuses on a concrete use-case and the tricks of the implementation, in my talk I put the emphasis on the high-level idea: it behooves a declarative lazy functional language like Haskell that recursive equations just work whenever they describe a (unique) solution. Like in the paper, I used equations between sets as the running example, and only conjectured that it should also work for other domains, in particular parser combinators.
Naturally, someone called my bluff and asked if I had actually tried it. I had not, but I should have, because it works nicely and is actually more straightforward than with sets. I wrote up a prototype and showed it off a few days later as a lightning talk at the Haskell Symposium; here is the write-up that goes along with that.
Parser combinators
Parser combinator libraries provide small functions (combinators) that you compose to define your parser directly in your programming language, as opposed to using an external tool that reads a grammar description and generates parser code. They are quite popular in Haskell (e.g. parsec, attoparsec, megaparsec).
Let us define a little parser that recognizes sequences of a’s:
ghci> let aaa = tok 'a' *> aaa <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing
Left-recursion
This works nicely, but just because we were lucky: We wrote the parser to recurse on the right (of the *>), and this happens to work. If we put the recursive call first, it doesn’t anymore, because the parser now calls itself before consuming any input and so never makes progress:
ghci> let aaa = aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
^CInterrupted.
This is a well-known problem (see for example Nicolas Wu’s overview paper); all the common parser combinator libraries cannot handle it, and the usual advice is to refactor your grammar to avoid left recursion.
But there are some libraries that can handle left recursion, at least with a little help from the programmer. I found two variations:
The library provides an explicit fix point combinator, and as long as that is used, left-recursion works. This is for example described by Frost, Hafiz and Callaghan, and (of course) Oleg Kiselyov has an implementation of this too.
The library expects explicit labels on recursive productions, so that it can recognize left-recursion. I found an implementation of this idea in the Agda.Utils.Parser.MemoisedCPS module in the Agda code; the gll library seems to follow this style, and Jaro discusses it as well.
I took the module from the Agda source and simplified it a bit for the purposes of this demonstration (Parser.hs). Indeed, I can make the left-recursive grammar work:
ghci> let aaa = memoise ":-)" $ aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing
It does not matter what I pass to memoise, as long as I do not pass the same key when memoising a different parser.
For reference, an excerpt of the API of Parser:
data Parser k tok a -- k is type of keys, tok type of tokens (e.g. Char)
instance Functor (Parser k tok)
instance Applicative (Parser k tok)
instance Alternative (Parser k tok)
instance Monad (Parser k tok)
parse :: Parser k tok a -> [tok] -> Maybe a
sat :: (tok -> Bool) -> Parser k tok tok
tok :: Eq tok => tok -> Parser k tok tok
memoise :: Ord k => k -> Parser k tok a -> Parser k tok a
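To illustrate the point about keys, here is a small sketch of my own (not from the original module): two different left-recursive parsers, each memoised under its own key, following the rule above that the same key must not be reused for a different parser.

-- hypothetical example, with String as the key type
as, bs :: Parser String Char ()
as = memoise "as" $ as <* tok 'a' <|> pure ()  -- this key marks this recursive production
bs = memoise "bs" $ bs <* tok 'b' <|> pure ()  -- a different parser, hence a different key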
Left-recursion through sharing
To follow the agenda set out in my talk, I now want to wrap that parser in a way that relieves me from having to insert the calls to memoise. To start, I import that parser qualified, define a newtype around it, and start lifting some of the functions:
import qualified Parser as P

newtype Parser tok a = MkP { unP :: P.Parser Unique tok a }

parse :: Parser tok a -> [tok] -> Maybe a
parse (MkP p) = P.parse p

sat :: Typeable tok => (tok -> Bool) -> Parser tok tok
sat p = MkP (P.sat p)

tok :: Eq tok => tok -> Parser tok tok
tok t = MkP (P.tok t)
So far, nothing interesting had to happen, because so far I cannot build recursive parsers. The first interesting combinator that allows me to do that is <*> from the Applicative class, so I should use memoise there. The question is: Where does the unique key come from?
Proprioception
As with the rec-def library, pure code won’t do, and I have to get my hands dirty: I really want a fresh unique label out of thin air. To that end, I define the following combinator, with naming aided by Richard Eisenberg:
propriocept :: (Unique -> a) -> a
propriocept f = unsafePerformIO $ f <$> newUnique
A thunk defined with propriocept will know about its own identity, and will be able to tell itself apart from other such thunks. This gives us a form of observable sharing, precisely what we need. But before we return to our parser combinators, let us briefly explore this combinator.
Using propriocept I can define an operation cons :: [Int] -> [Int] that records (the hash of) this Unique in the list:
ghci> let cons xs = propriocept (\x -> hashUnique x : xs)
ghci> :t cons
cons :: [Int] -> [Int]
This lets us see the identity of a list cell, that is, of the concrete object in memory.
Naturally, if we construct a finite list, each list cell is different:
ghci> cons (cons (cons []))
[1,2,3]
And if we do that again, we see that fresh list cells are allocated:
ghci> cons (cons (cons []))
[4,5,6]
We can create an infinite list; if we do it without sharing, every cell is separate:
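(The definition of acyclic does not appear in this write-up; presumably it is something along these lines, a guess on my part, with the Int argument keeping the recursive calls from being shared:)

-- hypothetical reconstruction, not the original definition
acyclic :: Int -> [Int]
acyclic n = cons (acyclic (n + 1))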
ghci> take 20 (acyclic 0)
[7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]
but if we tie the knot using sharing, all the cells in the list are actually the same:
ghci> let cyclic = cons cyclic
ghci> take 20 cyclic
[27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27,27]
We can achieve the same using fix from Data.Function:
ghci> import Data.Function
ghci> take 20 (fix cons)
[28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28,28]
I explore these heap structures more visually in a series of screencasts.
So with propriocept we can distinguish different heap objects, and also recognize when we come across the same heap object again.
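To make the second half of that claim concrete, here is a small sketch of my own (the names Cell, cell and revisits are made up for this example): by storing the Unique in a data structure, a traversal can notice when it reaches a heap object it has already passed through, instead of looping.

import Data.Unique (Unique)
import qualified Data.Set as Set

-- a cons-cell-like structure that remembers its own identity
data Cell = Cell Unique [Cell]

cell :: [Cell] -> Cell
cell xs = propriocept (\u -> Cell u xs)

-- walk the structure, stopping as soon as we see a cell we visited before
revisits :: Cell -> Bool
revisits = go Set.empty
  where
    go seen (Cell u cs)
      | u `Set.member` seen = True
      | otherwise           = any (go (Set.insert u seen)) cs

With this, revisits (let c = cell [c] in c) should return True, while revisits (cell [cell []]) returns False.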
Left-recursion through sharing (cont.)
With that we return to our parser. We define a smart constructor for the new Parser that passes the unique from propriocept to the underlying parser’s memoise function:
withMemo :: P.Parser Unique tok a -> Parser tok a
withMemo p = propriocept $ \u -> MkP $ P.memoise u p
If we now use this in the definition of all possibly recursive parsers, then the necessary calls to memoise will be in place:
instance Functor (Parser tok) where
  fmap f p = withMemo (fmap f (unP p))

instance Applicative (Parser tok) where
  pure x = MkP (pure x)
  p1 <*> p2 = withMemo (unP p1 <*> unP p2)

instance Alternative (Parser tok) where
  empty = MkP empty
  p1 <|> p2 = withMemo (unP p1 <|> unP p2)

instance Monad (Parser tok) where
  return = pure
  p1 >>= f = withMemo $ unP p1 >>= unP . f
And indeed, it works (see RParser.hs for the full code):
ghci> let aaa = aaa <* tok 'a' <|> pure ()
ghci> parse aaa "aaaa"
Just ()
ghci> parse aaa "aabaa"
Nothing
A larger example
Let us try this on a larger example, and parse (simple) BNF grammars. Here is a data type describing them:
type Ident = String
type RuleRhs = [Seq]
type Seq = [Atom]
data Atom = Lit String | NonTerm Ident deriving Show
type Rule = (Ident, RuleRhs)
type BNF = [Rule]
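As a small illustration of my own (not from the post): the rule digit := '0' | pdigit; of the example grammar below would be represented as

digitRule :: Rule
digitRule = ("digit", [ [Lit "0"], [NonTerm "pdigit"] ])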
For the concrete syntax, I’d like to be able to parse something like
numExp :: String
numExp = unlines
  [ "term := sum;"
  , "pdigit := '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9';"
  , "digit := '0' | pdigit;"
  , "pnum := pdigit | pnum digit;"
  , "num := '0' | pnum;"
  , "prod := atom | atom '*' prod;"
  , "sum := prod | prod '+' sum;"
  , "atom := num | '(' term ')';"
  ]
so here is a possible parser; mostly a straightforward use of parser combinators:
type P = Parser Char

snoc :: [a] -> a -> [a]
snoc xs x = xs ++ [x]

l :: P a -> P a
l p = p <|> l p <* sat isSpace

quote :: P Char
quote = tok '\''

quoted :: P a -> P a
quoted p = quote *> p <* quote

str :: P String
str = some (sat (not . (== '\'')))

ident :: P Ident
ident = some (sat (\c -> isAlphaNum c && isAscii c))

atom :: P Atom
atom = Lit <$> l (quoted str)
   <|> NonTerm <$> l ident

eps :: P ()
eps = void $ l (tok 'ε')

sep :: P ()
sep = void $ some (sat isSpace)

sq :: P Seq
sq = [] <$ eps
 <|> snoc <$> sq <* sep <*> atom
 <|> pure <$> atom

ruleRhs :: P RuleRhs
ruleRhs = pure <$> sq
      <|> snoc <$> ruleRhs <* l (tok '|') <*> sq

rule :: P Rule
rule = (,) <$> l ident <* l (tok ':' *> tok '=') <*> ruleRhs <* l (tok ';')

bnf :: P BNF
bnf = pure <$> rule
  <|> snoc <$> bnf <*> rule
I somewhat sillily used snoc rather than (:) to build my lists, just so that I can show off all the left-recursion in this grammar.
Sharing is tricky
Let’s try it:
ghci> parse bnf numExp
^CInterrupted.
What a pity, it does not work! What went wrong?
The underlying library can handle left-recursion if it can recognize it by seeing a memoise label passed again. This works fine in all the places where we re-use a parser definition (e.g. in bnf), but it really requires that values are shared!
If we look carefully at our definition of l (which parses a lexeme, i.e. something possibly followed by whitespace), then it recurses via a fresh function call, and the program will keep expanding the definition – just like the acyclic above:
l :: P a -> P a
l p = p <|> l p <* sat isSpace
The fix (sic!) is to make sure that the recursive call is using the parser we are currently defining, which we can easily do with a local definition:
l :: P a -> P a
l p = p'
  where p' = p <|> p' <* sat isSpace
With this little fix, the parser can parse the example grammar:
ghci> parse bnf numExp
Just [("term",[[NonTerm "sum"]]),("pdigit",[[Lit "1"],…
Going meta
The main demonstration is over, but since we now already have a parser for grammar descriptions at hand, let’s go a bit further and dynamically construct a parser from such a description. The parser should only accept strings according to that grammar, and return a parse tree annotated with the non-terminals used:
interp :: BNF -> P String
interp bnf = parsers M.! start
  where
    start :: Ident
    start = fst (head bnf)

    parsers :: M.Map Ident (P String)
    parsers = M.fromList [ (i, parseRule i rhs) | (i, rhs) <- bnf ]

    parseRule :: Ident -> RuleRhs -> P String
    parseRule ident rhs = trace <$> asum (map parseSeq rhs)
      where trace s = ident ++ "(" ++ s ++ ")"

    parseSeq :: Seq -> P String
    parseSeq = fmap concat . traverse parseAtom

    parseAtom :: Atom -> P String
    parseAtom (Lit s)     = traverse tok s
    parseAtom (NonTerm i) = parsers M.! i
Let’s see it in action (full code in BNFEx.hs):
ghci> Just bnfp = parse bnf numExp
ghci> :t bnfp
bnfp :: BNF
ghci> parse (inter
interact   interp
ghci> parse (interp bnfp) "12+3*4"
Just "term(sum(prod(atom(num(pnum(pnum(pdigit(1))digit(pdigit(2))))))+sum(prod(atom(num(pnum(pdigit(3))))*prod(atom(num(pnum(pdigit(4)))))))))"
ghci> parse (interp bnfp) "12+3*4+"
Nothing
It is worth noting that the numExp grammar is also left-recursive, so implementing interp with a conventional parser combinator library would not have worked. But thanks to our propriocept trick, it does! Again, the sharing is important; in the code above it is the map parsers that is defined in terms of itself, and this ensures that the left-recursive productions work.
Closing thoughts
I am using unsafePerformIO, so I need to justify its use. Clearly, propriocept is not a pure function, and its type is a lie. In general, using it will break the nice equational properties of Haskell, as we have seen in our experiments with cons.
In the case of our parser library, however, we use it in specific ways, namely to feed a fresh name to memoise. Assuming the underlying parser library’s behavior does not observably depend on where and with which key memoise is used, this results in a properly pure interface, and all is well again. (NB: I did not investigate if this assumption actually holds for the parser library used here, or if it may for example affect the order of parse trees returned.)
I also expect that this implementation, which will memoise every parser involved, will be rather slow. It seems plausible to analyze the graph structure and figure out which memoise calls are actually needed to break left-recursion (at least if we drop the Monad instance or always memoise >>=).
If you liked this post, you might enjoy reading the paper about rec-def, watching one of my talks about it (MuniHac, BOBKonf, ICFP23; the presentation evolved over time), or, if you just want to see more about how things are laid out on Haskell’s heap, my screencasts exploring the Haskell heap.
Have something to say? You can post a comment by sending an e-Mail to me at <mail@joachim-breitner.de>, and I will include it here.