Learning Parser Combinators: Part 1
Today, I decided I would begin learning how to use parser combinators. Now I’ve been floating around PLDev communities for quite a long time now, and everyone under the sun who enjoys declarative and functional programming has told me to try parser combinators, that they would be one of the best things that I have ever used in terms of extensible parsing. Now admittedly I have always kinda blown this off, and that’s not to say I haven’t tried it before. When I attempted to learn these combinators, I was told to start with Parsec, a very lovely Haskell package, and it really is quite beautifully made.
But there’s a problem here, and it’s that, historically, me and Haskell really don’t get along. So I gave it a shot anyways, and of course I didn’t enjoy it at all. So I thought “alright, let’s try OCaml then!” And… Yeah, it didn’t really work out either. Angstrom is an amazing library just like Parsec is, but again as I found out, OCaml and I really don’t play nicely with each other, which probably won’t make sense once you see where I go after this.
So I had to sit back and reevaluate on my experiences recently, as all of that was about three years ago. I had to think, just what functional programming languages do I actually get along with here? The answer is two languages: Erlang and F#. In the past those are the only two functional languages I have ever enjoyed using, as I have kind of kept to myself in imperative-land. Originally I had looked at FParsec as the library to learn with, but seeing as the last update was four years ago (which isn’t inherently a bad thing!) and there is still some quite notable jank with the library, I instead looked at XParsec, which is basically a more modernized version of FParsec.
A Simple Config Format
So I decided to make a simple config format, not just because it was the example in the XParsec docs, but also because it is much easier of a thing to handle, print, and make an AST out of. But this article isn’t a tutorial, it is more a detail of my current thought process, and what code it took to get there.
1open System
2open XParsec
3open ConfigFormat.Parser
4open ConfigFormat.Types
5
6let configText =
7 """
8# My cool config format
9name = "XParsec"
10version = 1.2
11is_beta = true
12"""
13
14let result = Reader.ofString configText () |> pConfigFile ()
15
16match result with
17| Ok success ->
18 printfn "Success!"
19
20 for kvp in success do
21 printfn $"- {kvp.Key}: {kvp.Value}"
22| Error err ->
23 let errorMsg = ErrorFormatting.formatStringError configText err
24 printfn "Error parsing config:\n%s" errorMsg
This is the entrypoint to my simple little config format, there isn’t too much to actually talk about here, it’s just some code with some pattern matching and some sample code. But we can get into the meat and potatoes here:
1namespace ConfigFormat.Types
2
3type ConfigValue =
4 | String of string
5 | Float of float
6 | Bool of bool
7
8type KeyValuePair = { Key: string; Value: ConfigValue }
This is the world’s simplest AST, and if you’re like me, who comes from languages like C++ and D, you probably looked at this and said something along the lines of “no way!” But I can actually say here, maybe my friends were actually onto something when they told me functional programming languages were better than imperative ones for writing compilers. There also really isn’t too much to talk about here.
1module ConfigFormat.Parser
2
3open System
4open ConfigFormat.Types
5open XParsec
6open XParsec.Parsers
7open XParsec.CharParsers
Here is what we import. Is it that important to know? No, but I figured I would show what got imported before I get into the meat of this parser.
1// Parser for a valid identifier in our config format
2let pIdentifier () =
3 many1Chars (satisfy (fun c -> Char.IsLetterOrDigit c || c = '_'))
4
5// A parser for our string literals that are enclosed within
6// double quotes.
7let pQuotedString () =
8 between (pchar '"') (pchar '"') (manyChars (noneOf [ '"' ]))
9
10// Our parser for comments, they start with an octothorpe and
11// continue until the end of the line.
12let pComment () =
13 pchar '#' >>. skipMany (satisfy (fun c -> c <> '\n'))
14
15// Whitespace helper
16// Apparently, `<|>` is the "choice operator" and if the left
17// parser fails, it tries the right one.
18let pWhitespaceOrComment () = skipMany (spaces1 <|> pComment ())
Now here is a reminder that I’m also relearning modern F# while writing this, I noticed that in
module space you can’t have a generic value that may or may not be used as a function value, so
you either need to specify the type signature yourself, or add the unit argument, shown as (),
in front of the definition, effectively turning it into a function. Is there a better way to do
this? Probably.
To start, pIdentifier is our identifier parsing/grammar rule. It takes one or more characters that
consist of [A-Za-z0-9_] (sorry about the regex!) and matches them, returning a Parser object.
The next rule is pQuotedString, which takes zero or more characters, none of which can consist of
", that are all between two double quotes, and matches them, doing the same thing as the above
rule.
The next rule, pComment, is pretty self explanatory. It matches an octothorpe and then matches
everything until the end of the current line.
Finally we have pWhitespaceOrComment, which matches any sort of whitespace or comment.
Okay, so you may be reading this and be asking if I really needed these all split using line breaks, and your answer is no, I didn’t, but it feels like it gets the point I’m pushing for across better. You may be slowly realizing right now what the name “parser combinators” actually means, and you may also see the similarities between these rules and any sort of BNF/EBNF formal grammar laying around. I slowly realized what the name actually meant as I got to around this point, you create a bunch of little parsers, that parse their own things, and at the end you combine them all together. Get it? Parser Combinators? Parsers that get combined into one big and complex parser from a bunch of smaller parsers? Yeah, makes sense now.
Value Combinators
Up next are all of our value combinators, which are basically just parsers that propagate into one of our value types and returns them.
1// Use the map operator `|>>` to transform the parsed result
2// into our `ConfigValue`.
3let pString () = pQuotedString () |>> ConfigValue.String
4
5let pFloat () =
6 pfloat .>> notFollowedBy (satisfy (fun c -> Char.IsLetter c))
7 |>> ConfigValue.Float
8
9// `pstring "true" >>% true` parses the string "true" and returns true
10let pBool () =
11 pstring "true" >>% true <|> (pstring "false" >>% false) |>> ConfigValue.Bool
12
13// `choice` tries a list of parsers in order until one works
14let pValue () =
15 choice [ pString (); pFloat (); pBool () ]
All of these rules apply to each of our types that we defined earlier. They use our previous parsing
rules that we defined and then map the value or values that come from those parsers to an object
that we can return. But you may be noticing one thing: where is the KeyValuePair type?
1let pKeyValuePair () =
2 parser {
3 let! key = pIdentifier ()
4 do! pWhitespaceOrComment () >>. pchar '=' >>. pWhitespaceOrComment ()
5 let! value = pValue ()
6 return { Key = key; Value = value }
7 }
This combines our smallar grammars into a larger grammar for our KVPairs, and just by reading and seeing this code you can probably really begin to get what I mean by “combinators.” It’s a bit more complex than the other rules, but it is mainly just a combination of them, which is why I don’t think I need to expand upon this code too much, as it actually reads quite similarly to EBNF or even English.
1// `many` parses zero or more `pKeyValuePair`s
2let pConfigFile () =
3 between (pWhitespaceOrComment ()) (pWhitespaceOrComment ()) (many (pKeyValuePair () .>> skipNewline))
4 .>> eof
This is the root terminal rule of our configuration language’s grammar. It basically tells the parser “parse any amount of key-value pairs separated by newlines, between any amount of comments or whitespace, consume all input, and return the parsed results,” and I can’t lie, it’s starting to remind me of array programming and even concatenative programming. Hmm, maybe I should write some posts about concatenative programming.
Combinators?
Combinators are amazing, and I originally thought they were way too overhyped/overblown as a good way of writing a compiler, and as someone used to PEGs or EBNF libraries, I was almost about to refuse to learn these. I am really glad I did, and I was definitely impressed.
I have no idea how many parts I’m going to do of this, but hopefully it is a lot. Either way, I thank you for reading this, and I hope to see you in the next post!
(You can find the code’s repo here.)