Creating a parser with Parse::RecDescent

Thanks God, configuration files are everywhere, so that you don't have configuration information hardcoded into your programs anymore. And if you have… well, at least you know it's bad practice, don't you?

Anyway, configuration files need to be parsed to extract the information they hold. If the format is a well-known one, you will probably find a library for your favourite language that will do the work for you. If it's not, but it's line-based and simple enough, then writing a parser may be easy. If it's an XML file, things get more complicated but still doable: just use one of the many parsing library around, look for the relevant XML elements, and you're done. Or, if it is up to you to choose the file format, you may use one that is tailored for your configuration library of choice –mine happens to be the AppConfig Perl module.

But, what to do when you fall out of the easy cases? …In my specific case, I had a 350kB file that had a pseudo-ini format. It was an ini file, to some extent, since it had section declarations like this:

[section_name]

and key/value associations like this:

	parameter=value
	values="may also be quoted"

It could have comment lines

	; like this

and it allowed for blank lines.

The difference was in that it admitted multiple assignments to the same parameter, resulting in an array of values for that parameter:

	parameter=value1
	parameter=value2
	parameter=value3
	
	; that would result in parameter = ( value1, value2, value3 )

But hold on, there is more. The biggest challenge was that a parameter could be assigned a full Pike multivalue data structure, that is an array:

	parameter=({ value1, value2, value3 })

or a mapping (key/value pairs):

	parameter=([ "key1" : "value1", "key2" : "value2" ])

Not enough? Well, there's more: these structures could be multilined and nested!!! As in:

	parameter=({
	            ([
	                "key1" : "value1",
	                "key2" : ([ "array", "value" ]),
	            ]),
	            "second element of this array",
	            ({
	                "and here is another array",
	                ({
	                    "with another one nested",
	                }),
	                ([
	                    "that" : "contains",
	                    "one"  : "more",
	                    "hash" : "value",
	                ]),
	            }),
	            "Hooray!",
	          })

Ouch! Nested data structures and multiline values! A nightmare, isn't it?

So, I needed to parse this file. I was able to describe the syntax of the file –in fact, that's what I just did, didn't I? So, what I needed was to write a formal description of this syntax (that is: a grammar), and throw it to a parser generator. I remembered about that class I took back in 1999 about the Perl module Parse::RecDescent… ah, those fond memories of the Open Source Software Convention in the US!!!

After 11 years, it was finally time to apply what I learnt in that class.

But, as it turned out, Parse::RecDescent is a quite powerful module, and properly using it can get as complicated. That's why I decided to write down this step-by-step article on how to use it. Of course, some knowledge of Perl is assumed, as I will liberally talk about Perl data structures, and references, and other Perl things without any explaination.

So, the first thing we need is a formal description of the file syntax, that is: a grammar. Let's start.

This kind of ini file contains 0 or more lines, until the end of the file. I'll express it like this:

AsIni: Line(s?) /Z/

But what's a Line like? Well, it could be a comment line, or a blank line, or a section declaration, or an assignment line. No other line types are allowed in this file:

Line:   CommentLine
                | BlankLine
                | SectionDeclaration
                | AssignmentLine
                | <error>

Oh, nice! Now, what's a CommentLine like? Well, it may have some heading whitespace, then it has a semi-colon, then anything up the end of the line. I'll write it this way:

CommentLine:    <skip: q{}> /^s*/m ';' /.*$/m

What's that skip thing? Well, by default Parse::RecDescent will isolate the "tokens" that compose the file, throwing away the exceeding whitespace. If we want to express that a line has heading whitespace, we need to change this behaviour when we scan for that line. We'll use the same trick as we scan for a blank line:

BlankLine:      <skip: q{}> /^s+$/m

And… what's that "m" at the end of the pattern matching? This is because you feed the text to the parser as a single string, that will likely have newlines embedded in it. The "m" is a pattern matching modifier that, in a sense, will tell Perl to match start/end of line inside that string.

What's a section declaration line like?

SectionDeclaration:     '[' /[^]]+/ ']'

That is: a literal '[', then a string that doesn't contain a closing bracket, then a closing bracket. It was easy.

Let's get to the assignment line:

AssignmentLine: Parameter '=' Value(?)

That is: a parameter name, a literal equal sign, and a value, or no value (which will mean that the parameter is an undef). "(?)" is a repetition, that means "0 or 1 of this".

OK, we expressed the Assignment line, but what's a parameter like?

Parameter:      /w[ws-]*/

That is: it starts with a "word character" (think of it as an alphanumeric character, plus the underscore), that may be followed by a sequence of other word characters, or "space characters" (tabs, whitespace…) or dashes.

The value, as we said, could be a pike structure or a string.

Value:          PikeStructure   | ValueString

Let's see the strings, they are much easier. We said that they could be quoted or unquoted, and we need to express the two concepts:

ValueString:    QuotedString    | UnquotedString

QuotedString:   '"' /[^"]+/ '"'

UnquotedString: /.+/

That means: a quoted string starts and ends with a double quote and contains "something quoteless". An unquoted string is just anything. Note that we are trying to match the quoted string first, otherwise the unquoted string would match even if it was not intended to. This also gives me the opportunity to say that the parsers produced by Parse::RecDescent are "first match" ones, that is: the first of many possible rules that matches, wins. That's why you should try to match the more specific rules first.

And let's get to the PikeStructure now. If you didn't have coffee, go get one and come back because this will be long and a bit complicated. Just go, I'll wait.

Back? OK. The first level of definition is quite easy indeed, and so is the second:

PikeStructure:  PikeArray | PikeMapping

PikeArray:      '({' PikeArrayContent '})'

PikeMapping:    '([' PikeMappingContent '])'

Let's work on the arrays first. The content of an array could be:

  1. no content (yes, an empty array!)
  2. one element
  3. two or more elements, separated by commas

Besides, a trailing comma will be allowed. We'll call the comma "PikeStructureSeparator", and the sequence of array values will be called "PikeArraySequence". Let's see them:

PikeStructureSeparator: ','

PikeArrayContent:       PikeArraySequence(?) PikeStructureSeparator(?)

So, we may have zero or one PikeArraySequence's (why zero or one? Keep reading!), and zero or one trailing PikeStructureSeparator's. Incidentally, this allows for an empty array with just a comma inside, but we'll pretend it's OK 😉

Now let's formally define the sequence from the human-readable description we have a few lines above:

PikeArraySequence:      PikeValue PikeArrayFurtherValue(s?)

This means: at least one PikeValue, plus zero or more further values. Now you start to see the strategy here: the zero/one repetition of "PikeArraySequence(?)" will take care of the case where the array is empty, the single PikeValue will take care for the one-element array, and the PikeArrayFurtherValue(s?) for all the rest.

What is a PikeValue, anyway?

PikeValue:              PikeStructure | QuotedString | Number

A-ha! So a PikeValue is: another PikeStructure, or a QuotedString, or a Number… wait! What is a number?

Number: /[+-]?d*(.d+)?/ <reject: $item[1] eq ''>

Hmmmm… this means it may have a sign, then it may have digits, and then it may have a dot and some more trailing digits. OK, but this also will match a null string. Do we want to match this? No, a null string is not a number. We express that with the reject directive. That means: if the matched string is a null string, then I haven't matched.

And now we know what is a PikeValue. What is a PikeArrayFurtherValue? Well, if you think about this, it's just a matter of keep listing PikeValue's after the first one, so:

PikeArrayFurtherValue:  PikeStructureSeparator PikeValue

Now let's go back and you will see that a PikeArraySequence is a PikeValue, that may be followed by zero or many ", Pikevalue"s…

And that's it for the arrays!!! Let's turn to the mappings now. We already learnt something with the arrays, so the mappings will need less explaination.

PikeMappingContent:     PikeMappingSequence(?) PikeStructureSeparator(?)

PikeMappingSequence:    PikeMappingPair PikeMappingFurtherPair(s?)

PikeMappingPair:        QuotedString ':' PikeValue

PikeMappingFurtherPair: PikeStructureSeparator PikeMappingPair

Basically, a PikeMapping is a sequence of zero or more pairs of "QuotedString : PikeValue", that may have a trailing comma.

Let's put it all together.

AsIni: Line(s?) /Z/

Line:   CommentLine
                | BlankLine
                | SectionDeclaration
                | AssignmentLine
                | <error>
                
CommentLine:    <skip: q{}> /^s*/m ';' /.*$/m

BlankLine:      <skip: q{}> /^s+$/m

SectionDeclaration:     '[' /[^]]+/ ']'

AssignmentLine: Parameter '=' Value(?)

Parameter:      /w[ws-]*/

Value:          PikeStructure   | ValueString

ValueString:    QuotedString    | UnquotedString

QuotedString:   '"' /[^"]+/ '"'

UnquotedString: /.+/

PikeStructure:  PikeArray | PikeMapping

PikeArray:      '({' PikeArrayContent '})'

PikeMapping:    '([' PikeMappingContent '])'

PikeStructureSeparator: ','

PikeArrayContent:       PikeArraySequence(?) PikeStructureSeparator(?)

PikeArraySequence:      PikeValue PikeArrayFurtherValue(s?)

PikeValue:              PikeStructure | QuotedString | Number

Number: /[+-]?d*(.d+)?/ <reject: $item[1] eq ''>

PikeArrayFurtherValue:  PikeStructureSeparator PikeValue

PikeMappingContent:     PikeMappingSequence(?) PikeStructureSeparator(?)

PikeMappingSequence:    PikeMappingPair PikeMappingFurtherPair(s?)

PikeMappingPair:        QuotedString ':' PikeValue

PikeMappingFurtherPair: PikeStructureSeparator PikeMappingPair

You may read a tree-like or graph-like structure in the grammar, and you are actually right: you start from a "root" (AsIni in our case) and you go down deep into the structure of the file in a tree-like manner. That's how actually Parse::RecDescent works, starting from the most general rule and then trying to match the subrules, going deeper and deeper; if a branch fails, then the parser backtracks to the nearest branch and goes down again, and so on.

So we have a grammar, and if we ask Parse::RecDescent to digest it and parse our pseudo-ini file, it will actually parse it. But it will do nothing with it. Why? Well, you may know that computers are stupid machines indeed, and if you don't tell them to do something, they'll assume they have nothing to do! We must explicitly say to Parse::RecDescent what to do once it matches a rule. You do this by attaching actions to the rules of your grammar, that is: snippets of perl code.

Although you may do anything in your actions, I found that the best thing for my specific case was to let the matched values "bubble" themselves up in the tree, and manage them when they reached the AssignmentLine rule. I will use a nested hash structure to store the values, where the first level is the section name, the second is the parameter name, and the third is the array of values for that parameter, that is:

  $AsIni::node{$section_name}{$parameter_name}[$value_index]

In order to do this, you need to know a few more things.

The first, the matched items in a rule are automatically placed in an array structure called @item and in an hash structure called %item. We'll care just about the @item in this example.

What happens is that when the string

"ParseMe"

is matched by the rule

QuotedString:   '"' /[^"]+/ '"'

then:

  • $item[0] will hold the rule name, that is: QuotedString
  • $item[1] will hold the value of the literal '"', that is… well, "
  • $item[2] will hold the string matched by the pattern, that is: ParseMe
  • $item[3] will hold the value of the literal '"' again

This slightly changes if repetitions are involved –I mean those "(?)", "(s?)" stuff following subrules. In that case, the corresponding @item elements will actually hold a reference to an array of the matched values. That is, when the line

	cheese=pecorino

is matched by the AssignmentLine rule, then:

  • $item[0] will hold the rule name, that is: 'AssignmentLine'
  • $item[1] will hold the Parameter value, that is: "cheese"
  • $item[2] will hold the literal '=' value, that is, well: "="
  • $item[3] will hold a reference to an array of the matched Value's. In this case, and in Perl parlance, it will be [ "pecorino" ]

Last, $return is a special variable that will hold the value you want the action to return if the whole rule matches. If it doesn't, then $return's value is discarded. This is quite a convenient thing indeed, because if you tried to store matched values on the go, you may get unexpected results if the more general rule fails and the parser backtracks to another rule –you'd have the matched values added to your data structures even if the general rule failed… not good!

Now that we know this, let's start to "bubble up" values, starting from the leaves of the tree: QuotedString, UnquotedString, and Number. That's quite easy indeed:

QuotedString:   '"' /[^"]+/ '"'
  {
    $return = $item[2] ;
  }

UnquotedString: /.+/
  {
    $return = $item[1] ;
  }

# This rule matches a number, but rejects null-length results
Number: /[+-]?d*(.d+)?/ <reject: $item[1] eq ''>
  {
    $return = $item[1] ;
  }

While for rules with the (?) repetition, we may need to unwrap the value from the array:

PikeMappingContent:     PikeMappingSequence(?) PikeStructureSeparator(?)
  {
    # Since we have a repetition here, $item[1] is a reference to an
    # array which may contain 0 or 1 PikeMappingSequence's.
    # In turn, PikeMappingSequence returns an hash reference.
    # So, if we want the hash reference to bubble up, we have to
    # unwrap it and return it as is.

    ( $return ) = @{ $item[1] } ;
  }

…and so on. When we match a section name we will save its name in a package variable for later:

SectionDeclaration:     '[' /[^]]+/ ']'
  {
    print STDERR qq{In section "$item[2]"n} ;

    my $sectionname = $item[2] ;

    $AsIni::section = $sectionname ;
  }

And when finally the AssignmentLine rule matches, we'll save the values into the %AsIni::node hash:

AssignmentLine: Parameter '=' Value(?)
  {
    my $distvalue = $item[3] ;
    my $parmname  = $item[1] ;
    my $paramvalue ;

    ( $paramvalue ) = @$distvalue ;

    if ( not exists $AsIni::node{$AsIni::section}{$parmname} ) {
      $AsIni::node{$AsIni::section}{$parmname} = [] ;
    }

    # Get a reference to the current array of values for this parameter
    # in $current
    my $current = $AsIni::node{$AsIni::section}{$parmname} ;

    # We can update this safely, since we are using the reference
    push @$current,$paramvalue ;

  }

The final version of the grammar is the following:

# $Id: g3.txt,v 1.14 2010/07/23 12:41:24 bronto Exp bronto $

AsIni: Line(s?) /Z/

Line:	CommentLine
		| BlankLine
		| SectionDeclaration
		| AssignmentLine
		| <error>


CommentLine:	<skip: q{}> /^s*/m ';' /.*$/m
  {
    print STDERR qq{tSkipping comment: $item[4]n} ;
  }

BlankLine:	<skip: q{}> /^s+$/m
  {
    print STDERR qq{tSkipping blank linen}
  }

SectionDeclaration:	'[' /[^]]+/ ']'
  {
    print STDERR qq{In section "$item[2]"n} ;

    my $sectionname = $item[2] ;

    $AsIni::section = $sectionname ;
  }

AssignmentLine:	Parameter '=' Value(?)
  {
    my $distvalue = $item[3] ;
    my $parmname  = $item[1] ;
    my $paramvalue ;

    ( $paramvalue ) = @$distvalue ;

    if ( not exists $AsIni::node{$AsIni::section}{$parmname} ) {
      $AsIni::node{$AsIni::section}{$parmname} = [] ;
    }

    # Get a reference to the current array of values for this parameter
    # in $current
    my $current = $AsIni::node{$AsIni::section}{$parmname} ;

    # We can update this safely, since we are using the reference
    push @$current,$paramvalue ;

  }

Parameter:	/w[ws-]*/
  {
    $return = $item[1] ;
  }

Value:		PikeStructure	| ValueString
  {
    $return = $item[1] ;
  }

ValueString:	QuotedString	| UnquotedString
  {
    $return = $item[1] ;
  }

QuotedString:	'"' /[^"]+/ '"'
  {
    $return = $item[2] ;
  }

UnquotedString:	/.+/
  {
    $return = $item[1] ;
  }

# This rule matches a number, but rejects null-length results
Number:	/[+-]?d*(.d+)?/ <reject: $item[1] eq ''>
  {
    $return = $item[1] ;
  }

PikeStructure:	PikeArray	| PikeMapping
  {
    # $item[1] is a reference to an array (PikeArray) or hash
    # (PikeMapping). We bubble it up as is
    $return = $item[1] ;
  }

PikeArray:	'({' PikeArrayContent '})'
  {
    # $item[2] is a PikeArrayContent, and since PikeArrayContent
    # bubbles up a reference to an array of PikeValues, this should be
    # an array reference that we can safely bubble up as is.
    $return = $item[2] ;
  }

PikeMapping:	'([' PikeMappingContent '])'
  {
    # $item[2] is a PikeMappingContent, and since PikeMappingContent
    # bubbles up an hash reference, this should be an hash reference that we
    # can safely bubble up as is.
    $return = $item[2] ;
  }

PikeStructureSeparator:	','

PikeArrayContent:	PikeArraySequence(?) PikeStructureSeparator(?)
  {
    # $item[1] comes from a repetition of PikeArraySequence,
    # so it is a reference to an array of 0 or 1 PikeArraySequence.
    # In turn, PikeArraySequence is a reference to an array of
    # PikeValue's. We don't want to change the PikeValue's but we need
    # to unroll $item[1] before bubbling it up.
    ( $return ) = @{ $item[1] } ;
  }

PikeArraySequence:	PikeValue PikeArrayFurtherValue(s?)
  {
    # $item[1] is a PikeValue, hence:
    # - a reference to an array or hash (if PikeStructure)
    # - a scalar (if QuotedString or Number)
    #
    # $item[2] comes from a repetition of PikeArrayFurtherValue,
    # so it is a reference to an array of 0 or 1 PikeArrayFurtherValue.
    # In turn, PikeArrayFurtherValue just returns a PikeValue. So,
    # we actually don't want to change $item[1], but we need to
    # unroll $item[2] before returning it. Actually, we return an
    # array reference with the whole thing.
    $return = [ $item[1], @{ $item[2] } ] ;
  }

PikeArrayFurtherValue:	PikeStructureSeparator PikeValue
  {
    # $item[2] is a PikeValue, hence:
    # - a reference to an array or hash (if PikeStructure)
    # - a scalar (if QuotedString or Number)
    # We bubble it up as is.
    $return = $item[2] ;
  }

PikeMappingContent:	PikeMappingSequence(?) PikeStructureSeparator(?)
  {
    # Since we have a repetition here, $item[1] is a reference to an
    # array which may contain 0 or 1 PikeMappingSequence's.
    # In turn, PikeMappingSequence returns an hash reference.
    # So, if we want the hash reference to bubble up, we have to
    # unwrap it and return it as is.
    
    ( $return ) = @{ $item[1] } ;
  }

PikeMappingSequence:	PikeMappingPair PikeMappingFurtherPair(s?)
  {
    # $item[1] is a PikeMappingPair, hence a reference to an array
    # of two elements: a string and a PikeValue, that is:
    # - a reference to an array or hash (if Pikevalue ~ PikeStructure)
    # - a scalar (if PikeValue ~ QuotedString or Number)
    #
    # $item[2] has a repetition, so it is a reference to an array of
    # PikeMappingFurtherPair's. Since PikeMappingFurtherPair just
    # returns a PikeMappingPair (see $item[1]), then $item[2] is
    # a reference to an array where each element is, in turn, a
    # reference to an array of two elements.
    #
    # Since we are going to return an hash here, we create a reference
    # to an hash; to correctly unroll the values of $item[1] and
    # $item[2] we:
    # - simply dereference $item[1], hence unrolling the only hash
    #   pair the array contained
    # - we dereference $item[2], getting an array of arrays, and
    #   then we use map to further unroll the key/value pairs
    #
    # We then bubble up the outcome
    $return = { @{ $item[1] } , map( @$_ , @{ $item[2] } ) } ;
  }

PikeMappingPair:	QuotedString ':' PikeValue
  {
    # $item[1] is a scalar (QuotedString)
    # $item[3] is a PikeValue, hence:
    # - a reference to an array or hash (if Pikevalue ~ PikeStructure)
    # - a scalar (if PikeValue ~ QuotedString or Number)
    # We throw them up together as a single entity: a reference to an array
    $return = [ $item[1], $item[3] ] ;
  }

PikeMappingFurtherPair:	PikeStructureSeparator PikeMappingPair
  {
    # $item[2] is a PikeMappingPair, hence a reference to an array
    # containing a QuotedString (the first) and a PikeValue, hence:
    # - a reference to an array or hash (if Pikevalue ~ PikeStructure)
    # - a scalar (if PikeValue ~ QuotedString or Number)
    $return = $item[2] ;
  }

PikeValue:		PikeStructure | QuotedString | Number
  {
    # $item[1] is:
    # - a reference to an array or hash (if PikeStructure)
    # - a scalar (if QuotedString or Number)
    $return = $item[1] ;
  }

You may want to test it against a big file you'll make yourself, or maybe against this small sample snippet:

[section1]
  onehash=([ "onekey" : "onevalue" ])
  hashtest=([ "key1" : "val1", "key2" : "val2" ])
  aohtest=({
    ([ "hash1key1": "hash1val1" ]),
    ([
      "hash2key1": "hash2val1",
      "hash2key2": "hash2val2",
    ]),
    })
  string=sample string
  onearray=({ "onevalue" })

And you'll probably need a sample test script, right? This is a quick-and-dirty one, that will read the grammar from a file named g3.txt and the ini file from mockup.ini, all in the current directory.

#!/usr/bin/perl

# $Id: test.pl,v 1.8 2010/07/23 13:00:09 bronto Exp bronto $

use strict ;
use warnings ;

use Parse::RecDescent ;
use Data::Dumper ;

########################################################################
package AsIni ;

our %node ;
our $section ;
our $parmname ;
our $content ;

########################################################################
package AsIni::Parser ;

push @ARGV,0 unless @ARGV > 0 ;

eval { use constant DEBUG => $ARGV[0] } ;
die $@ if $@ ;

if (DEBUG >= 1) {
  $::RD_HINT  = 'true' ;
  $::RD_WARN  = 'true' ;

  $Data::Dumper::Indent = 1 ;

  if (DEBUG >= 2) {
    $::RD_TRACE  = 'true' ;
    $::RD_ERRORS = 'true' ;
  }
}

my $grammar = q{} ;
my $asini   = q{} ;
{
  local $/ ;
  undef $/ ;

  open my $grammarfh, '<', 'g3.txt' or die ;
  $grammar = <$grammarfh> ;


  open my $asinifh, '<', 'mockup.ini' or die ;
  $asini   = <$asinifh> ;
}

my $parser = Parse::RecDescent->new($grammar) ;
die if not defined $parser ;

my $parser_outcome = $parser->AsIni($asini) ;

print Dumper ( $parser_outcome ) if DEBUG >= 3 ;

print Dumper ( %AsIni::node  )  if DEBUG >= 0 ;

You may pass the script a debug level ranging from 0 to 3 to get some debugging information from the script or from Parse::RecDescent itself. Read in the manual about the RD_* variables.

References

  1. Parse::RecDescent tutorial by Damian Conway
  2. Documentation for Parse::RecDescent

Acknowledgements

The tree image in this post belongs to this site.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s