Parsely Users Manual -- for version 0.1.0.

Last updated 10 Apr 2000.
(c)2000 Nick Mathewson.  I'll probably release this document under the
GNU Free Documentation License, as soon as I understand it. ;)  For now,
treat it as being under the GPL.

Please contact nickm@mit.edu with any questions.

Contents

1. Introduction, and a simple example.
1.1 An Example
1.1.1 Analyzing the file format 
1.1.2 Writing the description file
1.1.3 Using the Parsely module.
2. Advanced features: scanning and parsing 
2.1 Token defaults
2.2 Scanner states
2.3 Patterns
2.4 Additional parser rules: 'separated' and 'terminated'
2.5 Actions
2.6 Macros
2.7 Global options
2.8 Details about the scanner and parser
3. Advanced features: manipulating nodes

1. Introduction, and a simple example.

Unix configuration tools are hard to write, and harder still to write
well.  One of the largest problems is that most tools, daemons, and
system services all store their configuration data in different
text-based formats.  A good configuration tool must be able to read
these files, modify them as needed, and write them back without
changing the user's previous comments and formatting. Parsely is a
tool that automates this process.

You can use Parsely in your own projects as follows:

    1. Write a description of the file format.  Parsely will use this
       description to generate a scanner and parser that will read the
       files you want to parse, and expose the parse tree as a data
       structure.

    2. Write code to manipulate the data tree, in order to make the
       changes needed by your configuration tool.

Parsely will take care of whitespace and comments for you, and make
sure that you don't change any parts of the file unintentionally.

[Parsely is currently implemented in Python, but will be ported to C by
the end of May.  This manual assumes that you know Python, and that
you're also familiar with Perl's regular expression syntax.]

1.1 An Example

SSH is a tool that lets users communicate securely across an untrusted
network.  It has a relatively simple configuration file format, with
more options than most of its users can understand, much less use
effectively.  Let's write the backend for an SSH configuration tool.

1.1.1 Analyzing the file format

First, we need to consider the format we want to be able to handle.
The configuration file looks a bit like this:

       # These options apply to all hosts
       Keyword1  Option1
       Keyword2 = Option2

       # These options also apply to all hosts
       Host *
          Kwd3 Option3
          Kwd4 Option4

       # These options only apply to the machine 'frunobulax.seul.org'
       Host frunobulax.seul.org
          Kwd5 = Option5.1 Option5.2

As we see above, the file first has a number of options not attached to
any host, and then has a number of Host sections.  Each section contains
a number of option declarations.

Option declarations are of the form:
    Keyword Value
or of the form:
    Keyword = Value
The exact format of 'Value' depends on the keyword in question.  Here,
we'll just treat every value as a string or a list of strings.

1.1.2 Writing the description file

Our second step is to write the file 'ssh_config.ply'.  In the listing
below, the text following the '#' marks consists of comments that
explain the meaning of each part of the file.

#### Beginning of ssh_config.ply  (by convention, we use the .ply extension.)
####
# Tokens
#
# First, we need to describe all the possible tokens that can occur
# in the configuration file.  We do this using 'token' declarations.
####

# This declares that 'Host' is a token matching a given pattern.  The
# regular expression syntax matches Perl's very closely.  Specifically,
# * You can write regular expressions as /x/, re/x/, m/x/, re{x},
#   m[x], and so forth.
# * You can follow the regular expression with 'i' or 'x' flags, which
#   mean the same as they do in Perl.
# * There are some differences between Parsely and Perl.  See section 2.8.
token Host = re/Host\b/i;

# You can also express tokens as string literals.  Parsely accepts strings
# in the following formats:
# * Python style: 'x', "x", '''x''', """x"""
# * Perl style: q(x), <<EOF x EOF
# Additionally, you can put an 'i' flag after a string to make it case-
# insensitive.
token NL = "\n";

# Instead of naming a string token, you can just write it as follows.  This
# is especially convenient for single-character tokens.
token "=";

# (In this file format, we don't have a separate token for every possible
#  option.  We just use the 'Word' token to handle every case but 'Host'.)
token Word = re/[^\s\n]+/;

####
# Space
# 
# Besides the tokens, Parsely's scanner also needs patterns for the space
# in a file.  
####

# SSH allows shell-style comments...
space /^#.*$/;
# ...And ordinary space.
space /\s+/;

# It's also useful to specify the 'default space' in a file so that 
# when we insert new spaces into the file, they have a reasonable appearance.
#
# By default, newly-inserted spaces will look like this:
default space " ";

####
# Grammar
# 
# Last, we specify the way that the tokens fit together to make an entire
# file.
####

# The 'start' statement declares the start symbol of the grammar.  This means
# that the entire file must be parsed as a 'File' symbol below.
start File;

# There are three basic types of rule declarations.  The first is 
# a sequence:  We declare that a 'File' is made of an 'OptEOL', an 'Options',
# and a 'Sections'. 
#
# We also name the 'Options' as 'initial', and the 'Sections' as
# 'hosts'.  This will allow us to later refer to the parts of a file 'f' as
# 'f.initial' and 'f.hosts'.
File = OptEOL Options:initial Sections:hosts ;

# The second kind of rule declaration is a list: we say that a 'Sections'
# node is a list of 0 or more 'Section' nodes.
#
# Acceptable multiplicities are:
#     *     (0 or more)
#     +     (1 or more)
#     ?     (0 or 1)
#     N...  (N or more, where N is an integer.)
Sections = Section*;

# We declare a few more rules to describe how individual sections will appear.
Section = HostDecl:hostDecl Options:body ;
HostDecl = Host Word:host EOL ;
Options = Option* ;

# The third kind of rule declaration is a set of alternatives.  This
# declares that an Option may take the form of one of four sequences.
# 
# The four sequences each have a tag, written as [tagName].  This allows
# us to refer to the sub-rules as 'Option:eqSingle', 'Option:eqMany', and
# so on.
Option = [eqSingle] Word:k "=" AnyWord:v  EOL |
         [eqMany]   Word:k "=" AnyWords:v EOL |
         [single]   Word:k     AnyWord:v  EOL |
         [many]     Word:k     AnyWords:v EOL;

# We have the AnyWord production because, in the SSH configuration
# format, the value of an option can itself be 'Host' or '='.  It's
# possible, for example, to have a host named 'Host', or an option
# value of '='.  This declaration solves that problem.
AnyWord = Word:val | Host:val | "=":val ;
# 2 or more instances of AnyWord.
AnyWords = AnyWord 2...;
EOL = NL+;
OptEOL = EOL?;

#### End of ssh_config.ply.

1.1.3 Using the Parsely module.

After we've written the ssh_config.ply description above, we can load
it within Python as follows:

   import parsely
   ssh_file_format = parsely.loadFormat('ssh_config')
   typesys = ssh_file_format.getTypeSystem()
     
Next, we can use it to load and parse a file:

   file = ssh_file_format.parseFile('/etc/ssh/ssh_config')

If this succeeds, we can manipulate 'file' as a nested structure of 
lists and objects:

   print file.hosts[0]               # Prints the first host section.
   file.hosts[0].hostDecl.host = '*' # Sets the host in the first section to *

   opt = typesys.newNode('Option:single')  # Make a new Option:single node
   opt.k = 'Cipher'                  # Set up the newly created node...
   opt.v = 'blowfish'
   file.hosts[0].body.append(opt)    # ... and append it to the first section.

   print file.dump(trailingSpace=1)  # Prints the entire file.

When we're done, we can flush the modified file back to disk:

   file.flush()

For a more complete example of how to manipulate ssh_config files, see
examples/ssh/ssh_config.py in the Parsely distribution.

2. Advanced features: scanning and parsing

This section describes more advanced scanning and parsing features that
were not covered in the previous section.  Before reading this material,
make sure you understand the example above.

2.1 Token defaults

When you're modifying a parsed file and ask Parsely to create a new
token node without specifying its value, Parsely will ordinarily guess
a default value for the token.  The guessed value is _usually_ within
reason, but sometimes you'd like to specify it yourself.

For example, if you define Word as follows:
       token Word = m/\w+/;
and create a new Word: 
       w = file_format.getTypeSystem().newNode('Word')
then the newly created word will have the value 'a'.  

If you'd like a different default value, you can specify it like this:
       token Word = m/\w+/ default "unspecified_word";
Now, newly created words will have the default value 'unspecified_word'.
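
Here's how the default plays out in Python (a sketch: 'myformat' is a
hypothetical format file containing the declaration above, and we assume
that get(), listed in section 3, returns a leaf node's value):
       import parsely
       fmt = parsely.loadFormat('myformat')
       w = fmt.getTypeSystem().newNode('Word')
       print w.get()                  # prints 'unspecified_word'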

2.2 Scanner states

Sometimes it is useful for the scanner to behave differently under
different circumstances.  For example, suppose that you're parsing a file
that looks like this:

     set a=b
     flush
     set poodle=mutant
     set thesecretword=set
     flush
     
Look closely at the fourth line.  Although 'set' is clearly a special
word in this file format, it can also behave as an ordinary word when
it doesn't appear at the beginning of the line.  If we wrote something like:
     Declaration = Set Word '=' Word NL |
                   Flush NL ;
then the fourth line above would be erroneously rejected.

We can solve this problem as in the ssh_config example above, using
something like this:
     AnyWord = Word | Set | Flush;
     Declaration = Set AnyWord '=' AnyWord NL |
                   Flush NL;

Another solution is to use scanner states, as shown below:

     # We begin in state 'COMMAND'.
     start state COMMAND;
     # There's also a state called 'VALUE'.
     state VALUE;

     # We only recognize 'Set' and 'Flush' tokens in the COMMAND state.
     # When we encounter one, we enter the VALUE state.
     token Set   = "set"   in COMMAND to VALUE;
     token Flush = "flush" in COMMAND to VALUE;

     # We only recognize 'Word' tokens in the VALUE state.  When we see
     # one, we stay in VALUE.
     token Word  = re/\w+/ in VALUE;
     token "="             in VALUE;

     # We recognize spaces in all states, and don't change our state on
     # seeing one.
     space /\s+/;
     # (We could also write this as: space /\s+/ in VALUE, COMMAND;)

     # Finally, we recognize newlines in any state.  When we see one,
     # we go back to COMMAND.
     token NL = "\n"       to COMMAND;

     # Now our rules are nice and simple!
     start File;
     File = Declaration*;
     Declaration = Flush NL              |
                   Set Word "=" Word NL  ;

* By default, token and space declarations with no "in" specifier are accepted
  in every state.  To create a state that will only accept tokens and spaces
  explicitly declared to appear in it, declare the state as 'exclusive':
       exclusive state S1;
       state S2;

       token "A" in S1;
       token "B" in S2;
       token "C";
  Now S1 will accept the token "A", while S2 will accept both "B" and "C".

* Exactly one state is the start state.  If no start state is declared, the
  scanner begins in a state called "INITIAL".

2.3 Patterns

Sometimes, it's inconvenient to declare a token or space all at once.  For
example, here is a pattern to match Python strings that borders on the
unreadable:

       token PyString = 
              re/'([^'\n\\]+|\\.)*'|"([^"\n\\]+|\\.)*"/;

One way to simplify it is to comment it using the 'x' flag:

       token PyString = re{ 
            ' ( # opening quote, followed by any number of:
               [^'\n\\]+  # characters other than ' \ and newline
              | \\.       # and escaped characters.
              )*
            ' # closing quote
          | " ( # opening quote, followed by any number of:
               [^"\n\\]+  # characters other than " \ and newline
              | \\.       # and escaped characters.
              )*
            " # closing quote
          }x;

But even with this approach, regular expressions can grow overlong.
To solve this problem, you can define named subpatterns, and then use them
to build up a complete regular expression:

       # Part of the body of a single-quoted string
       pattern SQBody   = re/[^'\n\\]+|\\./;
       # A single-quoted string
       pattern SQString = re/'                # An opening quote
                             {:SQBody:}*      # (Note the pattern reference!)
                             '/x;             # A closing quote
       # Analogous definitions for double-quote strings
       pattern DQBody   = re/[^"\n\\]+|\\./;
       pattern DQString = re/"{:DQBody:}*"/;
       token PyString   = re/{:SQString:}|{:DQString:}/;

This approach is especially useful when one subpattern is needed in several
token definitions.
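
For example (a sketch; the pattern and token names are illustrative):

       pattern Digits = re/[0-9]+/;
       token Int      = re/{:Digits:}/;
       token Float    = re/{:Digits:}\.{:Digits:}/;

Here, the Digits subpattern is shared by both token definitions.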

2.4 Additional parser rules: 'separated' and 'terminated'

In many file formats, elements commonly appear alternating with a
separator or terminator.  For example, you often see files like
       "Element1   comma     Element2   comma     Element3"
    or "Statement1 semicolon Statement2 semicolon".

You can declare rules to match this kind of production:
       ElementList   = comma separated Element +;
       StatementList = semicolon terminated Statement *;

The production with 'separated' will match 1 or more Element items
with commas between them.  The production with 'terminated' will match
zero or more Statement items with a semicolon _after_ each one.

As we've written the rules above, if you examined the members of an
ElementList el, you'd find that el[0] would be an element, el[1] would
be a comma, el[2] would be another element, and so forth.  Sometimes,
we'd like to ignore the commas.  To do this, we can use the
'exclusive' modifier:
       ElementList   = exclusive comma separated Element +;
       StatementList = exclusive semicolon terminated Statement *;

Now, the first element will be el[0], the second will be el[1], and
so forth.
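
Here's a sketch of walking an exclusive ElementList in Python (this
assumes 'el' is a parsed node; list nodes support the ordinary list
operations, as noted in section 3):

       for i in range(len(el)):
           print el[i]      # each entry is an Element; Parsely keeps
                            # track of the commas behind the scenes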

2.5 Actions

There are times during scanning and parsing when ordinary regular
expressions and parsing rules are not enough.  For example, consider
a simplified version of Perl's regular expressions:
                re(abcd(efg)hi(jkl))
We'd like to scan these elements as single tokens, but because we need to
count the parentheses to do so, regular expressions alone cannot help us.

Actions are the answer.  They allow us to attach pieces of code to
tokens and rules.  A solution to the regex problem follows:

        exclusive state REGEX;

        # When we see 're(', begin a new regular expression.
        # Enter the REGEX state, and run the begin_re action.
        token regexBegin = re/re\(/ to REGEX action begin_re;
        action begin_re in Python:
            self.parenCount = 1

        # When we see more stuff in a regular expression, check whether
        # it's a parenthesis, and attach it to the end of the last token.
        token regexMore = re/./ in REGEX action continue_re;
        action continue_re in Python:
            # The current match is accessible in self.match.
            if self.match == '(':
               self.parenCount = self.parenCount + 1
            elif self.match == ')':
               self.parenCount = self.parenCount - 1
               if self.parenCount == 0:
                   # Leave REGEX
                   self.enterState('INITIAL')
            # In any case, attach the current unit to the end of the last
            # token.
            self.accumulate()
             
To attach an action to a token, space, or rule, place 
'action actionName' at the end of its declaration.

To declare the action itself in Python, use one of the formats below:
        action actionName in python:
            BLOCK
    or
        action actionName (type) in python:
            BLOCK

If (type) is provided, it must be one of the following:
   * initScan/initParse: The action will run automatically at the
     beginning of scanning or parsing.
   * finishScan/finishParse: The action will run automatically at the
     end of scanning or parsing.
   * scanFn/parseFn: The action will be available as a function for other
     scanner/parser actions, even if it is never used by a token or rule.
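
For example, an initScan action can set up scanner fields before any
input is matched (a sketch; the action name and field are illustrative):

        action reset_count (initScan) in python:
            self.parenCount = 0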

The code for an action is compiled as though it were the body of a
method.  Code for scanner actions has access to the following fields
and methods:

 FIELDS   
   self.match
        The current match. (String)
   self.matchLen
        The length of the current match. (Integer)
   self.type
        The name of the current token, or None for a space. (String/None)
 METHODS
   self.getLineNumber()
        Returns the current line number
   self.accumulate()
        Appends the current token or space to the end of the last token.
        This method allows you to match a single token piece by piece.
   self.less(l=1)
        Decreases the length of the current match by 'l' characters.  For
        instance, if the current match is 'abc', and l=1, then the current
        token will receive the value 'ab', and the scanner will look for
        a 'c' at the beginning of the next token.
   self.more(l=1)
        Increases the length of the current match to include the next 
        character on the input stream.
   self.rest()
        Returns a string containing all input not yet processed.
   self.setType(t)
        Sets the type of the current token or space to be the one named by 't'.
   self.enterState(s)
        Changes the current scanner state to 's'.
   self.error(s, fatality="FATAL")
        Reports an error.  FATAL errors stop the scanner immediately.
        ERROR errors prevent parsing, but do not stop the scanner.
        WARN errors print a warning, but do nothing else.
   self.becomeSpace()
        Changes the currently scanned token into a space.

   self.reinterpret(tokenType, matching, as)
        This method is tricky, but powerful.  It causes all future 
        references to tokens of type 'tokenType' which match the regular
        expression or predicate 'matching' to be reinterpreted as the
        type 'as'.

        You can use it to implement file formats which have a form of
        alias directive.  For example, suppose that while parsing a .tcshrc
        you see 'alias s set'.  This means that you want to reinterpret
        all 'command' tokens equal to 's' as being 'set' tokens.
        You can do this with:
               self.reinterpret('command', r's', 'set')

Code for parser actions has access to the following fields and
methods:

 FIELDS
   self.args
      A list containing all sub-nodes of the current node.
   self.match
      The currently-created node.
   self.type
      The type of the current node.

 METHODS
   self.error(s)
      Raises an error and stops parsing.
   self.getLineNumber()
      Returns current line number.
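
For example, here is a sketch of a parser action that validates options
against a keyword list (the rule, action names, and keywords are all
illustrative):

        action init_keywords (initParse) in python:
            self.keywords = ['Port', 'Cipher', 'Protocol']

        Option = Word:k "=" Word:v NL action check_option;
        action check_option in python:
            # self.args holds the sub-nodes: [Word, "=", Word, NL].
            # Leaf nodes expose their text with get() (see section 3).
            if self.args[0].get() not in self.keywords:
                self.error("unknown keyword on line %d"
                           % self.getLineNumber())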

2.6 Macros

Parsely description files provide a very powerful macro-expansion
feature based on Python code.  You can use macros to describe complex
file formats while keeping the rules short.

Here is a trivial example of using a macro to declare a bunch of tokens
at the same time:
        python:
            import string

        defmacro tokens(s):
            out = []
            for item in string.split(s):
                out.append("token %s = re/%s\\b/;" % (item,item))
            return string.join(out, "\n")

        tokens("""alias set setenv bindkey""")
        # (The above expands to:
        #  token alias = re/alias\b/;
        #  token set = re/set\b/;
        #  token setenv = re/setenv\b/;
        #  token bindkey = re/bindkey\b/;).

Macros run in a python namespace of their own, but have access to
other modules as needed.  Macros should appear at the beginning of their
own lines.  The macro syntax is as follows:

    python:
        BLOCK
      The code here is run in the macro namespace.  You can use these
      blocks to initialize variables, import python modules, and so forth.
 
    import modulename
      Imports the python module 'modulename' into the macro namespace.
      Imports all functions listed in modulename.PARSELY_MACROS as macros.

    def fnname(argumentlist):
        BLOCK
      Defines a function in macro namespace.  Does _not_ define a macro.

    defmacro fnname(argumentlist):
        BLOCK
      Defines a function in the macro namespace.  This function is also
      available as a macro.

    macroname(arguments)
      Calls the given macro with the provided arguments.  First, evaluates
      the arguments in Python.  Next, passes them to the macro function.  If 
      the function returns a string or a list of strings, inserts them
      into the input stream.
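
For example, here is a sketch combining these forms (all the names are
illustrative):

        python:
            import string
            PREFIX = "Kw"

        # A helper usable by macros, but not itself callable as a macro.
        def mkToken(name):
            return "token %s%s = re/%s\\b/i;" % (PREFIX, name, name)

        # A macro that expands to one token declaration per word.
        defmacro kwtokens(s):
            return string.join(map(mkToken, string.split(s)), "\n")

        kwtokens("host port user")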

2.7 Global options

The file description may contain options that affect the scanner or parser
as a whole.  As of this writing, only two are supported:

      option nocase;
         All tokens are treated as case-insensitive.
      option dotall;
         The /./ regular expression matches newline.
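
For example (a sketch):

      option nocase;
      # With nocase in effect, this matches 'host', 'HOST', 'Host', and
      # so on, without an 'i' flag on each declaration.
      token Host = "host";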

2.8 Details about the scanner and parser

This section has a few implementation notes on the scanner and parser.

The scanner allows most of Perl's regular expressions, except for:
    * backreferences (\1 \2 \3 \4 ...)
    * beginning and end markers (\A and \Z)
    * Named patterns (?P....)

The parser uses John Aycock's SPARK package, and can parse any
context-free language.

The current implementation is _not_ tuned for performance.  That will
come with the C module.  Still, the speed seems acceptable for our
intended uses.

3. Advanced features: manipulating nodes

This section is a quick-reference.  For more documentation, see the
docstrings in the source code.

* To read a file format from the disk, use parsely.loadFormat or
  parsely.grammar.generateFileFormat.

* For information on using the format object, see the methods of 
  parsely.format.FileFormat.  The ones you need are getTypeSystem,
  parse, and parseFile.

* For more information on how to manipulate nodes, see parsely.tree.
  
* To create new nodes, use the newNode method of parsely.tree.TypeSystem.

* To examine or modify nodes, use a method on the appropriate subclass of
  parsely.tree.Node.

  For displaying nodes:
     dump
     prettyPrint
     sExpression

  For copying nodes:
     clone

  To inspect a node:
     getType
     isLeaf
     getTrailingSpace
     getLineNumber

  To apply a function to every descendant of a node:
     traverse

  For leaf nodes:
       getSpace  (gets the space after the node.)
       setSpace
       get
       set
      
  For list nodes:
       (all list methods) 

  For file nodes:
       getInitialSpace
       setInitialSpace
       flush   (writes the file back to disk)

* All nodes support 'getPath' and 'setPath' functions.  These take
  'paths' which describe the location of a node or set of nodes as a
  period-separated list of identifiers, numbers, and stars.
  
  For example,  x.getPath('a.3.b') == x.a[3].b
                x.getPath('a.*.b') == [ y.b for every y in x.a ]
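
  Here's a sketch using the ssh_config format from section 1 (it assumes
  'file' was parsed as shown there, and that setPath takes the new value
  as its second argument):

     hostnames = file.getPath('hosts.*.hostDecl.host')  # all host names
     # The same change as file.hosts[0].hostDecl.host = '*' in section 1:
     file.setPath('hosts.0.hostDecl.host', '*')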

* The function parsely.tree.buildIndex can construct dictionaries of
  various nodes based on their contents.  For example,
           d = buildIndex(node, "a.*", "b")
  constructs a dictionary such that every element 'node.a.X' is a
  value of 'd' for the key 'node.a.X.b'.

  See the docstring for more information.
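
  For example, with the ssh_config format from section 1, a sketch like
  this would index sections by host name (assuming 'file' was parsed as
  shown there, and that the key path may have several components):

     from parsely.tree import buildIndex
     # Maps each host name to its Section node, so that
     # byHost['frunobulax.seul.org'] is that host's section.
     byHost = buildIndex(file, 'hosts.*', 'hostDecl.host')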