To: arma@MIT.EDU
Subject: My Plan for World Domination
Date: Mon, 22 Nov 1999 22:49:16 -0500
From: Nick Mathewson

Hi, Roger.  I thought you might be able to give me some good advice about
my project.  This might be a bit long, but I want to be thorough.

I start with *an example of how you'd use it, then explain *how it fits
within a larger system.  I talk a bit about *implementation details, then
discuss a few of the more *neat/weird features.  Next, I explain *what
I've implemented so far.  Finally, I close with *a bunch of questions, and
*a couple of related (but lesser :) ) projects.

The questions section has most of what I want to know about, but please
feel free to give me comments/questions on as much of the rest as you
please.

Yours,
Nick

=== AN EXAMPLE

'Parsely' will be a generalized framework for configuration tools to use.
Suppose you want to parse and modify lilo.conf files.  First, you write a
simple grammar for the file format.  The regexen are perl-style.

    Token image = "image";
    Token boot = "boot";
    Token timeout = "timeout";
    Token label = "label";
    Token EQ = "=";
    Token EOL = re{\n+};
    # ...
    Token filename = re!/.*$!;
    Token word = re!\w+!;
    Token number = re!\d+!;

    # Anything matching this regex is space.
    Space re![\t \r]+!;

    # A File is a list of global declarations followed by a list of
    # image declarations.  The former is labeled 'globals', and the
    # latter is labeled 'sections'.
    File = globalDeclarations:globals imageDeclarations:sections;

    # globalDeclarations is a list of EOL-terminated Declarations.
    globalDeclarations = Declaration* term=EOL;

    # Declaration may be a bootDecl, which is of the form boot=filename.
    # The filename is labeled 'file'.
    #
    # Declaration may also be a timeout or a label.
    Declaration = [bootDecl]    boot    EQ filename:file
                | [timeoutDecl] timeout EQ number:time
                | [labelDecl]   label   EQ word:lbl;

    # imageDeclarations is a list of imageDecl
    imageDeclarations = imageDecl*

    # imageDecl is an image line, followed by a list of declarations.
    imageDecl = imageLine:head Declarations:body;

    imageLine = image EQ filename:file EOL;

    Declarations = Declaration*;

    # The 'File' symbol is the one which contains a whole file.
    Start File;

Then, once you have this format description, you can use parsely to view
and modify your files.  Suppose you wanted to get at the filename of the
first 'image' line in your file.  You could write this in Python:

    # initialization magic here: load Lilo variable with the file format
    # object defined above
    file = Lilo.parseFile("/etc/lilo.conf")
    print file.sections[0].head.file

Or this in C++:

    ParselyNode *file = Lilo->parseFile("/etc/lilo.conf");
    cout << (*file)["sections"][0]["head"]["file"];

Or this in C:

    struct parselyNode *file = parsely_parse_file(Lilo, "/etc/lilo.conf");
    puts(parsely_get_str(file, "sections.0.head.file"));

Or this in Perl:

    $file = Lilo->parseFile("/etc/lilo.conf");
    print $file->{"sections"}[0]{"head"}{"file"}, "\n";

You can also change the 'file' object in the examples above, and write out
the changed version to disk.  When you do so, all comments and formatting
in the original will be preserved.

Since Parsely presents a common interface among all these languages, it
allows you to prototype your code in python or perl, and deploy it in C
or C++.

=== WHERE IT FITS IN

Here's how I believe I would write a configuration tool.  I'd start with
the syntax, writing a Parsely grammar for the file and using Parsely to
manipulate the file itself.  Next, I'd add a layer on top of Parsely to
define semantics for that file.
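For instance, a lilo.conf semantic layer might start out something like
this in Python (a minimal sketch, assuming the node layout from the
grammar above; the hasattr() test for a labeled alternative and the
writeFile() call are guesses, since that part of the API isn't final):

    # A minimal sketch of a lilo.conf semantic layer, assuming the 'Lilo'
    # format object and the labels ('globals', 'sections', 'time', 'lbl',
    # 'body') declared in the grammar above.

    def get_timeout(file):
        for decl in file.globals:
            if hasattr(decl, 'time'):          # assumed test for a timeoutDecl
                return int(str(decl.time))
        return None

    def set_timeout(file, tenths):
        for decl in file.globals:
            if hasattr(decl, 'time'):
                decl.time = str(tenths)        # assumed: edits just this token,
                return                         # leaving surrounding space alone

    def get_image(file, label):
        # Find the image section whose labelDecl matches 'label'.
        for section in file.sections:
            for decl in section.body:
                if hasattr(decl, 'lbl') and str(decl.lbl) == label:
                    return section
        return None

    # Usage (the writeFile() call here is a guess at the write-back API):
    #     file = Lilo.parseFile("/etc/lilo.conf")
    #     set_timeout(file, 50)
    #     file.writeFile("/etc/lilo.conf")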
Finally, I'd write a GUI/CLI/Linuxconf module/COAS module/Gnome-Capplet to
sit on top of the semantic layer, and present an interface to the user.

For example, once I'd written the lilo.conf syntax, I'd define functions
like set_timeout, get_timeout, add_image, get_image, set_default_image,
and so forth to manipulate the semantic information of the file.  Next,
I'd use something like glade (http://glade.pn.org) to build a GUI to sit
on top of the semantic layer.

        INTERFACE
            |
        SEMANTIC
            |
        SYNTACTIC

There are lots of strategic advantages to this system.  First, by making
Parsely so retargetable, members of every configuration project there is
could leverage it to handle new file formats.  Second, the config file
itself would be usable by all the projects; once somebody handles bash
right, everybody is closer to doing so.  Third, if the semantic layer is
written properly, it too can be shared; if you wrote one in C, then
linuxconf (C++), COAS (C++), Gnome (C/C++/python), GECCO (C++?) and CfgTie
(perl) would all be able to use it.  Fourth, Parsely lowers the barrier to
entry for new configuration tools; if you hate the interface and overhead
of BCT (Bob's Configuration Tool), but BCT has the perfect Parsely-based
sendmail module, you can simply copy that module to your own code!

=== IMPLEMENTATION DETAILS

Parsely is implemented in Python, but absolutely no knowledge of Python is
required to use it.  When you ask it to generate code for C, C++, or Perl,
it will parse your grammar file, and generate the code necessary to read
the configuration file in question.

In Python and Perl, scanning (i.e., breaking the input up into tokens) is
handled directly through pcre, the Perl-compatible regex library.  In C
and C++, scanning is done by _either_ pcre (which is more powerful), or
flex (which is far, far faster).

In Python and Perl, we use the Earley parsing algorithm, which runs in
worst-case O(n^3) time where n is the length of the input, but which can
handle any context-free grammar.  In C and C++, we parse with either the
Earley algorithm, or with a parser generated by bison (which runs in O(n)
time, but can only handle grammars in the LALR class).

In practice, I believe that the O(n^3) case seldom occurs even with the
slower algorithm, except with very pathological inputs.  Nevertheless,
since efficiency is a concern for many people, and since writing LALR
grammars isn't that hard, I want to include the Bison option as well.

Files are represented internally as typed trees.  Every leaf of the tree
contains a token, and the space immediately after that token.  Within
Perl and Python, nodes of the trees look just like native objects.
Within C, they're implemented in terms of glib (since GNOME seems to be
the biggest desktop project using C).  Within C++, they're implemented
using the STL.

=== FEATURES

* You can build indices of nodes in the tree.  Suppose you wanted a quick
  way to find the section of a lilo.conf file with a given label.  You
  could write in Python:

      index = buildIndex(file,
                         'sections.*',  # Make an index to members of sections
                         '*.lbl'        # indexed by the label element of any
                                        # member of that section
                        )

  and access the section named "Linux" as:

      index["Linux"]

* You can tell the scanner to enter various states.  For instance, suppose
  you wanted to have a key-value file, where the keys must be foo or bar,
  but the values may be anything up to the newline.
  If you wrote:

      Token Key = re{foo|bar};
      Token Eq  = "=";
      Token Val = re/.*$/;
      Token EOL = "\n";

  then you'd have the problem that the line "foo=bar" would be scanned as
  [Key Eq Key EOL] instead of [Key Eq Val EOL].  One way to solve this
  problem is to use scanner states:

      Start State KeyState;
      State ValState;

      Token Key = re{foo|bar} in KeyState;
      Token Eq  = "="         in KeyState enter ValState;
      Token Val = re/.*$/     in ValState;
      Token EOL = "\n"        in ValState enter KeyState;

  Under this configuration, the scanner begins in 'KeyState'.  In this
  state, it recognizes Key and Eq.  When it sees Eq, it enters ValState.
  In ValState, it puts everything up to the EOL into a Val, and returns to
  KeyState on EOL.

* You can put 'actions' in the scanner and the parser.  An action is a
  piece of code that is executed whenever a rule matches.  Within an
  action, you can call special functions to modify the behavior of the
  scanner or parser, or call other functions within your program.  Actions
  have to be written separately for each target language.

  Here's how you might use actions in C to match perl-style regexen:

      State regex;
      Start State S;

      Token reBegin = re're.' in S     enter regex action=beginRegex;
      Token reEsc   = re'\\.' in regex action=regexEsc;
      Token rePush  = '{[('   in regex action=regexPush;
      Token rePop   = '}])'   in regex action=regexPop;
      Token reChar  = '.'     in regex action=regexChar;

      Action beginRegex in C = <
      [...]

==============================================================================

Subject: Re: Call for Comments, Universal Parser
In-Reply-To: Your message of "Sat, 27 Nov 1999 06:54:16 EST."
             <199911271154.GAA20400@belegost.mit.edu>
Date: Sat, 27 Nov 1999 13:05:44 -0500
From: Nick Mathewson

Roger wrote:
[...]
>> imageDeclarations = imageDecl*
>
>missing a semicolon?

Yep.  The grammar isn't really finalized yet; my parser doesn't enforce
the semicolon properly.

[...]
>I don't have a good idea from this example of how brittle your parser
>is.  I read the lines above and think they're relatively intuitive to
>me in terms of reading, but I'm not sure if they're intuitive in terms
>of writing.

It's pretty robust, I think.  The only tricky bit is the fact that
production forms can't be nested.  In other words, you can't write things
like:

    Name = Identifier* LongIdentifier+:ids;
or
    Name = Sep(Identifier, comma+);
or
    Name = [Full] Identifier+ | [Short] Identifier;

This is necessary for us to correctly identify types for each expression.
I don't believe this is a serious shortcoming, but others may disagree.

[...]
>>the changed version to disk.  When you do so, all comments and formatting
>>in the original will be preserved.
>
>Did your above example have a "these are the ways I do comments" declaration?

Sorry; Parsely treats comments as just another kind of space, so you could
write something like:

    Space re/#.*$/;                      # For perl-style comments
    Space re!/\* (\*[^/]|[^*])* \*/!x;   # For C-style comments

Do you foresee any problem with this?

[...]
>How easy would it be to transition from current configuration management
>tools to using Parsely?

Okay, here's the big misunderstanding.  Parsely does _not_ replace any
existing tool.  What it does is make it easier to write modules for _any_
configuration tool, present or future.  As such, the transition would be
an issue for tool developers, and hopefully invisible to the users.

In my opinion, earlier projects of this nature have made several serious
errors.  One was attempting to compete with existing configuration
projects.  Parsely doesn't compete; it facilitates.
Because you can use parsely to generate linuxconf modules, linuxconf
developers ought to be happy that we're making their job easier.

Keep in mind that a parsely module is useless on its own.  Users don't
want to type 'images['linux'].append("linear\n");' in order to force lilo
to run in linear mode; they'd rather check a box on a GUI.  Similarly, the
developers of the GUI don't even want to say
'images['linux'].append("linear\n");'.  That's too low-level.  They'd
rather write 'set_image_linear('linux', 1);'.

[...]
>Is there a good way to macroize this, if this is a commonly occurring
>situation?

I'm thinking of adding a library of common 'tricky' lexemes, and providing
import functionality.  I already support macros in python, but they're
pretty clunky.  In any case, this is not version 0.1 functionality. :)

[...]
>>* I have no examples written.  I'd like to get at least tcsh, bash,
>>  fvwm2, and procmail written.
>
>Sounds good.  Doing examples for some end-user-oriented apps like samba
>or a window manager might be neat too.

Hm.  I was under the impression that decent samba configuration engines
already existed.  (I might be wrong.)  While it would be nice to get
parsely modules written for everything, I'd like to start with some of
the more pathological cases, so as to exhibit Parsely's flexibility.

>>=== QUESTIONS I HAVE
>>
>>* Do you think this tool will be useful?
>
>Yes.  The only concern I have is activation energy for people to
>transition over to it.  It has to be better than what they're currently
>using for users to care much.  It has to be easier to work with (and
>extend) than the current ones for programmers to care much.

As noted above, the users won't notice a thing.  All they'll realize is
that their configuration tools are suddenly supporting things they didn't
support before, becoming much more robust and more alike in the way they
support varying file formats.

As for the programmers, I doubt they'll change the way they handle simple
things like kvl files or windows-style .ini files, or even lilo.conf.
But I believe that when they set out to write code to handle a new format,
or when they want to handle a format too complex for an ad-hoc parser,
they'll look to parsely.  I personally feel that it's far, far easier to
write a parsely description of even lilo.conf than it is to write code to
parse it and write it back while preserving formatting.  If you know
perl-style regular expressions and context-free grammars already, then
the learning curve shouldn't be too steep at all.

Another barrier to adoption is the fact that hand-written scanners and
parsers (especially parsers) have the potential to be smaller and faster
than generated ones.  My only counterargument here is to claim that,
since I expect the generated .o files for any given grammar to be around
30 to 60K, and since (in my experience) speed is perfectly acceptable
even on a P133, there is no real reason to sacrifice maintainability for
unneeded speed.

>>* Of the unfinished portions in the previous section, which do you believe
>>  must be finished before the first developer release?  Why?  (Please
>>  consider partial completion where applicable.  If C is necessary, is
>>  it necessary to include both [fast/general] types of scanner and parser?)
>
>I think having "sufficient" functionality in any of those languages is
>good enough for release.
Do you mean that I should release once Python is working well enough, or
once Python and at least one other language are working well enough, or
once every language is working well enough?

==============================================================================