To: arma@MIT.EDU
Subject: My Plan for World Domination
Date: Mon, 22 Nov 1999 22:49:16 -0500
From: Nick Mathewson

Hi, Roger.  I thought you might be able to give me some good advice about
my project.  This might be a bit long, but I want to be thorough.

I start with *an example of how you'd use it, then explain *how it fits
within a larger system.  I talk a bit about *implementation details, then
discuss a few of the more *neat/weird features.  Next, I explain *what
I've implemented so far.  Finally, I close with *a bunch of questions, and
*a couple of related (but lesser :) ) projects.

The questions section has most of what I want to know about, but please
feel free to give me comments/questions on as much of the rest as you
please.

Yours,
Nick

=== AN EXAMPLE

'Parsely' will be a generalized framework for configuration tools to use.
Suppose you want to parse and modify lilo.conf files.  First, you write a
simple grammar for the file format.  The regexen are perl-style.

    Token image = "image";
    Token boot = "boot";
    Token timeout = "timeout";
    Token label = "label";
    Token EQ = "=";
    Token EOL = re{\n+};
    # ...
    Token filename = re!/.*$!;
    Token word = re!\w+!;
    Token number = re!\d+!;

    # Anything matching this regex is space.
    Space re![\t \r]+!;

    # A File is a list of global declarations followed by a list of
    # image declarations.  The former is labeled 'globals', and the
    # latter is labeled 'sections'.
    File = globalDeclarations:globals imageDeclarations:sections;

    # globalDeclarations is a list of EOL-terminated Declarations.
    globalDeclarations = Declaration* term=EOL;

    # Declaration may be a bootDecl, which is of the form boot=filename.
    # The filename is labeled 'file'.
    #
    # Declaration may also be a timeout or a label.
    Declaration = [bootDecl]    boot    EQ filename:file
                | [timeoutDecl] timeout EQ number:time
                | [labelDecl]   label   EQ word:lbl;

    # imageDeclarations is a list of imageDecl
    imageDeclarations = imageDecl*

    # imageDecl is an image line, followed by a list of declarations.
    imageDecl = imageLine:head Declarations:body;

    imageLine = image EQ filename:file EOL;

    Declarations = Declaration*;

    # The 'File' symbol is the one which contains a whole file.
    Start File;

Then, once you have this format description, you can use parsely to view
and modify your files.  Suppose you wanted to get at the filename of the
first 'image' line in your file.  You could write this in Python:

    # initialization magic here: load Lilo variable with the file format
    # object defined above
    file = Lilo.parseFile("/etc/lilo.conf")
    print file.sections[0].head.file

Or this in C++:

    ParselyNode *file = Lilo->parseFile("/etc/lilo.conf");
    cout << (*file)["sections"][0]["head"]["file"];

Or this in C:

    struct parselyNode *file = parsely_parse_file(Lilo, "/etc/lilo.conf");
    puts(parsely_get_str(file, "sections.0.head.file"));

Or this in Perl:

    $file = Lilo->parseFile("/etc/lilo.conf");
    print $file->{"sections"}[0]{"head"}{"file"}, "\n";

You can also change the 'file' object in the examples above, and write out
the changed version to disk.  When you do so, all comments and formatting
in the original will be preserved.

Since Parsely presents a common interface among all these languages, it
allows you to prototype your code in python or perl, and deploy it in C
or C++.

=== WHERE IT FITS IN

Here's how I believe I would write a configuration tool.  I'd start with
the syntax, writing a Parsely grammar for the file and using Parsely to
manipulate the file itself.  Next, I'd add a layer on top of Parsely to
define semantics for that file.
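For instance, a lilo.conf semantic layer might start out something like
this in Python (a minimal sketch, assuming the node layout from the
grammar above; the hasattr() test for a labeled alternative and the
writeFile() call are guesses, since that part of the API isn't final):

    # A minimal sketch of a lilo.conf semantic layer, assuming the 'Lilo'
    # format object and the labels ('globals', 'sections', 'time', 'lbl',
    # 'body') declared in the grammar above.

    def get_timeout(file):
        for decl in file.globals:
            if hasattr(decl, 'time'):          # assumed test for a timeoutDecl
                return int(str(decl.time))
        return None

    def set_timeout(file, tenths):
        for decl in file.globals:
            if hasattr(decl, 'time'):
                decl.time = str(tenths)        # assumed: edits just this token,
                return                         # leaving surrounding space alone

    def get_image(file, label):
        # Find the image section whose labelDecl matches 'label'.
        for section in file.sections:
            for decl in section.body:
                if hasattr(decl, 'lbl') and str(decl.lbl) == label:
                    return section
        return None

    # Usage (the writeFile() call here is a guess at the write-back API):
    #     file = Lilo.parseFile("/etc/lilo.conf")
    #     set_timeout(file, 50)
    #     file.writeFile("/etc/lilo.conf")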
Finally, I'd write a GUI/CLI/Linuxconf module/COAS module/Gnome-Capplet to
sit on top of the semantic layer, and present an interface to the user.

For example, once I'd written the lilo.conf syntax, I'd define functions
like set_timeout, get_timeout, add_image, get_image, set_default_image,
and so forth to manipulate the semantic information of the file.  Next,
I'd use something like glade (http://glade.pn.org) to build a GUI to sit
on top of the semantic layer.

        INTERFACE
            |
        SEMANTIC
            |
        SYNTACTIC

There are lots of strategic advantages to this system.  First, by making
Parsely so retargetable, members of every configuration project there is
could leverage it to handle new file formats.  Second, the config file
itself would be usable by all the projects; once somebody handles bash
right, everybody is closer to doing so.  Third, if the semantic layer is
written properly, it too can be shared; if you wrote one in C, then
linuxconf (C++), COAS (C++), Gnome (C/C++/python), GECCO (C++?) and CfgTie
(perl) would all be able to use it.  Fourth, Parsely lowers the barrier to
entry for new configuration tools; if you hate the interface and overhead
of BCT (Bob's Configuration Tool), but BCT has the perfect Parsely-based
sendmail module, you can simply copy that module to your own code!

=== IMPLEMENTATION DETAILS

Parsely is implemented in Python, but absolutely no knowledge of Python is
required to use it.  When you ask it to generate code for C, C++, or Perl,
it will parse your grammar file, and generate the code necessary to read
the configuration file in question.

In Python and Perl, scanning (i.e., breaking the input up into tokens) is
handled directly through pcre, the Perl-compatible regex library.  In C
and C++, scanning is done by _either_ pcre (which is more powerful), or
flex (which is far, far faster).

In Python and Perl, we use the Earley parsing algorithm, which runs in
worst-case O(n^3) time where n is the length of the input, but which can
handle any context-free grammar.  In C and C++, we parse with either the
Earley algorithm, or with a parser generated by bison (which runs in O(n)
time, but can only handle grammars in the LALR class).

In practice, I believe that the O(n^3) case seldom occurs even with the
slower algorithm, except with very pathological inputs.  Nevertheless,
since efficiency is a concern for many people, and since writing LALR
grammars isn't that hard, I want to include the Bison option as well.

Files are represented internally as typed trees.  Every leaf of the tree
contains a token, and the space immediately after that token.  Within
Perl and Python, nodes of the trees look just like native objects.
Within C, they're implemented in terms of glib (since GNOME seems to be
the biggest desktop project using C).  Within C++, they're implemented
using the STL.

=== FEATURES

* You can build indices of nodes in the tree.  Suppose you wanted a quick
  way to find the section of a lilo.conf file with a given label.  You
  could write in Python:

      index = buildIndex(file,
                         'sections.*',  # Make an index to members of sections
                         '*.lbl'        # indexed by the label element of any
                                        # member of that section
                        )

  and access the section named "Linux" as:

      index["Linux"]

* You can tell the scanner to enter various states.  For instance, suppose
  you wanted to have a key-value file, where the keys must be foo or bar,
  but the values may be anything up to the newline.
  If you wrote:

      Token Key = re{foo|bar};
      Token Eq  = "=";
      Token Val = re/.*$/;
      Token EOL = "\n";

  then you'd have the problem that the line "foo=bar" would be scanned as
  [Key Eq Key EOL] instead of [Key Eq Val EOL].  One way to solve this
  problem is to use scanner states:

      Start State KeyState;
      State ValState;

      Token Key = re{foo|bar} in KeyState;
      Token Eq  = "="         in KeyState enter ValState;
      Token Val = re/.*$/     in ValState;
      Token EOL = "\n"        in ValState enter KeyState;

  Under this configuration, the scanner begins in 'KeyState'.  In this
  state, it recognizes Key and Eq.  When it sees Eq, it enters ValState.
  In ValState, it puts everything up to the EOL into a Val, and returns to
  KeyState on EOL.

* You can put 'actions' in the scanner and the parser.  An action is a
  piece of code that is executed whenever a rule matches.  Within an
  action, you can call special functions to modify the behavior of the
  scanner or parser, or call other functions within your program.  Actions
  have to be written separately for each target language.

  Here's how you might use actions in C to match perl-style regexen:

      State regex;
      Start State S;

      Token reBegin = re're.' in S     enter regex action=beginRegex;
      Token reEsc   = re'\\.' in regex action=regexEsc;
      Token rePush  = '{[('   in regex action=regexPush;
      Token rePop   = '}])'   in regex action=regexPop;
      Token reChar  = '.'     in regex action=regexChar;

      Action beginRegex in C = <
      [...]

==============================================================================

Subject: Re: Call for Comments, Universal Parser
In-Reply-To: Your message of "Sat, 27 Nov 1999 06:54:16 EST."
             <199911271154.GAA20400@belegost.mit.edu>
Date: Sat, 27 Nov 1999 13:05:44 -0500
From: Nick Mathewson

Roger wrote:
[...]
>> imageDeclarations = imageDecl*
>
>missing a semicolon?

Yep.  The grammar isn't really finalized yet; my parser doesn't enforce
the semicolon properly.

[...]
>I don't have a good idea from this example of how brittle your parser
>is.  I read the lines above and think they're relatively intuitive to
>me in terms of reading, but I'm not sure if they're intuitive in terms
>of writing.

It's pretty robust, I think.  The only tricky bit is the fact that
production forms can't be nested.  In other words, you can't write things
like:

    Name = Identifier* LongIdentifier+:ids;
or
    Name = Sep(Identifier, comma+);
or
    Name = [Full] Identifier+ | [Short] Identifier;

This is necessary for us to correctly identify types for each expression.
I don't believe this is a serious shortcoming, but others may disagree.

[...]
>>the changed version to disk.  When you do so, all comments and formatting
>>in the original will be preserved.
>
>Did your above example have a "these are the ways I do comments" declaration?

Sorry; Parsely treats comments as just another kind of space, so you could
write something like:

    Space re/#.*$/;                      # For perl-style comments
    Space re!/\* (\*[^/]|[^*])* \*/!x;   # For C-style comments

Do you foresee any problem with this?

[...]
>How easy would it be to transition from current configuration management
>tools to using Parsely?

Okay, here's the big misunderstanding.  Parsely does _not_ replace any
existing tool.  What it does is make it easier to write modules for _any_
configuration tool, present or future.  As such, the transition would be
an issue for tool developers, and hopefully invisible to the users.

In my opinion, earlier projects of this nature have made several serious
errors.  One was attempting to compete with existing configuration
projects.  Parsely doesn't compete; it facilitates.
Because you can use parsely to generate linuxconf modules, linuxconf
developers ought to be happy that we're making their job easier.

Keep in mind that a parsely module is useless on its own.  Users don't
want to type 'images['linux'].append("linear\n");' in order to force lilo
to run in linear mode; they'd rather check a box on a GUI.  Similarly, the
developers of the GUI don't even want to say
'images['linux'].append("linear\n");'.  That's too low-level.  They'd
rather write 'set_image_linear('linux', 1);'.

[...]
>Is there a good way to macroize this, if this is a commonly occurring
>situation?

I'm thinking of adding a library of common 'tricky' lexemes, and providing
import functionality.  I already support macros in python, but they're
pretty clunky.  In any case, this is not version 0.1 functionality. :)

[...]
>>* I have no examples written.  I'd like to get at least tcsh, bash,
>>  fvwm2, and procmail written.
>
>Sounds good.  Doing examples for some end-user-oriented apps like samba
>or a window manager might be neat too.

Hm.  I was under the impression that decent samba configuration engines
already existed.  (I might be wrong.)  While it would be nice to get
parsely modules written for everything, I'd like to start with some of
the more pathological cases, so as to exhibit Parsely's flexibility.

>>=== QUESTIONS I HAVE
>>
>>* Do you think this tool will be useful?
>
>Yes.  The only concern I have is activation energy for people to
>transition over to it.  It has to be better than what they're currently
>using for users to care much.  It has to be easier to work with (and
>extend) than the current ones for programmers to care much.

As noted above, the users won't notice a thing.  All they'll realize is
that their configuration tools are suddenly supporting things they didn't
support before, becoming much more robust and more alike in the way they
support varying file formats.

As for the programmers, I doubt they'll change the way they handle simple
things like kvl files or windows-style .ini files, or even lilo.conf.
But I believe that when they set out to write code to handle a new format,
or when they want to handle a format too complex for an ad-hoc parser,
they'll look to parsely.  I personally feel that it's far, far easier to
write a parsely description of even lilo.conf than it is to write code to
parse it and write it back while preserving formatting.  If you know
perl-style regular expressions and context-free grammars already, then
the learning curve shouldn't be too steep at all.

Another barrier to adoption is the fact that hand-written scanners and
parsers (especially parsers) have the potential to be smaller and faster
than generated ones.  My only counterargument here is to claim that,
since I expect the generated .o files for any given grammar to be around
30 to 60K, and since (in my experience) speed is perfectly acceptable
even on a P133, there is no real reason to sacrifice maintainability for
unneeded speed.

>>* Of the unfinished portions in the previous section, which do you believe
>>  must be finished before the first developer release?  Why?  (Please
>>  consider partial completion where applicable.  If C is necessary, is
>>  it necessary to include both [fast/general] types of scanner and parser?)
>
>I think having "sufficient" functionality in any of those languages is
>good enough for release.
Do you mean that I should release once Python is working well enough, or
once Python and at least one other language are working well enough, or
once every language is working well enough?

==============================================================================