XML output for GNU bison and gcc

James Power, NUI Maynooth

The GNU bison parser generator, an implementation of yacc, is one of the most commonly used bottom-up parser generators. In particular, it is used in the C, C++, Java, and objective C front-ends for gcc, the GNU Compiler collection.

People often need to write source-code analysis programs and thus require parsers for the main commonly-used programming languages. However, figuring out the internal workings of a large compiler suite such as gcc is a daunting task; similarly, writing your own front end for a language like C++ is also non-trivial.

GNU bison can be run in "debug mode" where it will list the production rules applied, but this is not always the friendliest format to work with. We have modified the debugging output to instead produce XML tags, so that the parser now produces a version of its input marked up with tags that indicate the categories to which various constructs belong.

There's a high-level overview of this work, presented at the Working Conference on Reverse Engineering, 2002:

Generating XML from a bison-based parser

To get bison's debugging to generate the XML tags, it is necessary to change the C-code output produced by the parser generator. This is considerably facilitated by the inclusion of the --skeleton option since bison version 1.28a (in August 2001). This allows the user to specify an arbitrary parsing routine, while bison will generate the necessary parse tables and lists of symbols to support it.

Our skeleton file is a modified version of the standard one:

To use this, just run bison with the extra command-line option: When compiling this, you need to define the flag YYXML to get the relevant code pulled in. The output is sent to stderr or, preferably, whatever the environment variable BISON_XML_FILE is set to. Anyway, it's much easier to see how all this fits together using an example. I've put together a trivial example using the simple expression grammar: The README file in this .zip archive gives all the relevant information.

Note: the approach described below won't work with current versions of gcc, since it now uses a recursive-descent parser.
Try the g4re toolchain instead!

Integrating this into gcc

Our original goal in changing the bison skeleton in this way was to try and get the gcc parsers to tell us what they were doing. For example, the following illustrates the well-known declaration/expression ambiguity in C:

The ambiguity is resolved (naturally enough) by gcc, and, using our modified version of bison, we get the following correct and different mark-ups for the two lines So - it's not pretty, but I took the view that it was better to produce all possible tags and let you filter them out, rather than me trying to figure out which ones you didn't want.

The modifications to the gcc source basically involve changing the Makefiles to include the skeleton and switch on the flags. I've got an outline of how this can be done for gcc v3.0:

The README file in this .zip archive is mine, and a good place to start.

The modifications are minimal, but you need to be prepared to go in and have a look at the gcc source files, and of, course, re-build the compiler suite itself. It works for gcc 3.0, but gcc is something of a moving target - best of luck if you want to try this!

Processing the XML Output

Mark Hennessy has developed some tools for processing the XML tags outputted by the bison/gcc combination described above. See:

The bison and gcc code that I modified is covered by the GNU General Public License. The resulting versions/modifications presented above should also be considered to be so covered.

NUIM Logo James Power,
Dept. of Computer Science
Last revised: 10 May 2005