The GNU bison parser generator, an implementation of yacc, is one of the most commonly used bottom-up parser generators. In particular, it is used in the C, C++, Java, and objective C front-ends for gcc, the GNU Compiler collection.
People often need to write source-code analysis programs and thus require parsers for the main commonly-used programming languages. However, figuring out the internal workings of a large compiler suite such as gcc is a daunting task; similarly, writing your own front end for a language like C++ is also non-trivial.
GNU bison can be run in "debug mode" where it will list the production rules applied, but this is not always the friendliest format to work with. We have modified the debugging output to instead produce XML tags, so that the parser now produces a version of its input marked up with tags that indicate the categories to which various constructs belong.
There's a high-level overview of this work, presented at the Working Conference on Reverse Engineering, 2002:
Our skeleton file is a modified version of the standard one:
bison --skeleton=bison.forxml your_file.yWhen compiling this, you need to define the flag YYXML to get the relevant code pulled in. The output is sent to stderr or, preferably, whatever the environment variable BISON_XML_FILE is set to. Anyway, it's much easier to see how all this fits together using an example. I've put together a trivial example using the simple expression grammar:
Our original goal in changing the bison skeleton in this way was to try and get the gcc parsers to tell us what they were doing. For example, the following illustrates the well-known declaration/expression ambiguity in C:
int x;
void f(int);
typedef int g;
int main()
{
f(x); /* Expression */
g(x); /* Declaration */
}
The ambiguity is resolved (naturally enough) by gcc, and, using our
modified version of bison, we get the following correct and different
mark-ups for the two lines
So - it's not pretty, but I took the view that it was better to
produce all possible tags and let you filter them out, rather than me
trying to figure out which ones you didn't want.The modifications to the gcc source basically involve changing the Makefiles to include the skeleton and switch on the flags. I've got an outline of how this can be done for gcc v3.0:
The modifications are minimal, but you need to be prepared to go in and have a look at the gcc source files, and of, course, re-build the compiler suite itself. It works for gcc 3.0, but gcc is something of a moving target - best of luck if you want to try this!
| The bison and gcc code that I modified is covered by the GNU General Public License. The resulting versions/modifications presented above should also be considered to be so covered. |
![]() |
James Power, Dept. of Computer Science Last revised: 10 May 2005 |