A JavaCC Parser
for the
Oberon-2 Programming Language
James Power
Department of Computer Science,
National University of Ireland,
Maynooth, Co. Kildare, Ireland.
17/11/98.
This document gives a brief description of a JavaCC parser for the programming
language Oberon-2. The actual code is contained in the following files:
Introduction
The programming language Oberon is the latest generation in the Wirth family
of languages, an heir to the Pascal and Modula tradition. Whatever its
merits as a programming language, its design clearly has the compiler writer
in mind. The language is small, its syntax is well defined and relatively
easily parsed, and most of the features of the language can be understood
from the language report, an incredibly concise document when compared
to other languages of similar power.
Indeed the progress from Pascal through to Oberon stands in sharp
contrast to the evolution from C to early C++ and onto ANSI-standardised
C++. Writing a parser for Oberon is a reasonable undergraduate level project,
or an afternoon's work for someone who knows what they're doing. Writing
a parser for C++ is a mammoth task, involving the construction of a fully
working symbol table.
In what follows I briefly describe some of the issues involved
in constructing the Oberon parser.
Ambiguity Issues
I tried to keep to the EBNF specification as closely as possible, and this
is probably a more realistic goal for Oberon than for most languages. A
few changes were needed, however:
-
Simple Changes
-
Clearly the definition of a Qualident as [ident "."] ident, as well as
the common prefix of Designator for assignment statements and procedure
calls cause problems for LL parsing, but these are easily rectified by
simple rearrangement.
-
Qualidents vs Designator
-
When parsing something like ident1.ident2.ident3
there's a choice between:
Designator -> Qualident . ident2 . ident3
Qualident -> ident1
and
Designator -> Qualident . ident3
Qualident -> ident1 . ident2
To resolve this I've changed the definition of Designator to
designator = ident {"." ident | ..... }
This removes the ambiguity, at the cost of incorporating module names into
designators, which will then have to be disentangled at the semantic stage
-
Designators vs Procedure Calls
-
This was a more awkward problem: a typeguard of the form v(i) asserts
that v is of type i, and is syntactically indistinguishable
from a procedure call, with v as the procedure name, and variable
i as its argument. The ambiguity showed up in the grammar as a clash
between the "(" Qualident() ")" allowed as part of a Designator(), and
the optional "(" (ExprList())? ")" that could follow a Designator as part of
a procedure call.
The possible parses are
Factor -> Designator
Designator -> Qualident "(" Qualident ")"
and alternatively:
Factor -> Designator "(" ExprList ")"
Designator -> Qualident
ExprList -> Expr ->* Qualident
One way around this would be to distinguish between type names and variable
names (as has to be done for e.g. C++), but this would require a functioning
symbol table, which was something I didn't want to get into at this stage.
Two syntactic solutions would be:
-
Pull the arguments back into the definition of designator, giving something
like
Designator = Qualident {"." ident | "[" ExprList "]" | " ^ " | "(" [ExprList] ")"}.
This is the simplest solution, but makes the grammar a little too loose,
since we can now have all the Designator suffixes appearing immediately
after a procedure call
-
Decide that procedure calls with one (identifier) argument are going to
be consumed as a Designator, any other type of call is recognised correctly
- this dumps the problem of disentangling the incorrectly categorised procedure
calls onto the semantic phase
The latter course was taken here; in particular, JavaCC's syntactic lookahead
was used to eliminate the ambiguity in favour of the Designator().
A Lexical Issue
This problem was somewhat unexpected, and had me puzzled for a while; suppose
we have:
4..5
This is part of either a set element or a case guard, and should be tokenised
as <INTEGER><DOTDOT><INTEGER>. However, if there's no
space between the first integer and the dots, the scanner will consume
as much as possible which in this case is "4.", and the token
sequence returned is <REAL>"."<INTEGER>, giving a syntax
error when the "." is seen.
To solve this I would have liked some kind of "trailing context"
as per lex, or even some ability to give characters back to the tokeniser,
but this didn't seem to be a possibility in JavaCC. The solution taken
here was to trap any occurrences of (<DIGIT>)+ <DOTDOT>,
and then separate them using Java code, creating a new token for the <DOTDOT>
in the process, and linking it to the next field of the digits.
-
Aside:
-
In lex I could have specified the following, which would have recognised
the digits and dots, without consuming the dots:
{DIGIT}+/".." { return (INTEGER); }
Testing
The Oberon Bibliography mentions several books on Oberon, some of which have made their sample code available on the web. The parser has been tested on the code from the following books (and reports no syntax errors!):
- Programming in Oberon - Steps Beyond Pascal and Modula by N. Wirth and M. Reiser
- Object-Oriented Programming in Oberon-2 by H. Mössenböck
- Compiler Construction by N. Wirth
- Project Oberon - The Design of an Operating System and Compiler by N. Wirth and J. Gutknecht
In total this gives about 92 programs, containing a cumulative 23,000 lines of Oberon code (including whitespace and comments). Most of the errors uncovered in my parser during the testing resulted from simple typos, though it was the code from the "Project Oberon" book that threw up the lexical problem regarding guards mentioned above.
References
Java and all Java-based trademarks and logos are trademarks
or registered trademarks of Sun Microsystems, Inc. in the United States
and other countries.
Last Revised: 24th November 1998
Contact: James.Power@may.ie