A JavaCC Parser
for the
Oberon-2 Programming Language

James Power

Department of Computer Science,
National University of Ireland,
Maynooth, Co. Kildare, Ireland.

17/11/98.


This document gives a brief description of a JavaCC parser for the programming language Oberon-2. The actual code is contained in the following files:


Introduction

The programming language Oberon is the latest generation in the Wirth family of languages, an heir to the Pascal and Modula tradition. Whatever its merits as a programming language, its design clearly has the compiler writer in mind. The language is small, its syntax is well defined and relatively easily parsed, and most of the features of the language can be understood from the language report, an incredibly concise document when compared to other languages of similar power.

Indeed the progress from Pascal through to Oberon stands in sharp contrast to the evolution from C to early C++ and onto ANSI-standardised C++. Writing a parser for Oberon is a reasonable undergraduate level project, or an afternoon's work for someone who knows what they're doing. Writing a parser for C++ is a mammoth task, involving the construction of a fully working symbol table.

In what follows I briefly describe some of the issues involved in constructing the Oberon parser.

Ambiguity Issues

I tried to keep to the EBNF specification as closely as possible, and this is probably a more realistic goal for Oberon than for most languages. A few changes were needed, however:
Simple Changes
Clearly the definition of a Qualident as [ident "."] ident, as well as the common prefix of Designator for assignment statements and procedure calls cause problems for LL parsing, but these are easily rectified by simple rearrangement.

Qualidents vs Designator
When parsing something like ident1.ident2.ident3 there's a choice between:
  Designator -> Qualident . ident2 . ident3
  Qualident -> ident1
and
  Designator -> Qualident . ident3
  Qualident -> ident1 . ident2
To resolve this I've changed the definition of Designator to
  designator = ident {"." ident | ..... }
This removes the ambiguity, at the cost of incorporating module names into designators, which will then have to be disentangled at the semantic stage

Designators vs Procedure Calls
This was a more awkward problem: a typeguard of the form v(i) asserts that v is of type i, and is syntactically indistinguishable from a procedure call, with v as the procedure name, and variable i as its argument. The ambiguity showed up in the grammar as a clash between the "(" Qualident() ")" allowed as part of a Designator(), and the optional "(" (ExprList())? ")" that could follow a Designator as part of a procedure call.

The possible parses are

  Factor -> Designator
  Designator -> Qualident "(" Qualident ")"
and alternatively:
  Factor -> Designator "(" ExprList ")"
  Designator -> Qualident
  ExprList -> Expr ->* Qualident
One way around this would be to distinguish between type names and variable names (as has to be done for e.g. C++), but this would require a functioning symbol table, which was something I didn't want to get into at this stage. Two syntactic solutions would be:
  1. Pull the arguments back into the definition of designator, giving something like
    Designator = Qualident {"." ident | "[" ExprList "]" | " ^ " | "(" [ExprList] ")"}.
    This is the simplest solution, but makes the grammar a little too loose, since we can now have all the Designator suffixes appearing immediately after a procedure call
  2. Decide that procedure calls with one (identifier) argument are going to be consumed as a Designator, any other type of call is recognised correctly - this dumps the problem of disentangling the incorrectly categorised procedure calls onto the semantic phase

The latter course was taken here; in particular, JavaCC's syntactic lookahead was used to eliminate the ambiguity in favour of the Designator().

A Lexical Issue

This problem was somewhat unexpected, and had me puzzled for a while; suppose we have:
 4..5
This is part of either a set element or a case guard, and should be tokenised as <INTEGER><DOTDOT><INTEGER>. However, if there's no space between the first integer and the dots, the scanner will consume as much as possible which in this case is "4.", and the token sequence returned is <REAL>"."<INTEGER>, giving a syntax error when the "." is seen.

To solve this I would have liked some kind of "trailing context" as per lex, or even some ability to give characters back to the tokeniser, but this didn't seem to be a possibility in JavaCC. The solution taken here was to trap any occurrences of (<DIGIT>)+ <DOTDOT>, and then separate them using Java code, creating a new token for the <DOTDOT> in the process, and linking it to the next field of the digits.

Aside:
In lex I could have specified the following, which would have recognised the digits and dots, without consuming the dots:
  {DIGIT}+/".." { return (INTEGER); }

Testing

The Oberon Bibliography mentions several books on Oberon, some of which have made their sample code available on the web. The parser has been tested on the code from the following books (and reports no syntax errors!): In total this gives about 92 programs, containing a cumulative 23,000 lines of Oberon code (including whitespace and comments). Most of the errors uncovered in my parser during the testing resulted from simple typos, though it was the code from the "Project Oberon" book that threw up the lexical problem regarding guards mentioned above.

References

Java and all Java-based trademarks and logos are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.


Last Revised: 24th November 1998
Contact: James.Power@may.ie