Discussion:
help wanted - compiler for a new language, written in Modula-2
m***@gmail.com
2017-04-11 06:06:06 UTC
I am working on a new computer language, and have written a first-cut recursive descent transpiler (converting from the new language into AS3 and JavaScript), but I need someone to redo the compiler using top-down operator precedence (as presented in the lectures by Douglas Crockford), that is, the Pratt method, which combines operator-precedence parsing with top-down recursive descent. It would be a lot faster than the backtracking in my current compiler. If you know what this all means, and program in Modula-2, then please send me a note.
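
For anyone unfamiliar with the technique, the core of Pratt parsing is a single loop driven by operator binding powers. Here is a minimal, self-contained sketch in Modula-2 (all names are illustrative, not taken from my compiler):

MODULE PrattDemo;
(* minimal sketch of top-down operator precedence (Pratt) parsing;
   evaluates an expression held in a token buffer *)

FROM STextIO IMPORT WriteLn;
FROM SWholeIO IMPORT WriteInt;

TYPE
  TokenKind = (num, plus, times, caret, eof);
  Token = RECORD kind : TokenKind; value : INTEGER END;

VAR
  buf : ARRAY [0..7] OF Token;
  pos : CARDINAL;

PROCEDURE lbp ( k : TokenKind ) : CARDINAL;
(* left binding power: higher binds tighter, zero stops the loop *)
BEGIN
  CASE k OF
    plus  : RETURN 10
  | times : RETURN 20
  | caret : RETURN 30
  ELSE
    RETURN 0
  END (* CASE *)
END lbp;

PROCEDURE power ( base, exp : INTEGER ) : INTEGER;
VAR r : INTEGER;
BEGIN
  r := 1;
  WHILE exp > 0 DO r := r * base; DEC(exp) END;
  RETURN r
END power;

PROCEDURE expression ( rbp : CARDINAL ) : INTEGER;
(* parse the prefix part, then keep consuming infix operators whose
   binding power exceeds rbp; this one loop replaces the usual
   cascade of per-precedence-level procedures *)
VAR left, right : INTEGER; op : TokenKind;
BEGIN
  left := buf[pos].value; INC(pos);  (* "nud": numbers only here *)
  WHILE lbp(buf[pos].kind) > rbp DO
    op := buf[pos].kind; INC(pos);   (* "led" for each operator *)
    IF op = caret THEN
      right := expression(lbp(caret) - 1);  (* right-associative *)
      left := power(left, right)
    ELSIF op = times THEN
      left := left * expression(lbp(times))
    ELSE (* plus *)
      left := left + expression(lbp(plus))
    END (* IF *)
  END; (* WHILE *)
  RETURN left
END expression;

BEGIN
  (* 2 + 3 * 2 ^ 3 -- expect 26 *)
  buf[0].kind := num;   buf[0].value := 2;
  buf[1].kind := plus;
  buf[2].kind := num;   buf[2].value := 3;
  buf[3].kind := times;
  buf[4].kind := num;   buf[4].value := 2;
  buf[5].kind := caret;
  buf[6].kind := num;   buf[6].value := 3;
  buf[7].kind := eof;
  pos := 0;
  WriteInt(expression(0), 0);
  WriteLn
END PrattDemo.

Note how associativity is just a matter of the right-hand binding power passed to the recursive call.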

I am currently using the ADW Modula-2 compiler on Windows, but have a very nice cross-platform low-level library that also has a version for the P1 compiler.

Will pay commensurate with skill and quality of output.
j***@gmail.com
2017-04-12 09:31:45 UTC
Perhaps combine it with the MC project?
c***@gmail.com
2017-04-14 10:22:53 UTC
Is the grammar for your new language LL(1)?

Sincerely
Carl Glassberg
----
Post by m***@gmail.com
I am working on a new computer language, and have written a first-cut recursive descent transpiler (converting from the new language into AS3 and JavaScript), but I need someone to redo the compiler using top-down operator precedence (as presented in the lectures by Douglas Crockford), that is, the Pratt method, which combines operator-precedence parsing with top-down recursive descent. It would be a lot faster than the backtracking in my current compiler. If you know what this all means, and program in Modula-2, then please send me a note.
I am currently using the ADW Modula-2 compiler on Windows, but have a very nice cross-platform low-level library that also has a version for the P1 compiler.
Will pay commensurate with skill and quality of output.
m***@gmail.com
2017-04-15 06:48:50 UTC
I have an EBNF grammar for the language; it is an indentation-based language like Python. It isn't that hard to parse; the grammar is about the same size as Swift's. After watching the Crockford video on YouTube about the simplicity and speed of the Pratt methodology, it is clear that the compiler will be much faster, with less backtracking, if it uses that approach.

A real sticking point in my first-cut compiler, which was purely recursive descent, is that error recovery is very hard to do in recursive descent, and a huge amount of extra code exists to exit out of deeply nested function calls. This was always a weak spot in the Wirth code examples which I learned from. Production compilers are very different from academic versions, as nice error messages are a must for the next hundred million programmers-to-be.

The language includes a full 2D graphics and event model, a graph database, and shell-scripting capabilities. It is designed to replace the full development stack for mobile and web development, which means it is aiming directly at the largest single software development market: interactive graphics applications for mobile and web. Basically, you write your code in a single language, with no make scripts, no package managers, no external database, no frameworks at all, and under 30 API functions for a typical program. It will be competing against Eve, Elm, Red, and a few other "next-gen" languages.
Marco van de Voort
2017-04-15 12:15:25 UTC
Post by m***@gmail.com
I have an EBNF grammar for the language; it is an indentation-based
language like Python. It isn't that hard to parse; the grammar is about
the same size as Swift's. After watching the Crockford video on YouTube
about the simplicity and speed of the Pratt methodology, it is clear that
the compiler will be much faster, with less backtracking, if it uses that
approach. A real sticking point in my first-cut compiler, which was purely
recursive descent, is that error recovery is very hard to do in recursive
descent, and a huge amount of extra code exists to exit out of deeply
nested function calls. This was always a weak spot in the Wirth code
examples which I learned from.
Strangely, I always considered the Wirth languages to be the ones with
good error handling support, and that was often attributed to their
compilers typically being recursive descent parsers for almost-LL(1)
languages.

It might simply be a bad fit for your particular language though.
Chris Burrows
2017-04-16 01:23:12 UTC
Post by m***@gmail.com
A real sticking point in my first-cut compiler, which was purely
recursive descent, is that error recovery is very hard to do in
recursive descent, and a huge amount of extra code exists to exit out
of deeply nested function calls. This was always a weak spot in the
Wirth code examples which I learned from.
How old were the examples you were using? The strategy used by Wirth in his latest Project Oberon compiler does not have that problem. He describes it as follows:

"Also, the language Oberon is designed with the property that most large constructs begin with a unique symbol, such as IF, WHILE, CASE, RECORD, etc. These symbols facilitate the recovery of the parsing process in the erroneous text. More problematic are open constructs which neither begin nor end with key symbols, such as types, factors, and expressions. Relying on heuristics, the source text is skipped up to the first occurrence of a symbol which may begin a construct that follows the one being parsed. The employed scheme may not be the best possible, but it yields quite acceptable results and keeps the amount of program devoted to the handling of erroneous texts within justifiable bounds."

For the full source code, including extensive documentation, of his RISC5 Oberon compiler and of the operating system that was built entirely using the language, see 'Project Oberon 2013' at:

https://people.inf.ethz.ch/wirth/
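
To make that heuristic concrete, here is a minimal sketch of such a resynchronisation routine (names are illustrative only; this is not Wirth's actual code):

MODULE SkipSketch;
(* sketch of the resynchronisation heuristic described above:
   on an error, skip source symbols until one is found that may
   begin a construct following the one being parsed *)

TYPE
  Symbol = (identSym, ifSym, whileSym, caseSym, recordSym,
            beginSym, semicolonSym, endSym, eofSym);
  SymbolSet = SET OF Symbol;

VAR
  sym : Symbol;  (* current lookahead symbol *)

PROCEDURE getSym;
(* stub standing in for the scanner *)
BEGIN
  sym := eofSym
END getSym;

PROCEDURE skipTo ( stop : SymbolSet );
(* report the error once, then skip symbols until one is found
   in the stop set or the input is exhausted *)
BEGIN
  (* Mark("syntax error") would be called here *)
  WHILE NOT (sym IN stop) AND (sym # eofSym) DO
    getSym
  END (* WHILE *)
END skipTo;

BEGIN
  getSym;
  (* typical call site in a statement parser: *)
  skipTo(SymbolSet{ifSym, whileSym, caseSym, semicolonSym, endSym})
END SkipSketch.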

If your compiler is having difficulty parsing your language, you should consider that a warning sign that humans may have similar difficulties when trying to fully comprehend it. I'd advise you to rethink the language design in the areas that are causing you specific problems.

Regards,
Chris Burrows
CFB Software
http://www.astrobe.com/RISC5
trijezdci
2017-07-29 14:34:45 UTC
Post by m***@gmail.com
I have an EBNF grammar for the language;
Is it LL(1) though?
Post by m***@gmail.com
it is clear that the compiler will be much faster, with less backtracking, if it uses that approach.
An RD parser for an LL(1) grammar should not have to backtrack.
Post by m***@gmail.com
A real sticking point in my first-cut compiler, which was purely recursive descent, is that error recovery is very hard to do in recursive descent,
I always found it much harder to do error recovery with table-driven parsers.
Post by m***@gmail.com
and a huge amount of extra code exists to exit out of deeply nested function calls.
Then you are doing something wrong.
Post by m***@gmail.com
This was always a weak spot in the Wirth code examples which I learned from.
Wirth never taught error recovery. His examples were meant to illustrate the general process of recursive descent, not its error recovery. You may want to read other authors who actually cover error recovery, for example Dick Grune.

The key to good error recovery in an RD parser for an LL(1) grammar is to

(1) implement FIRST() and FOLLOW() sets as functions the parser can call (see the sketch below)
(2) don't nest match token/set tests, but code the tests sequentially
(3) if a test fails, skip to a token that is expected by the next test
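
By way of illustration, point (1) and the matching test might look like this. This is a sketch with invented token, production and set contents; it is not the actual m2bsk code:

MODULE FirstFollowSketch;
(* FIRST and FOLLOW sets as plain functions the parser can call *)

TYPE
  TokenT = (arraySym, ofSym, plusSym, minusSym, identSym,
            numberSym, lparenSym, rparenSym, semicolonSym, eofSym);
  TokenSet = SET OF TokenT;
  Production = (Expression, Qualident, ArrayType);

VAR
  lookahead : TokenT;

PROCEDURE FIRST ( p : Production ) : TokenSet;
(* tokens that may begin production p *)
BEGIN
  CASE p OF
    Expression : RETURN TokenSet{plusSym, minusSym, identSym,
                                 numberSym, lparenSym}
  | Qualident  : RETURN TokenSet{identSym}
  | ArrayType  : RETURN TokenSet{arraySym}
  END (* CASE *)
END FIRST;

PROCEDURE FOLLOW ( p : Production ) : TokenSet;
(* tokens that may follow production p *)
BEGIN
  CASE p OF
    Expression : RETURN TokenSet{ofSym, rparenSym, semicolonSym}
  | Qualident  : RETURN TokenSet{semicolonSym, rparenSym}
  | ArrayType  : RETURN TokenSet{semicolonSym}
  END (* CASE *)
END FOLLOW;

PROCEDURE matchSet ( expected : TokenSet ) : BOOLEAN;
(* test, report on failure, but let the caller decide how to resync *)
BEGIN
  IF lookahead IN expected THEN
    RETURN TRUE
  ELSE
    (* emit "syntax error, expected one of ..." here *)
    RETURN FALSE
  END (* IF *)
END matchSet;

END FirstFollowSketch.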

You can see this approach all over my parser. For example, take a look at the function that implements array type declarations:

https://github.com/m2sf/m2bsk/blob/master/src/imp/Parser.mod#L945

arrayType :=
  ARRAY valueCount OF typeIdent
  ;

One might be inclined to code this in a nested fashion, like so:

(* ARRAY *)
lookahead := Lexer.consumeSym(lexer);

(* valueCount *)
IF matchSet(FIRST(Expression)) THEN
  lookahead := expression(valueCount);

  (* OF *)
  IF matchToken(Token.Of) THEN
    lookahead := Lexer.consumeSym(lexer);

    (* typeIdent *)
    IF matchSet(FIRST(Qualident)) THEN
      lookahead := qualident(baseType)
    END (* IF *)
  END (* IF *)
END (* IF *)
Of course this works, but it makes error recovery very difficult, because you need to back out of your nested IFs and then figure out how the parser can continue. Generally speaking, by the time you reach the end of the nested block, you have progressed too far in the code to continue, and you are forced to skip even further in the hope of finding a sensible resync point.

The approach I outlined above in the three numbered points doesn't have this problem. You'd code the tests in sequence, like so:

(* ARRAY *)
lookahead := Lexer.consumeSym(lexer);

(* valueCount *)
IF matchSet(FIRST(Expression)) THEN
  lookahead := expression(valueCount)
ELSE (* resync *)
  lookahead := skipToMatchTokenOrSet(Token.Of, FIRST(TypeIdent))
END; (* IF *)

(* OF *)
IF matchToken(Token.Of) THEN
  lookahead := Lexer.consumeSym(lexer)
ELSE (* resync *)
  lookahead := skipToMatchSetOrSet(FIRST(Qualident), FOLLOW(ArrayType))
END; (* IF *)

(* typeIdent *)
IF matchSet(FIRST(Qualident)) THEN
  lookahead := qualident(baseType)
ELSE (* resync *)
  lookahead := skipToMatchSet(FOLLOW(ArrayType))
END (* IF *)

With this approach, when an error occurs, the parser skips to a token that is expected by the following test and continues. It is then just a matter of fine-tuning the resync points, that is to say, which tokens to skip to.

For this you will need to write a number of resync functions for different cases. In my parser I started out with just two, skipToMatchToken and skipToMatchSet, but soon found that I needed to add more, like skipToMatchTokenOrToken, skipToMatchTokenOrSet and skipToMatchTokenOrTokenOrSet.
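
For instance, the token-or-set variant can be sketched like this (simplified; the token names are invented and the real functions also take the lexer handle):

MODULE ResyncSketch;

TYPE
  TokenT = (identSym, numberSym, ofSym, semicolonSym, eofSym);
  TokenSet = SET OF TokenT;

VAR
  lookahead : TokenT;

PROCEDURE consumeSym () : TokenT;
(* stub standing in for Lexer.consumeSym *)
BEGIN
  RETURN eofSym
END consumeSym;

PROCEDURE skipToMatchTokenOrSet
  ( token : TokenT; set : TokenSet ) : TokenT;
(* skip until the lookahead matches token, lies in set, or input
   is exhausted; return the new lookahead for the next test *)
BEGIN
  WHILE (lookahead # token) AND
        NOT (lookahead IN set) AND (lookahead # eofSym) DO
    lookahead := consumeSym()
  END; (* WHILE *)
  RETURN lookahead
END skipToMatchTokenOrSet;

END ResyncSketch.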

The benefit of this approach is that your error recovery is easily tunable by simply changing the resync tokens. With nested tests, you can't adjust that as easily: if you want to change your resync points in a nested IF structure, you can't just change the resync tokens, you have to change the code as well.

If you read authors who cover error recovery, like Grune, you will learn that some approaches insert tokens into the stream instead of skipping tokens when an error is detected; in other words, the error recovery assumes that some token is missing from the source code.

You will find that the structure of sequential tests I describe above can easily accommodate such a strategy as well. The flatter your structure, the more flexible it is.
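
For example, the OF test from above could recover by assumed insertion rather than by skipping (a sketch, reusing the names from the earlier snippets):

(* OF *)
IF matchToken(Token.Of) THEN
  lookahead := Lexer.consumeSym(lexer)
ELSE (* recover by assumed insertion *)
  (* report "missing OF" here, but consume nothing: proceed as
     if OF had been present, leaving the lookahead untouched *)
END (* IF *)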

hth
benjamin
Pascal J. Bourguignon
2017-07-29 20:09:44 UTC
Post by trijezdci
Post by m***@gmail.com
This was always a weak spot in the Wirth code examples which I learned from.
Wirth never taught error recovery. His examples were meant to
illustrate the general process of recursive descent, not its error
recovery. You may want to read other authors who actually cover error
recovery. For example Dick Grune.
On the other hand, when you have a fast (unsophisticated, therefore
correct and fast) compiler, you can easily compile up to the first
error, correct it in the IDE, and then compile again up to the next
error.

Error recovery was important when you used batch compilers in batch
environments, but not when you're compiling interactively in an IDE.
--
__Pascal J. Bourguignon
http://www.informatimago.com
trijezdci
2017-07-29 13:21:56 UTC
Post by m***@gmail.com
I am working on a new computer language, and have written a first-cut recursive descent transpiler (converting from the new language into AS3 and JavaScript), but I need someone to redo the compiler using top-down operator precedence (as presented in the lectures by Douglas Crockford), that is, the Pratt method, which combines operator-precedence parsing with top-down recursive descent. It would be a lot faster than the backtracking in my current compiler. If you know what this all means, and program in Modula-2, then please send me a note.
I am currently using the ADW Modula-2 compiler on Windows, but have a very nice cross-platform low-level library that also has a version for the P1 compiler.
Sorry for the late response. You want to make sure that your language grammar is LL(1); then the issue you describe is avoided in the first place. No backtracking is needed when parsing an LL(1) grammar.

For an example of an expression grammar that is LL(1) see

https://github.com/m2sf/m2bsk/wiki/Syntax-Diagrams#expression

The grammar imposes left associativity, in line with the rules of mathematics. A parser following that grammar needs neither to backtrack nor to reorder terms.
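
In code, that left associativity falls out of a simple iteration. A sketch with invented names, assuming a term() parser and a consumeSym() lexer interface as in the earlier examples:

PROCEDURE simpleExpression () : INTEGER;
(* simpleExpression := term ( ( '+' | '-' ) term )* ;
   the iteration folds to the left: a - b - c = (a - b) - c *)
VAR left : INTEGER; op : TokenT;
BEGIN
  left := term();                    (* leftmost operand first *)
  WHILE (lookahead = plusSym) OR (lookahead = minusSym) DO
    op := lookahead;
    lookahead := consumeSym();
    IF op = plusSym THEN
      left := left + term()
    ELSE
      left := left - term()
    END (* IF *)
  END; (* WHILE *)
  RETURN left
END simpleExpression;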

However, if you were to add exponentiation to this expression grammar, like

simpleTerm :=
  NOT factor | factor ( '**' factor )*
  ;

then things would get a little more interesting, because the above grammar imposes left associativity, which is what you want for just about all math operations, but not for exponentiation, which is right-associative.

In this case it then depends on whether you are building an abstract syntax tree (AST), as multi-pass compilers do, or using syntax-directed translation, as single-pass compilers do.

If you are building an AST, the associativity of an expression subtree depends entirely on how you insert new nodes into the existing tree. You are in control of that, so there is no issue: no backtracking, no shunting, nor any other post-processing needed.
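
Sketched concretely (the AST helper newPowNode and the token names are invented for illustration; this is not the actual m2bsk API):

PROCEDURE powChain () : AstT;
(* factor ( '**' factor )* built right-recursively, so that the
   tree leans right and 2**3**2 parses as 2 ** (3**2) *)
VAR left : AstT;
BEGIN
  left := factor();
  IF lookahead = starStarSym THEN
    lookahead := consumeSym();
    left := newPowNode(left, powChain())  (* recurse on the right *)
  END; (* IF *)
  RETURN left
END powChain;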

If you are doing syntax-directed translation, then you'd need to shunt the factors, for example by pushing them onto a stack and popping them for evaluation once you have reached the end of the expression. The overhead associated with that is negligible: you'd pay more for creating AST nodes in a multi-pass compiler than you pay for pushing and popping values to and from a pre-allocated stack in a single-pass compiler. I am not a fan of syntax-directed translation, but this particular scenario is not an issue.
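
A sketch of that shunting for a '**' chain (stack, push, pop and intPower are invented for illustration):

VAR
  stack : ARRAY [0..31] OF INTEGER;
  sp : CARDINAL;  (* stack pointer, 0 = empty *)

PROCEDURE push ( v : INTEGER );
BEGIN
  stack[sp] := v; INC(sp)
END push;

PROCEDURE pop () : INTEGER;
BEGIN
  DEC(sp); RETURN stack[sp]
END pop;

PROCEDURE reducePowChain ( count : CARDINAL ) : INTEGER;
(* the parser pushes each factor of  f1 ** f2 ** ... ** fn  as it
   is recognised, then calls this at the end of the chain; folding
   right-to-left yields right associativity, e.g. factors 2, 3, 2
   reduce to 2 ** (3 ** 2) = 512 *)
VAR left, right : INTEGER;
BEGIN
  right := pop();
  WHILE count > 1 DO
    left := pop();
    right := intPower(left, right);  (* intPower: invented helper *)
    DEC(count)
  END; (* WHILE *)
  RETURN right
END reducePowChain;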


For an example of how to insert AST nodes for the above expression grammar in the parser (written in Modula-2), see

https://github.com/m2sf/m2bsk/blob/master/src/imp/Parser.mod#L4000

hth
benjamin