Discussion:
PIM/ISO vs R10 number literals
trijezdci
2016-08-30 10:38:33 UTC
I'd hope there is a pragma to permit old style constants or you
immediately orphan all pre-existing Modula 2 code by requiring rework
just to try it out with the new compiler.
I am assuming you mean "number literals" when you say "old style constants".

We decided to abandon suffixed literals because they violate the very design philosophy Wirth has promoted in all of his compiler design texts, namely that design should follow (a) human readability and (b) single input symbol lookahead.

Prefixed literals are more readable to humans and machines alike.

Instead of providing a back door to use suffixed literals, we will provide a source-to-source translator. This can then be used to replace all suffixed literals in an input source file with prefixed equivalents in the output file.
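
To give a rough idea of what such a translator has to do, here is a minimal sketch in Python. It is not the actual tool. It assumes the classic forms 0FFH (hex), 17B (octal) and 101C (octal character code), and it assumes the target forms are 0x for numbers and 0u for character codes, with octal values rewritten as hex. The names translate and _to_prefixed are made up for this illustration, and string literals and comments are not skipped.

```python
import re

# Assumed classic (PIM/ISO) forms: 0FFH (hex), 17B (octal), 101C (octal char code).
# Assumed target forms: 0xFF for numbers, 0u41 for character codes.
_SUFFIXED = re.compile(r"\b(?:([0-9][0-9A-F]*)H|([0-7]+)B|([0-7]+)C)\b")

def _to_prefixed(match):
    """Build the prefixed replacement for one suffixed literal."""
    hex_digits, octal_digits, char_digits = match.groups()
    if hex_digits is not None:
        # Drop the leading zero that the classic notation needs before A-F.
        return "0x" + (hex_digits.lstrip("0") or "0")
    if octal_digits is not None:
        # Assumption: no base-8 literals in the target, so emit hex instead.
        return "0x{:X}".format(int(octal_digits, 8))
    return "0u{:X}".format(int(char_digits, 8))

def translate(source):
    """Rewrite suffixed literals. Strings and comments are NOT skipped here."""
    return _SUFFIXED.sub(_to_prefixed, source)

print(translate("mask := 0FFH; perm := 17B; ch := 101C; n := 123;"))
```

A real translator would run on the token stream of a proper lexer rather than on raw text, so that literals inside strings and comments are left alone.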
Constant delimiters 123'456'789 is there any reason why it can't be ,
(granted some cultures use "," for "." in FP numbers but M2 doesn't)
You would need to use whitespace to differentiate between:

numbers with digit separators

123,345,789

and comma separated lists of numbers

123, 345, 789


This would increase opportunity for error.

Also, it would decrease readability when you have comma separated lists of numbers with digit separators in them

123,456,789, 321,654,987, ...

Apart from that, it requires two-character lookahead when lexing numbers. Granted we have to use two-character lookahead already when encountering a "." in a number to distinguish decimal point "." from ".." but this shouldn't be considered a get-out-of-jail-free card for more cases needing such disambiguation.
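
As an illustration only, here is a small Python sketch of a number scanner that uses the apostrophe as digit separator and decides on the current character alone, without backtracking. The function names lex_number and lex_list are hypothetical, and the sketch handles plain decimal whole numbers only. Because the comma never occurs inside a number, a comma separated list reads the same with or without whitespace.

```python
def lex_number(src, pos=0):
    """Scan decimal digits with apostrophe separators starting at src[pos].

    Only the current character is inspected at each step (no backtracking).
    Returns the lexeme and the index just past it.
    """
    start = pos
    expect_digit = True          # a number must start with a digit
    while pos < len(src):
        ch = src[pos]
        if ch.isdigit():
            expect_digit = False
        elif ch == "'" and not expect_digit:
            expect_digit = True  # a separator must be followed by a digit
        else:
            break
        pos += 1
    if expect_digit:
        raise ValueError("digit expected at index {}".format(pos))
    return src[start:pos], pos

def lex_list(src):
    """Read a comma separated list; a comma is always a list separator."""
    values, pos = [], 0
    while pos < len(src):
        if src[pos] in ", ":
            pos += 1
            continue
        lexeme, pos = lex_number(src, pos)
        values.append(int(lexeme.replace("'", "")))
    return values

print(lex_list("123'456'789, 321'654'987"))  # two numbers
print(lex_list("123,456,789"))               # three numbers
```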

Other languages that provide digit separators often use "_" as separator, for example Ada uses it. We toyed with that but found it looks less "natural" than using apostrophe as separator.

123_456_789

vs

123'456'789

I hope this makes sense.
Marco van de Voort
2016-08-30 12:17:05 UTC
Post by trijezdci
I'd hope there is a pragma to permit old style constants or you
immediately orphan all pre-existing Modula 2 code by requiring rework
just to try it out with the new compiler.
I am assuming you mean "number literals" when you say "old style constants".
We decided to abandon suffixed literals because they violate the very
design philosophy Wirth has promoted in all of his compiler design texts,
namely that design should follow (a) human readability and (b) single
input symbol lookahead.
All literals are one symbol (done by tokenizer/scanner, not parser), so IMHO
this is a bogus argument. And I don't think readability is better or worse.

So basically it leaves C-ification as the unmentioned but clear driving
force. Or, at best, a new or changed syntax as a goal in itself.
trijezdci
2016-08-30 13:31:34 UTC
Post by Marco van de Voort
Post by trijezdci
We decided to abandon suffixed literals because they violate the very
design philosophy Wirth has promoted in all of his compiler design texts,
namely that design should follow (a) human readability and (b) single
input symbol lookahead.
All literals are one symbol (done by tokenizer/scanner, not parser),
Scientifically speaking there is no difference between lexing and parsing. The distinction is entirely arbitrary. It is merely made for the convenience of the implementor. Some parsing methods don't make the distinction. See Dick Grune's seminal work on parsing techniques for a more detailed discussion.
Post by Marco van de Voort
And I don't think readability is better or worse.
Readability, or at least some effects of it (or lack thereof) can be measured.

Suffix notation causes more eye movement which increases mental load which is an indicator for lesser readability. The effect is not noticeable with very short literals such as 040H, but it increases significantly with literal length.

Consequently this wasn't a serious issue in an era where all hardware was limited to an address space that could be encoded in four digit base-8 numbers.

However, in this day and age we are dealing with significantly longer literals where it is an issue and we addressed this by switching to the superior prefix notation of C.

There are a very few occasions where Kernighan and Ritchie simply had a better idea than Wirth. This is one of them. We can choose to be mature and acknowledge when somebody else simply had the better idea even if we don't like much else about that somebody's work. Or we can choose to be dismissive of the evidence. The latter will only hurt ourselves.
Marco van de Voort
2016-08-30 15:16:40 UTC
Post by trijezdci
Post by Marco van de Voort
Post by trijezdci
namely that design should follow (a) human readability and (b) single
input symbol lookahead.
All literals are one symbol (done by tokenizer/scanner, not parser),
Scientifically speaking there is no difference between lexing and parsing.
The distinction is entirely arbitrary. It is merely made for the
convenience of the implementor. Some parsing methods don't make the
distinction. See Dick Grune's seminal work on parsing techniques for a
more detailed discussion.
Regardless of wordplay, was Wirth's remark actually in such wider context?
Post by trijezdci
Post by Marco van de Voort
And I don't think readability is better or worse.
Readability, or at least some effects of it (or lack thereof) can be measured.
Suffix notation causes more eye movement which increases mental load which
is an indicator for lesser readability. The effect is not noticeable with
very short literals such as 040H, but it increases significantly with
literal length.
Afaik that only counts for untrained people. In trained people the shift
from characterwise to more wordwise reading compensates.

This is why natural languages also have suffixes.
trijezdci
2016-08-30 16:37:58 UTC
Post by Marco van de Voort
Post by trijezdci
Scientifically speaking there is no difference between lexing and parsing.
The distinction is entirely arbitrary. It is merely made for the
convenience of the implementor. Some parsing methods don't make the
distinction. See Dick Grune's seminal work on parsing techniques for a
more detailed discussion.
Regardless of wordplay, was Wirth's remark actually in such wider context?
I do not accept the premise that there is a fundamental difference between lexing and parsing. As Dick Grune's work will attest, I am in good company.

Under this premise there is then no such thing as a narrower or wider context. There is only one context and that is the context of symbol stream processing.

What justification would there be for saying that a sound principle of efficient symbol stream processing (read one symbol ahead at a time to decide how to proceed, without having to backtrack) is worth adhering to when the symbols are tokens representing lexemes of length > 1, but not worth adhering to when the symbols are characters that represent themselves, that is, lexemes of length = 1?
Post by Marco van de Voort
Post by trijezdci
Suffix notation causes more eye movement which increases mental load which
is an indicator for lesser readability. The effect is not noticeable with
very short literals such as 040H, but it increases significantly with
literal length.
Afaik that only counts for untrained people. In trained people the shift
from characterwise to more wordwise reading compensates.
Like I said, the effect increases with length. Eye movement studies with texts in languages with very long words such as Cymraeg (aka Welsh), Finnish and German have shown that the effect remains even for proficient readers.
Post by Marco van de Voort
This is why natural languages also have suffixes.
And the vast majority of suffixed use cases in natural languages are short words. Aptly, briefly, candidly, distantly, easily, fairly, grisly, highly, etc etc etc. Very long words are generally nouns and they are used without suffixes. Fussbodenschleifmaschinenverleih, not fussbodenschleifmaschinenverleihlich.
Marco van de Voort
2016-08-31 08:29:55 UTC
Post by trijezdci
Post by Marco van de Voort
Post by trijezdci
convenience of the implementor. Some parsing methods don't make the
distinction. See Dick Grune's seminal work on parsing techniques for a
more detailed discussion.
Regardless of wordplay, was Wirth's remark actually in such wider context?
I do not accept the premise that there is a fundamental difference between lexing and parsing. As Dick Grune's work will attest, I am in good company.
Well, it does matter if it is one char or one symbol lookahead.
Post by trijezdci
Post by Marco van de Voort
This is why natural languages also have suffixes.
And the vast majority of suffixed use cases in natural languages are short
words. Aptly, briefly, candidly, distantly, easily, fairly, grisly,
highly, etc etc etc. Very long words are generally nouns and they are
used without suffixes. Fussbodenschleifmaschinenverleih, not
fussbodenschleifmaschinenverleihlich.
Those are aggregates, not single words. Moreover -lich is a suffix. (-like)

Anyway, most constants remain short. A few rare times you define a mask
with all like $FFFF or $7FFF (and its 64-bit variants), and those are
repeating.

I still think the whole reasoning for this change is totally bogus and
dragged-by-the-hairs.
trijezdci
2016-08-31 10:58:16 UTC
Post by Marco van de Voort
Well, it does matter if it is one char or one symbol lookahead.
No it doesn't. Processing the symbol stream treats either as a single unit, which is the whole reason to use a tokeniser as a front end to a parser.
Post by Marco van de Voort
Those are aggregates, not single words.
The point was that they are nouns.
Post by Marco van de Voort
Moreover -lich is a suffix. (-like)
And I used it as such.
Post by Marco van de Voort
Anyway, most constants remain short. A few rare times you define a mask
with all like $FFFF or $7FFF (and its 64-bit variants), and those are
repeating.
Funny that you chose to use prefixed literals there apparently out of preference while at the same time calling that very choice "totally bogus".

In any event, the world has moved on to those totally bogus prefix literals in just about every notation imaginable while accepting the arguments presented in their favour. By contrast, those who are attached to not totally bogus or totally not bogus suffix literals are in a vanishingly tiny minority.
r***@gmail.com
2016-09-01 23:42:09 UTC
Hi,
Post by trijezdci
Post by Marco van de Voort
Anyway, most constants remain short. A few rare times you define
a mask with all like $FFFF or $7FFF (and its 64-bit variants),
and those are repeating.
Funny that you chose to use prefixed literals there apparently out
of preference while at the same time calling that very choice
"totally bogus".
He's a core FreePascal member, so it's not surprising that he mentions
(Turbo Pascal-ish) '$' (hex) prefixes. Heck, I almost mentioned it myself!
Post by trijezdci
In any event, the world has moved on to those totally bogus prefix
literals in just about every notation imaginable while accepting the
arguments presented in their favour. By contrast, those who are attached
to not totally bogus or totally not bogus suffix literals are in a
vanishingly tiny minority.
Intel-style x86 assembly still overwhelmingly uses 0Ah notation for hex.
Many assemblers support various styles (including all of those mentioned
above), but it's still very common to see, e.g. "stack 8000h" or
"test al,1100b" (FASM).
trijezdci
2016-09-02 00:50:05 UTC
Post by r***@gmail.com
Post by trijezdci
Funny that you chose to use prefixed literals there apparently out
of preference while at the same time calling that very choice
"totally bogus".
He's a core FreePascal member, so it's not surprising that he mentions
(Turbo Pascal-ish) '$' (hex) prefixes. Heck, I almost mentioned it myself!
Yes I am aware of that. I pointed this out because of the apparent hypocrisy involved: If Borland or the FPC project make that choice, apparently that is alright. But if we make that choice then it is "totally bogus". This ain't right.
Marco van de Voort
2016-09-02 09:28:26 UTC
Post by trijezdci
Post by r***@gmail.com
He's a core FreePascal member, so it's not surprising that he mentions
(Turbo Pascal-ish) '$' (hex) prefixes. Heck, I almost mentioned it myself!
Yes I am aware of that. I pointed this out because of the apparent
hypocrisy involved: If Borland or the FPC project make that choice,
apparently that is alright. But if we make that choice then it is
"totally bogus". This ain't right.
"Totally bogus" was a reference to the reasons to change it, not suffix
syntax in general. To my best knowledge Borland (and related dialects)
never changed from suffix to prefix.

If tomorrow sb arrived on the FPC lists/forum with similar arguments to
change the syntax to suffix, I'd react the same.
trijezdci
2016-09-02 12:07:20 UTC
Post by Marco van de Voort
Post by trijezdci
Yes I am aware of that. I pointed this out because of the apparent
hypocrisy involved: If Borland or the FPC project make that choice,
apparently that is alright. But if we make that choice then it is
"totally bogus". This ain't right.
"Totally bogus" was a reference to the reasons to change it, not suffix
syntax in general. To my best knowledge Borland (and related dialects)
never changed from suffix to prefix.
If tomorrow sb arrived on the FPC lists/forum with similar arguments to
change the syntax to suffix, I'd react the same.
That's a strawman. If it really was as you claim, that would be an argument from indifference, but you didn't display any indifference. In fact, the words "totally bogus" alone do not go along with indifference.

A true argument from indifference might have been something along the following lines ...

"I do not believe the effects of the research you mentioned to be that strong and important, therefore I believe this is more of a matter of preference than it is a matter of readability. But at the end of the day I can live with either choice."
Marco van de Voort
2016-09-02 13:47:19 UTC
Post by trijezdci
Post by Marco van de Voort
If tomorrow sb arrived on the FPC lists/forum with similar arguments to
change the syntax to suffix, I'd react the same.
That's a strawman.
"I don't subscribe to your linguistic analysis of MY answer."

I hope I have now put it in a format you can understand.
trijezdci
2016-09-02 18:39:48 UTC
Post by Marco van de Voort
"I don't subscribe to your linguistic analysis of MY answer."
I hope I have now put it in a format you can understand.
I was trying to be polite but you just don't seem to get it.

YOU WERE BEING RUDE with your language. And in case you were wondering, the clue lies in the words "totally bogus". Please learn some manners.
Marco van de Voort
2016-09-03 14:10:15 UTC
Post by trijezdci
Post by Marco van de Voort
I hope I have now put it in a format you can understand.
I was trying to be polite but you just don't seem to get it.
YOU WERE BEING RUDE with your language.
Pot, meet kettle. It was a reaction to your aggressive stance, to the
so-called evidence and, more importantly, to the application of it to
suffix literals in languages.

The literals are typically very short (and IMHO 64-bit compilers don't
change that much; I can't remember the last time I hardcoded a pointer
value) and, besides that, simple in structure. A round number like 10000
or something repetitive like 99999 is much more common than a random
number like 153872, and clearly falls in the category of being read in
one glance, wordwise.

So your whole reasoning sounded like something pulled out of a hat to
justify some decision already taken on other grounds.

I still stand by that opinion, and this is my last response on the matter.
trijezdci
2016-09-03 17:02:00 UTC
[Post hidden in the archive.]
r***@gmail.com
2016-09-02 21:27:38 UTC
Hi,

Just to refocus this thread on technical concerns ....
Post by Marco van de Voort
Post by trijezdci
Post by r***@gmail.com
He's a core FreePascal member, so it's not surprising that he mentions
(Turbo Pascal-ish) '$' (hex) prefixes. Heck, I almost mentioned it myself!
Yes I am aware of that. I pointed this out because of the apparent
hypocrisy involved: If Borland or the FPC project make that choice,
apparently that is alright. But if we make that choice then it is
"totally bogus". This ain't right.
"Totally bogus" was a reference to the reasons to change it, not suffix
syntax in general. To my best knowledge Borland (and related dialects)
never changed from suffix to prefix.
If tomorrow sb arrived on the FPC lists/forum with similar arguments to
change the syntax to suffix, I'd react the same.
Overall, it doesn't really matter. Small changes like this are the least
of anyone's worries.

I still prefer 0--h for hex, but that's just me. I've seen people use
0x and $ in Intel assembly, but I always found it odd. The overwhelming
majority seems to prefer 0--h there. Of course, I also shun AT&T syntax,
but some people still prefer that (even though GAS has supported both
for many years), so who knows.

Just for clarity, FPC (actually, 3.0.0's ppcross8086) supports inline
assembly, but it only supports 0--h or $ and thus not 0x at all.
(This may vary based upon bin writer or external assembler, of course.
I honestly don't know.)

Here, I'll just point to this to pretend to be exhaustive:
http://www.freepascal.org/docs-html/ref/refse6.html

How does Ada do it? A quick search shows 16#FF# for 255. Similar (but
not quite) to Extended Pascal (16#FF) and Modula-3 (16_FF).
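
For what it's worth, all three of those put the radix in front, so one small routine can read them. A minimal Python sketch follows; the name parse_based is made up, and the pattern is deliberately loose (it would also accept a mixed form like 16_FF# and it ignores Ada's underscores inside the digits).

```python
import re

# Radix first, then the digits:
#   Ada 16#FF#, Extended Pascal 16#FF, Modula-3 16_FF
_BASED = re.compile(r"^(\d+)([#_])([0-9A-Za-z]+)(#?)$")

def parse_based(literal):
    """Return the value of a radix-prefixed literal (simplified rules)."""
    m = _BASED.match(literal)
    if m is None:
        raise ValueError("not a based literal: {!r}".format(literal))
    radix, _sep, digits, _close = m.groups()
    return int(digits, int(radix))   # int() accepts bases 2..36

for lit in ("16#FF#", "16#FF", "16_FF", "2#1010#"):
    print(lit, "=", parse_based(lit))
```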

But almost anything is better than octal! :-P
trijezdci
2016-09-02 23:36:29 UTC
Post by r***@gmail.com
Just to refocus this thread on technical concerns ....
Overall, it doesn't really matter. Small changes like this are the least
of anyone's worries.
Have you heard of TQM? Or Six Sigma, Kaizen, Lean?

These are quality management methodologies whose aim it is to reliably raise quality to levels so close to 100% that it is for all practical purposes indistinguishable from 100% even if 100% is not actually reachable.

Six Sigma for example sets a goal of 99.99966% quality or 3.4 defects in one million opportunities. When you get so close to 100, the weirdest things happen. If you are coming from a natural science background, a good analogy might be quantum mechanics. Observable events simply seem to defy common sense.

For example, a car factory under such a quality regime might find that their quality significantly dropped below target after they changed the subcontractor who cleans the towels for the towel drums in the bathrooms. Nobody can figure out what's wrong with those towels, they seem just fine, but the change in the towel cleaning firm messes up the quality of the cars made at the plant.

It would seem totally whacko but this is the kind of thing you will observe when taking quality to the extreme. There is nothing that can be ruled out as "doesn't really matter". Everything has a potential effect on the product. Everything matters, no matter how tiny.

The management methodologies deal predominantly with measurement and attitude. Since you cannot really predict what whacko thing will mess up your quality target, you need to cultivate an attitude where everyone is alert and nobody considers anything as "doesn't really matter".

We have taken that attitude to heart during our revision and design. To us, every tiny little whacko thing matters. If we can avoid only one bug in 1 million lines of code by reducing every imaginable ever so tiny opportunity for error/defect, then it will have been worth the bother.

Against this background, prefix literals are among the more important changes in our revision and they will have a significant impact, regardless of what some people who have expressed opinions to the contrary here think. You are entitled to your opinion, but can you present any scientific research to back it up? We have spent quite a bit of time over the years reading various research that has led us to accept that these types of design decisions have an impact significant enough to bother with.

Last but not least, mental load is something that adds up, little by little. Small distractions lower our ability to handle mental load. Being tired or frustrated about something will further lower it. An ever so slightly distracting feature in a notation that didn't cause a bug on a good day will cause a bug on a not so good day. It is all about reducing opportunity for error/defect.
Post by r***@gmail.com
I still prefer 0--h for hex, but that's just me.
Not all suffixes are equally bad. I'm afraid to say so, but classic Modula-2's suffix literals feature an accumulation of the worst possible design choices.

Digits 0-9, capital A-F and H all have a similar visual footprint, using the full ascent and filling every corner of the bounding box. This makes the H suffix blend in with the digits. It doesn't stand out at all.

By contrast, in traditional assembly notation, where the base-16 specific digits A-F are also capitalised but a lowercase h is used, the h has a much smaller footprint because although it uses the full ascent, its ascender is but a single arm and it doesn't fill every corner of the bounding box. This makes the lowercase h stand out significantly better than the capital H.

In other words, 123Fh is far more readable than classic Modula-2's 123FH.

Likewise, 0abcdh is far worse than 0ABCDh.

For the same reason, 0xabcd and 0XABCD are both far worse than 0xABCD.

And the same holds true for decimal numbers with exponents: 1.23e45 is better than 1.23E45.

The low readability is greatly exacerbated by the fact that the suffixes for base-8 literals are also valid digits in base-16 literals in classic Modula-2. This is the WORST design for literals of any kind in any programming language.
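
To make the clash concrete, here is a small Python sketch. It assumes the classic rules are simply "H means base-16, B means base-8, C means character code in base-8", and the name classic_kind is made up for the example. Note how B and C flip from suffix to digit as soon as an H follows.

```python
def classic_kind(literal):
    """Classify a classic Modula-2 style literal by its LAST character."""
    body, last = literal[:-1], literal[-1]
    if last == "H":
        return ("hex", int(body, 16))
    if last == "B":
        return ("octal", int(body, 8))
    if last == "C":
        return ("char code (octal)", int(body, 8))
    return ("decimal", int(literal))

# The same leading characters mean different things depending on the end.
for lit in ("12B", "12BH", "12C", "12CH", "12"):
    print(lit, "->", classic_kind(lit))
```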

This could of course be eliminated by simply removing base-8 literals, but then you still need to switch from H to h and once you have done that, it's incompatible anyway, so you may as well go a little further still and use prefix literals. As I said, it is all about reducing every possible opportunity for distraction and ultimately error/defect.

Also, in embedded development (which is one area we specifically wanted to support) it is quite common to use binary literals. When you add those, then there is the question what suffix to use. If the letter B is to be used, again, 0110b is far better than 0110B, especially so since B is also a valid base-16 digit. But here again, once we've broken compatibility we may as well go the whole distance and use prefix literals.

Last but not least, you want to use a different prefix or suffix again for character code points. If you were to use the letter C, it should at least be lowercase: 040c is far better than 040C for the aforementioned reasons. But if you switch the radix and keep the same suffix, there is great potential for confusion about the radix. In the days of Unicode we may as well use the letter U though and again, lowercase is better than uppercase for prefixes/suffixes. 040u or 0u40 is better than 040U or 0U40.
Post by r***@gmail.com
0x and $ in Intel assembly, but I always found it odd. The overwhelming
majority seems to prefer 0--h there. Of course, I also shun AT&T syntax,
but some people still prefer that (even though GAS has supported both
for many years), so who knows.
With non-letter prefixes or suffixes it is difficult to find three separate and distinct symbols (base-2, base-16, unicode) you could use as prefix or suffix and still have some mnemonic value to hint at the meaning. However, non-letter prefixes and suffixes have the advantage that they stand out more. There you have to make a trade-off between mnemonic value and visual cue value.
Post by r***@gmail.com
How does Ada do it? A quick search shows 16#FF# for 255. Similar (but
not quite) to Extended Pascal (16#FF) and Modula-3 (16_FF).
Indeed, Ada, Extended Pascal and Modula-3 all use PREFIX literals.

Their designers all made that "totally bogus" choice.

Perhaps it was the other way round and Modula-2 took the odd choice here.
r***@gmail.com
2016-09-03 23:47:59 UTC
Hi,
Post by trijezdci
Post by r***@gmail.com
Overall, it doesn't really matter. Small changes like this are
the least of anyone's worries.
It would seem totally whacko but this is the kind of thing you
will observe when taking quality to the extreme. There is nothing
that can be ruled out as "doesn't really matter". Everything has
a potential effect on the product. Everything matters, no matter
how tiny.
I still don't think it's worthy of major attention or worry. Bugs
happen, but you're more likely to run into other issues (e.g.
mishandled dynamic memory) than this.

But, in hindsight, I agree that these problems can occur
if you're not careful.

Let me just show a small example from one old assembler:

===========================================================
; A86
org 100h ; hex

mov ax,111b ; binary
mov bx,0111b ; leading zero is hex (default, but +D uses binary)
mov cx,0111xb ; binary
mov dx,111bxh ; hex

nop

;radix 10
mov ax,5150 ; decimal

radix 16
mov ax,5150 ; hex

radix 8
mov ax,5150 ; octal

ret
===========================================================

So the problem is several-fold:

* trying to be parsimonious (as dmr remarked about both himself and Wirth)
* trying to be (mostly) compatible
* trying to be unambiguous
* having defaults that can be (obscurely) changed
* non-standard (thus unfamiliar) extensions

Most assemblers handle "0011b" as binary just fine. I guess EJI
considered it a waste of space to specify leading zeros for binary.
Thus, he decided (by default, although this can be changed with
cmdline switch) to make leading zero indicate hex! AFAIK, nobody
else does that. Perhaps since 0[A-F]\+[hH] needs it anyways that
he figured he could save the 'h' suffix entirely. Of course, he
also allowed unambiguous 1100xb, but that's unsupported in
most other assemblers.
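
To spell those defaults out, here is a rough Python sketch of the rules as I described them in the listing's comments. It is based only on those comments, not on the A86 manual, it does not model the +D switch, and the name a86_value is made up.

```python
def a86_value(literal, default_radix=10):
    """Interpret a literal following the defaults described in the comments.

    Based only on the comments in the listing, not on the A86 manual.
    """
    text = literal.lower()
    if text.endswith("xb"):                 # explicit binary marker
        return int(text[:-2], 2)
    if text.endswith("xh"):                 # explicit hex marker
        return int(text[:-2], 16)
    if text.endswith("h"):                  # ordinary hex suffix
        return int(text[:-1], 16)
    if text.startswith("0") and len(text) > 1:
        # Leading zero means hex, so a final 'b' is a hex digit here.
        return int(text, 16)
    if text.endswith("b"):                  # no leading zero: binary suffix
        return int(text[:-1], 2)
    return int(text, default_radix)         # plain digits: current radix

for lit in ("111b", "0111b", "0111xb", "111bxh", "100h"):
    print(lit, "=", a86_value(lit))
for radix in (10, 16, 8):
    print("5150 with radix", radix, "=", a86_value("5150", radix))
```

The second and fourth literals come out as the same hex value, which is exactly the surprise for anyone expecting 0111b to be binary.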

So what do you do? Well, for one old piece of code that I
was sharing between assemblers, I just always used hex,
no binary at all. Of course, you can always use a third-party
preprocessor (or sed or whatever) before-the-fact to translate
minor stuff like this. But it seemed easier to just use hex.
(Especially since char/word literals like 'DC' were in
different endian order depending on assembler, so that was
problematic as well.)

Assemblers themselves are (mostly) case insensitive (like Pascal),
so it's not true that 'h' is always preferred (even if I partially
agree with you).

BTW, although it was supported, I never saw anybody use octal.
Oh, just for the record, octal supposedly helps with decoding x86
opcodes, but most people (myself included) don't see much
convenience there. Even octal dump (od), I normally just use in
hex (-v -Ax -tx1). (Actually, I wrote my own, but hex output only!)

I don't know if any of this proved any points, but I still thought
it was interesting.
trijezdci
2016-09-04 09:35:59 UTC
Post by r***@gmail.com
I still don't think it's worthy of major attention or worry. Bugs
happen, but you're more likely to run into other issues (e.g.
mishandled dynamic memory) than this.
But, in hindsight, I agree that these problems can occur
if you're not careful.
The point is that while insignificant looking items may have less impact when they are seen in ISOLATION, they have a significantly higher impact when PUT TOGETHER. The sheer number of little things contribute significantly to the total. Therefore, it is worthwhile also paying attention to minute details.
Post by r***@gmail.com
mov ax,111b ; binary
mov bx,0111b ; leading zero is hex (default, but +D uses binary)
Well, that's not even an argument about prefix or suffix because it uses both prefix and suffix at the same time and they are in conflict with each other.

That's just bad design. A design should stick to either prefix or suffix.
Overall, I would consider inconsistency to be the biggest contributor to opportunity for error, so the first rule should be to be consistent and to follow the principle of least surprise.
Post by r***@gmail.com
* trying to be parsimonious (as dmr remarked about both himself and Wirth)
Certainly nobody can accuse us of having been parsimonious. We kept revisiting and refining even minute details again and again.
Post by r***@gmail.com
* trying to be (mostly) compatible
Indeed, backwards compatibility is more often a curse than it is a blessing.
And if you break it, better break it for good; don't make it look like it is still compatible when it actually isn't. This comes down to the principle of least surprise again.

Our design principles expressly state that backwards compatibility must not interfere with any other design goals. It has the lowest priority by definition.
Post by r***@gmail.com
* trying to be unambiguous
I disagree on that one. Ambiguity is a very significant contributor to error. The problem arises when you allow other less important goals such as backwards compatibility to interfere and the only way to reconcile conflicting design goals is to produce an unnecessarily complex and confusing design.
Post by r***@gmail.com
* having defaults that can be (obscurely) changed
Aka violating the principle of least surprise.
Post by r***@gmail.com
* non-standard (thus unfamiliar) extensions
Standard does not always mean familiarity. De facto standard perhaps. But the de facto standard for number literals is the 0x, 0u, 0b prefix convention. Many more languages use that than any other. And millions of practitioners across a large spectrum of languages are familiar with it. By contrast, there are only a few hundred people on the entire planet who are familiar with classic Modula-2's literals. Thus, this would be a strong argument in favour of the former.
Post by r***@gmail.com
Assemblers themselves are (mostly) case insensitive (like Pascal),
so it's not true that 'h' is always preferred (even if I partially
agree with you).
I didn't say it is preferred. I said 0123Fh is better than 0123FH. Even if the translator itself is case insensitive, using a lowercase h even in the presence of capitalised A-F digits is certainly a widely used convention. By contrast, Modula-2, being case sensitive, does not allow you to use that convention.
Post by r***@gmail.com
BTW, although it was supported, I never saw anybody use octal.
Octal notation was only ever useful on the 12-bit, 18-bit and 36-bit architectures of the 1950s and 1960s because the bit widths of character codes, words and addresses on these systems were all divisible by three. They could be presented in groups of three bits and each 3-bit group could be conveniently expressed in base-8.

6-bit character codes:
00 = 000 000
77 = 111 111

12-bit words and addresses:
0000 = 000 000 000 000
7777 = 111 111 111 111

18-bit words and addresses:
000:000 = 000 000 000 | 000 000 000
777:777 = 111 111 111 | 111 111 111

36-bit addresses:
0000:0000:0000 = 000 000 000 000 | 000 000 000 000 | 000 000 000 000
7777:7777:7777 = 111 111 111 111 | 111 111 111 111 | 111 111 111 111

As you can see, using base-8 was as convenient and useful back then as base-16 is to us today.

But this all changed when architectures moved to register sizes and addresses based on multiples of eight.

for 8-bit character codes:
0 0 0 = 00 000 000
3 7 7 = 11 111 111

is far less convenient and useful than

00 = 0000 0000
FF = 1111 1111

for 16-bit words and addresses:
0 0 0 0 0 0 = 0 000 000 000 000 000
1 7 7 7 7 7 = 1 111 111 111 111 111

is far less convenient and useful than

0000 = 0000 0000 0000 0000
FFFF = 1111 1111 1111 1111

and it gets more and more inconvenient for larger multiples of eight.
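
The groupings above can be reproduced with a few lines of Python. This is just an illustration; the helper name grouped_bits is made up. Grouping a 16-bit value in threes always leaves a ragged single bit at the front, while grouping in fours comes out even.

```python
def grouped_bits(value, width, group):
    """Render value as `width` bits, split into `group`-bit groups from the right."""
    bits = format(value, "0{}b".format(width))
    parts = []
    while bits:
        parts.insert(0, bits[-group:])   # take the rightmost group
        bits = bits[:-group]
    return " ".join(parts)

# 16-bit word: base-8 grouping versus base-16 grouping
print(format(0xFFFF, "o"), "=", grouped_bits(0xFFFF, 16, 3))
print(format(0xFFFF, "X"), "=", grouped_bits(0xFFFF, 16, 4))
# 12-bit word: base-8 grouping fits exactly
print(format(0o7777, "o"), "=", grouped_bits(0o7777, 12, 3))
```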


However, the programming language designers of the 1960s and 1970s, including Wirth and Ritchie, learned their trade on architectures with 6-bit character codes and 12-bit, 18-bit and 36-bit addressing. Base-8 notation was as natural to them as Base-16 notation is to us today. Their inertia clinging on to base-8 for longer than necessary is thus not surprising.

Also, C inherited its notation from B, which was initially developed on a PDP-7 (18-bit architecture) where it made sense to use base-8. By contrast, Modula-2 was first developed on a PDP-11 (16-bit architecture) where it made no sense to use base-8. The adoption of base-8 literals in Modula-2 comes down to inertia.

Of course Wirth later omitted base-8 in Oberon. Thirty years on, it is about time we removed base-8 from all languages. In our day and age it has no practical use whatsoever. It has become a ridiculous artefact.
r***@gmail.com
2016-09-07 22:25:46 UTC
Permalink
Marco van de Voort
2016-09-08 02:53:33 UTC
Permalink
Post by r***@gmail.com
I mostly agree, though I'm sure someone somewhere can still make
a case for octal. But surely it shouldn't be preferred over hex.
The only non freak reason I can think of are Unix permissions.
trijezdci
2016-09-08 03:08:33 UTC
Permalink
Post by r***@gmail.com
Post by trijezdci
Post by r***@gmail.com
mov ax,111b ; binary
mov bx,0111b ; leading zero is hex (default, but +D uses binary)
Well, that's not even an argument about prefix or suffix because
it uses both prefix and suffix at the same time and they are in
conflict with each other.
I wouldn't consider '0' a prefix for binary. Most assemblers treat
it as part of the number, not a special char. My point was that I
assume he thought extra 0000 (before any 1) was superfluous, thus
the (convenient? confusing!) adjustment for other bases (which
thankfully can be overridden).
I don't want to get stuck on terminology here.

My advocacy is for the LL(1) principle.

A notation that lets you discover the type of a compound symbol at the start of the symbol is superior to a notation that forces you to read all the atomic symbols of the compound symbol until you reach the end before you know the type. That's the gist of it.

In left-to-right writing systems, this means whatever indicates the type of the literal should be leftmost. In right-to-left writing systems (such as Arabic) it means whatever indicates the type of the literal should be rightmost.
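To make the difference concrete, here is a rough Python sketch that counts how many characters a scanner has to look at before it knows what kind of whole-number literal it is reading. It is an illustration only, not code from M2C. The prefix table (0x, 0u, 0b) follows the convention mentioned elsewhere in this thread, the suffix table (H, B, C) is the classic PIM one, and both function names are hypothetical. Real numbers and digit separators are ignored.

```python
PREFIX_KINDS = {"x": "hex number", "u": "character code", "b": "binary number"}
SUFFIX_KINDS = {"H": "hex number", "B": "octal number", "C": "octal character code"}


def classify_prefixed(literal):
    """Return (kind, number of characters examined before the kind is known)."""
    if literal[0] != "0":
        return "decimal number", 1          # a leading 1..9 settles it at once
    if len(literal) > 1 and literal[1] in PREFIX_KINDS:
        return PREFIX_KINDS[literal[1]], 2  # '0' plus one more character
    return "decimal number", min(len(literal), 2)


def classify_suffixed(literal):
    """Return (kind, number of characters examined before the kind is known).

    B and C are also valid hex digits, so nothing short of the final
    character tells the scanner which base it has been reading."""
    kind = SUFFIX_KINDS.get(literal[-1], "decimal number")
    return kind, len(literal)


if __name__ == "__main__":
    for text in ("0xFFFF", "0b1010", "123"):
        print(text, classify_prefixed(text))
    for text in ("0FFFFH", "377B", "123"):
        print(text, classify_suffixed(text))
```

With the prefixed form the cost is constant, whereas with the suffixed form it grows with the length of the literal, and the digits read so far cannot be converted to a value until the suffix has been seen.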

The translation of my earlier statement
Post by r***@gmail.com
Post by trijezdci
That's just bad design. A design should stick to either prefix or suffix.
is thus: A notation where the type of a compound symbol is indicated both by the leftmost and rightmost atomic symbols where the two can be in conflict is just bad design. A notation should stick to either leftmost or rightmost.
Post by r***@gmail.com
Post by trijezdci
Indeed, backwards compatibility is more often a curse than it is
a blessing. And if you break it, better break it for good, don't
make it look like it still is compatible when it actually isn't.
This comes down to the principle of least surprise again.
I consider compatibility a noble virtue, but sometimes the burden
is too heavy. For small projects, it's probably easier to eschew
any formalities and just hack out whatever works.
My observation has been that backwards compatibility is far more costly than breaking clean and providing a translator tool to convert legacy code or data.

When backwards compatibility conflicts with what would be good design decisions, it incurs what is known as technical debt. And like financial debt, you incur interest on technical debt. Also, the longer you carry the debt around, the more expensive the interest payments.

A translator tool and conversion of legacy code and data is equivalent to paying off the debt early and being free from interest payments in the future.
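As a rough idea of how small the literal-conversion part of such a translator can be, here is a Python sketch. It is not the planned tool and makes several assumptions: it works on raw text with a regular expression (a real translator would use the compiler's lexer so that strings and comments are left alone), it assumes the prefixed targets are 0x for numbers and 0u for character codes with uppercase hex digits, and it re-expresses octal values in hex because the target has no base-8 notation. The names LITERAL, to_prefixed and translate are made up for this example.

```python
import re

# PIM-style suffixed literals: hex digits + H, octal digits + B (number)
# or + C (character code). Decimal literals have no suffix and are untouched.
LITERAL = re.compile(r"\b([0-9][0-9A-F]*)H\b|\b([0-7]+)([BC])\b")


def to_prefixed(match):
    """Rewrite one suffixed literal as a prefixed one (assumed target syntax)."""
    hex_digits, octal_digits, octal_suffix = match.groups()
    if hex_digits is not None:
        return "0x%X" % int(hex_digits, 16)
    value = int(octal_digits, 8)
    if octal_suffix == "C":
        return "0u%X" % value   # character code
    return "0x%X" % value       # octal number, re-expressed in hex


def translate(source):
    """Replace every suffixed literal in `source` with a prefixed equivalent."""
    return LITERAL.sub(to_prefixed, source)


if __name__ == "__main__":
    print(translate("mask := 0FFH; perm := 377B; cr := 15C; n := 42;"))
```

Running it on the sample line rewrites 0FFH and 377B to 0xFF and 15C to 0uD, and leaves the decimal 42 as it is.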
Post by r***@gmail.com
But overall, I'd prefer (de facto or de jure) "standard". This
is why some compilers support various competing dialects. Sure,
it's harder to achieve good compatibility (accept good code,
reject bad code), but overall I think it's worth it.
(GPC and FPC are good exemplars of this virtue.)
Compatibility across compilers was more important in the days when compilers were expensive proprietary products that often only worked on a single platform and targeted a single platform. This way you had to use a different compiler from a different vendor to deploy to a different target platform.

In the days of open source compilers (or compiler infrastructure backends) which are written to run on multiple platforms and generate code for multiple platforms this is no longer as important as it once was.

Our bootstrap compiler (or compiler suite) supports PIM3 and PIM4 because the bulk of Modula-2 literature is based on PIM3 and PIM4 and those books are neither going to be updated nor are they going to be replaced any time soon. But for bootstrapping to M2R10 we added an extended mode with select features from R10 because it is more convenient to write the self-hosting compiler in a subset of itself.

Yet if you want cross platform portability, you will not find any M2 compiler that can be deployed to and target as many platforms.

M2C is written in C99 and generates C99 which makes it already very portable, but for targeting the JVM and the CLR I have begun cloning two derivative compilers, M2J (transcribed to Java and generating Java) and M2Sharp (transcribed to C# and generating C#).

The self-hosting compilers bootstrapped from them will generate LLVM, JVM bytecode and CLR bytecode respectively. Plus, Gaius Mulley has pledged to implement M2R10 in GM2, thereby also providing access to gcc targets.

It doesn't get any more portable than this. Thus, the need to support different dialects of past and present M2 compilers is greatly diminished.
Post by r***@gmail.com
Post by trijezdci
Our design principles expressly state that backwards compatibility
must not interfere with any other design goals. It has the lowest
priority by definition.
Yes and no. I understand sometimes it's a heavy burden to shoehorn
everything into a small compatible subset. But overall I'm tired of
non-portable code.
Compiler-specific code isn't much better than assembly (regarding
portability). Anything that's worth doing is probably worth doing
portably (to as many targets as possible).
As I mentioned, it is far less expensive to build a converter to convert from other dialects. In fact, a PIM to R10 converter could be cloned from the existing M2C front end with relatively moderate effort. One of the most time consuming aspects of M2C was figuring out how to generate readable C output while trying to stick as much as possible to C naming conventions. How do you translate Modula-2 identifiers of a hierarchical namespace to C identifiers in a flat namespace? That had me bogged down for months and I am only now getting back to working on the actual code generator. When generating Modula-2 of one dialect from another, identifiers can be used verbatim, and structurally everything except variant records is a 1:1 mapping, thereby greatly reducing the effort required.

All the while, the self-hosting compilers will be much simpler as a result because they do not need to carry around code for different dialects, nor different backends except for the LLVM one which will also generate C. They will share the same source code for their front end though.
Post by r***@gmail.com
Post by trijezdci
Post by r***@gmail.com
* trying to be unambiguous
I disagree on that one. Ambiguity is a very significant contributor
to error.
I think you misunderstood (due to my poor wording). I'm in no way
advocating for ambiguity as a good thing.
Fair enough. I had a bit of trouble figuring out what you wanted to say. It didn't quite seem to fit. ;-)
Post by r***@gmail.com
Post by trijezdci
But the de facto standard for number literals is the
0x, 0u, 0b prefix convention. Many more languages use that than
any other. And millions of practitioners across a large spectrum
of languages are familiar with it.
Let's not overstate the universality of it. Nothing is totally
widely accepted everywhere.
I didn't say it is. But the 0x convention is several orders of magnitude more widespread than Wirth's H and X suffixes.
Post by r***@gmail.com
Post by trijezdci
By contrast, there are only a few hundred people on the entire
planet who are familiar with the literals of classic Modula-2.
Again, this might be a bit exaggerated (but the number has probably
decreased heavily since the '80s).
Maybe there are more than a thousand Modula-2 practitioners still around. Maybe there are more than a thousand Oberon practitioners around. Maybe the combined number is 5000. I think that would be extremely generous but even in that case, it is still several orders of magnitude less than the combined practitioners of languages using the 0x convention. And there ARE millions of Java/Csharp/C++ developers. Drone factories calling themselves universities are spitting them out in an assembly line process on an industrial scale. I am not exaggerating.
Post by r***@gmail.com
I mostly agree, though I'm sure someone somewhere can still make
a case for octal. But surely it shouldn't be preferred over hex.
I did make the case for octal. The only case there is to make: When using architectures whose register size and addresses are based on multiples of six.

There is no case to be made for octal on architectures whose register size and addresses are based on multiples of eight. Well, except when you are writing an emulator of an architecture that uses multiples of six. :P