Thursday, March 16, 2006

Eiffel and case-sensitivity

SmartEiffel has been case-sensitive (at least on the first character of identifier names) since its inception. For example, the compiler has always treated Therapist and theRapist as distinct identifiers.

At the urging of Eric Bezault, a "-case_insensitive" flag was added in 1997 to aid compatibility with other Eiffel compilers, but the flag was removed in release 2.2 in December 2005. Furthermore, the compiler now enforces certain conventions: a class name must be written using only upper case letters, and a warning is issued if the reserved words "Void", "Current", "Result", "True" and "False" are not spelled with an initial capital letter.

What are the reasons for this?

The SmartEiffel website explains:
"This allows us to have better error messages, better error recovery. Furthermore, we think it is better too for readability of our code."

The downside, of course, is that this behaviour is at odds with every other Eiffel compiler, and with every Eiffel standard, and with every Eiffel textbook. It also clashes with the quirky but legal non-standard capitalization used in the acclaimed Gobo package.

And (on a selfish note) it clashes with my own code because I like to write my reserved words in lowercase.

No doubt better error messages and better error recovery are possible with a more constrained language. Also, in a 1997 discussion, Wouter Scholten praised case-sensitivity when wrapping xlib for SmallEiffel - because xlib includes C constants that have the same lowercase mapping (such as 'XK_a' and 'XK_A'), and Wouter wanted a one-to-one mapping rather than being forced to use Eiffel's "alias" keyword for his wrappings.
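To make Wouter's xlib problem concrete, here is a minimal Python sketch of the clash. The constant names XK_a and XK_A and their values are the usual xlib keysyms; the helper function `case_insensitive_clashes` is my own invention for illustration and is not part of any Eiffel or xlib tool.

```python
from collections import defaultdict

# Two xlib keysym constants whose names differ only in letter case.
KEYSYMS = {"XK_a": 0x61, "XK_A": 0x41}


def case_insensitive_clashes(names):
    """Group names that become identical once letter case is ignored."""
    groups = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    # Keep only the groups in which more than one distinct name collided.
    return {key: found for key, found in groups.items() if len(found) > 1}


print(case_insensitive_clashes(KEYSYMS))
```

In a case-insensitive language the two constants cannot both keep their C names, so one of them has to be renamed in the wrapper.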

In a 1998 comp.lang.eiffel thread, Joachim Durchholz pointed out that case-sensitivity means you can't unambiguously read code over the phone without spelling out the case of every letter. Graham Perkins pointed out that in the real world British Airways and BRITISH AIRWAYS are the same airline.

Eiffel isn't the only language to have this dilemma. In 2001, Python's designer Guido van Rossum considered changing Python to be case-insensitive, which he considered to be superior. In the end though, he relented:
"...never mind, I'm giving up on making *Python* case-insensitive. The hostility of the user community is frightening."

At the end of a long discussion on the SmartEiffel mailing list last year, Richard O'Keefe reviewed the possibilities, concluding that case-insensitivity combined with error messages or warnings was the best available option:
Let's consider several alternatives:

(1) 'A' and 'a' are treated like completely different letters. Each identifier must be cased consistently, otherwise it would be a different identifier.

It is argued that this is confusing for people because case does not provide any clue about what kind of thing you are looking at.

Some case convention rules (such as requiring constant-like macros or static finals to be UPPER_ONLY) are clearly extremely bad ideas because they highlight a distinction that the programmer should be ignoring.

It is also argued that it is confusing for people if names that are pronounced the same have different meanings.

(2) As (1), except that the compiler will not allow an identifier to be declared in a particular scope if a similarly spelled but differently cased identifier is used in that scope.

This fixes the "same pronunciation but different meaning" problem. It does nothing about the "case is no clue" problem, but then, it's not clear how case *could* be a clue for, say, Japanese.

(3) As (2), except that the compiler also enforces a style rule, like Prolog's "lower case = constructor, upper case = variable" or Haskell's "lower case = variable, upper case = constructor".

This is fine, except that it *can't* work for many scripts. For English, there's the caseless Shavian alphabet, but who uses it? More realistically, in Hebrew and Arabic and Indic scripts such as Devanagari there is no case distinction, so if there is one group of words that MUST be spelled with a particular case and another group that MUST be spelled with another, what on earth can be done for those scripts?

(What Quintus did was to say that "_" was an upper case letter, and that non-cased script elements were lower case letters, so variables in non-cased scripts just had to start with "_". Some programming languages, like C, make it hard to adopt such a rule, and others, like Java, strongly discourage the use of "_" anywhere. One has to wonder what the Java designers thought they were playing at.)...

(4) 'A' and 'a' are the same letter. You can write BEGIN, begin, Begin, beGiN, or any other case pattern that takes your fancy,

It is argued that this is confusing for people because the same identifier may appear in many different forms making it hard to recognise visually.

This is "case insensitivity" and it is hard to see how it promotes readability.

(5) As (4), except that whenever an identifier is used, the compiler checks that it is used with the same case pattern it was declared with and warns if it isn't.

This fixes the "same pronunciation different meaning" and the "same identifier but different looks" problems.

(6) As (5), but the compiler strongly (with error messages) or weakly (with warning messages) enforces some style rules.

As far as I can see, (6) would be the best choice for an Eiffel system.

+ Recognising the same identifier, however cased, would maximise compatibility with other Eiffel systems.

+ Warning when an identifier is cased inconsistently would tend to improve readability.

+ Warning about the violation of style rules would tend to improve readability (provided the style rules don't insist on bad things like IDENTIFIERS_IN_ALL_CAPITALS or baStudlyCaps) but would also allow any necessary deviations.
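
Alternatives (5) and (6) are easy to picture as a symbol table. What follows is only a rough Python sketch under my own assumptions, not how SmartEiffel or any other Eiffel compiler works: the `SymbolTable` class and the `no_all_caps` style rule are hypothetical names made up for this example. Lookup ignores case, the spelling used at declaration is remembered, and a differently cased use produces a warning rather than an error.

```python
class SymbolTable:
    """Case-insensitive lookup that remembers each declared spelling."""

    def __init__(self, style_rule=None):
        self._declared = {}           # lowercased name -> declared spelling
        self.style_rule = style_rule  # optional callable: name -> message or None
        self.warnings = []

    def declare(self, name):
        key = name.lower()
        if key in self._declared:
            raise ValueError(
                f"{name!r} already declared as {self._declared[key]!r}"
            )
        self._declared[key] = name
        # Alternative (6): optionally apply a style rule at declaration time.
        if self.style_rule is not None:
            message = self.style_rule(name)
            if message:
                self.warnings.append(message)

    def use(self, name):
        key = name.lower()
        if key not in self._declared:
            raise KeyError(f"undeclared identifier {name!r}")
        declared = self._declared[key]
        # Alternative (5): same identifier, but warn about inconsistent casing.
        if name != declared:
            self.warnings.append(f"{name!r} used, but declared as {declared!r}")
        return declared


def no_all_caps(name):
    """Hypothetical style rule: discourage IDENTIFIERS_IN_ALL_CAPITALS."""
    if len(name) > 1 and name.isupper():
        return f"{name!r} is written entirely in capitals"
    return None


table = SymbolTable(style_rule=no_all_caps)
table.declare("item_count")
table.use("item_count")   # same casing as the declaration: no warning
table.use("Item_Count")   # resolves to the same identifier, with a warning
print(table.warnings)
```

Because the warnings are collected rather than raised, code written for a case-insensitive compiler would still be accepted, which is the compatibility point in the first "+" item above.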
Anyway, SmartEiffel's case-sensitivity is not likely to change. Dominique Colnet has said more than once that "SE will not reverse this decision".
