A Framework for the Localization of Programming Languages

We have a new paper coming out at SPLASH-E in October. You can now read the final paper, and find the paper in the ACM library (doi: 10.1145/3622780.3623645).

Programming is English

I have been working on programming education for quite some time, and while I have studied several ways in which it is hard, I never fully understood all of them. For example, when kids start to program, very often the first program they are asked to create adds two numbers, as in The Coder’s Apprentice:

A screenshot from the book The Coder's Apprentice showing the first programming exercise to be adding 5 and 7 in the Python shell

Having grown up with Dutch and later English, I saw no issue at all in this. However, my former colleague and friend (and first author of the SPLASH paper!) Alaaeddin Swidan pointed out to me, early on when we started to work on Hedy, that many people in the world use very different characters for numbers!

I had no idea. Never in school, nor in the few compiler courses I took at university, did anyone teach me that different languages use different numerals. (Fun fact: in the Netherlands, it is mandatory for all kids aged 12 to learn Roman numerals, you know, to converse easily with clocks and buildings, but not the numeral systems that people actually use!)

Did you know about Arabic Numerals?

We tend to call the numbers 0 to 9 “Arabic numerals” (formally they are called Hindu-Arabic numerals), but actual modern-day Arabic uses different numerals:

0123456789
٠١٢٣٤٥٦٧٨٩
Numerals used in Arabic

So let’s go back to our programming example and try the Arabic numerals in the Python shell:

Python cannot process 5+7 in Arabic numerals. It errors out quite excessively!
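
The failure can be reproduced without the shell. The following is a minimal sketch, assuming CPython 3; the helper name `try_eval` is made up for illustration. It also shows that the built-in `int()` does accept the same digits when they arrive as a string, so the limitation sits in how source code is read, not in number conversion.

```python
def try_eval(source: str) -> str:
    """Evaluate an expression and report the result or the error type."""
    try:
        return str(eval(source))
    except SyntaxError:
        return "SyntaxError"


# Western digits work as literals in source code.
western = try_eval("5+7")

# Eastern Arabic digits are rejected when the source is tokenized.
arabic = try_eval("٥+٧")

# The same digits are accepted when converted from a string at run time.
converted = int("٥") + int("٧")

print(western, arabic, converted)
```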

Compilers are made with cultural assumptions

This is not just a Python issue; C++ errors out in a similar way.

By the way, it is not just Arabic that uses numerals other than 0-9. Hindi uses these:

0123456789
०१२३४५६७८९
Numerals used in Hindi

And there are a lot of other systems in use.

It is interesting to dive into why languages don’t properly support other numerals. In principle, it would be a very easy addition: numerals aren’t used elsewhere in a language, as separators for example. Most likely it would not confuse most parsers if the rule for numerals were numeral : 0-9|٠-٩|०-९

Why don’t we just add that? When Alaaeddin got me curious about this topic, I looked at some compiler theory. Let’s look at… the Dragon Book (edition 2 from 2007):

In the whole Dragon Book, digits and numerals are 0 to 9, and nowhere does it mention that there are different ones. Even newer works like the lovely Crafting Interpreters do not do better. So of course, programming language designers lack this knowledge too. Certainly I did, and I was even building a language for novices!

What parts of programming languages can be made more culturally inclusive?

If numbers aren’t really well supported in many programming languages, what other aspects of different cultures could we be overlooking? That is the question of our paper that is going to appear at SPLASH in a few weeks:

What aspects of programming languages can be localized to different languages and cultures? To answer that question, we gathered a set of aspects that can be localized, and looked for these aspects in a set of non-English programming languages. The remainder of this post will explain what features we found in these languages. The fact that, for each of the aspects, we can find a language that supports it, demonstrates that these features are both feasible to build and required by some users.

Here, we present the aspects roughly in order from easier to harder to build into existing languages.

Numerals

As explained above, many programming languages do not support numeral systems other than 0-9. This would not be hard to add: it can be done in one quite local production rule, or in the few rules that deal with numbers.
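
A minimal sketch of such a local rule, written as a Python regular expression instead of a grammar production. The names `NUMERAL` and `parse_numeral` are illustrative, and the sketch leans on the fact that Python's `int()` already converts Unicode decimal digits from a string.

```python
import re

# One range per numeral system: Western, Eastern Arabic, Devanagari.
NUMERAL = re.compile(r"[0-9٠-٩०-९]+")


def parse_numeral(text: str) -> int:
    """Return the integer value of a numeral in any of the three systems."""
    if not NUMERAL.fullmatch(text):
        raise ValueError(f"not a numeral: {text!r}")
    # int() understands decimal digits from all three scripts.
    return int(text)


print(parse_numeral("12"), parse_numeral("٥٧"), parse_numeral("१२"))
```

Whether a single numeral may mix digits from different systems is a design decision this sketch leaves open; as written, it accepts them.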

Punctuation

Similar to numbers, many languages use punctuation that is different from what is common in English. Arabic commas, for example, point to the right instead of to the left, and similarly question marks are mirrored compared to English ones. Some languages use different quotes, like guillemets in French.
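
One simple way to accept such punctuation is to map it onto the characters an existing parser already expects. This is only a sketch under that assumption; the names `PUNCTUATION` and `normalize_punctuation` are made up, and a real implementation would have to leave the contents of string literals untouched.

```python
# Map locale-specific punctuation to the ASCII characters a parser expects.
PUNCTUATION = str.maketrans({
    "\u060C": ",",  # Arabic comma
    "\u061F": "?",  # Arabic question mark
    "\u00AB": '"',  # left-pointing guillemet
    "\u00BB": '"',  # right-pointing guillemet
})


def normalize_punctuation(line: str) -> str:
    """Replace localized punctuation before the line reaches the parser."""
    return line.translate(PUNCTUATION)


print(normalize_punctuation("print «salut»"))
```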

Error messages

Programming language design includes the design of error messages. For most English-based programming languages, these are also shown in English, which can be a substantial barrier for non-English users. Error messages can, in principle, be localized with relative ease compared to keywords or productions.
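
A sketch of why this is comparatively easy: messages can be looked up in a per-language catalog, independently of the grammar. The catalog, the message code and the function name below are all hypothetical, and the Dutch text is only an illustrative translation.

```python
# Hypothetical message catalog; a real one would live in translation files.
MESSAGES = {
    "en": {"undefined_variable": "The variable {name} is not defined."},
    "nl": {"undefined_variable": "De variabele {name} is niet gedefinieerd."},
}


def error_message(code: str, lang: str, **details: str) -> str:
    """Look up a message in the user's language, falling back to English."""
    catalog = MESSAGES.get(lang, MESSAGES["en"])
    template = catalog.get(code, MESSAGES["en"][code])
    return template.format(**details)


print(error_message("undefined_variable", "nl", name="x"))
```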

Alternative keywords

Traditionally, programming languages use exactly one keyword for a concept. However, this limitation makes it hard to fully support a broad range of natural languages. One example where a user might need the support of multiple keywords for one concept is the case of gendered words. Arabic, for example, has gendered nouns: a noun can be either masculine or feminine. A keyword that follows a variable name would then be written as هو if it follows a masculine variable name, but as هي when the variable name is feminine. Grammars could support different options, although this might negatively impact performance.
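
In a lexer, supporting several surface forms for one concept can be as small as a lookup table. The sketch below assumes, purely for illustration, that both Arabic forms stand for a keyword whose internal token is called `IS`; the token names and the function are made up.

```python
# Several surface forms map to one internal token.
# The token name IS is an assumption made for this example.
KEYWORDS = {
    "هو": "IS",  # form used after a masculine variable name
    "هي": "IS",  # form used after a feminine variable name
    "is": "IS",
}


def token_for(word: str) -> str:
    """Return the internal token for a keyword, or NAME for anything else."""
    return KEYWORDS.get(word, "NAME")


print(token_for("هو"), token_for("هي"), token_for("x"))
```

This sketch does not check that the chosen form agrees with the gender of the variable name; it only shows that the later stages of the language see a single token.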

Diacritics

Many languages make use of diacritics to modify letters, such as the acute accent (on the a: á), the grave accent (à) and the circumflex (â) in French. These accents change the meaning of words: là means there, while la means the.

When designing a programming language for a language with accents, such as French, an interesting open question is how to handle diacritics in various cases. One such case is keywords: if a keyword is translated with a word that has accents, how should we treat the equivalent word without diacritics? For example, in French, repeat is translated as répète, which is used in some programming languages (such as Hedy and Quorum). If répète 3 fois is the correct code, do we also allow repete 3 fois, or will that lead to an error message?

If we allow all characters in variable names, including letters with diacritics, another problem arises, similar to case sensitivity, which we could call diacritic sensitivity: do words with and without diacritics, such as quantité and quantite, refer to the same variable, or to different ones?
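
Either answer can be implemented with Unicode normalization from Python's standard `unicodedata` module. The sketch below is one possible approach, not a description of how Hedy or Quorum handle this; the function names and the `diacritic_sensitive` flag are made up.

```python
import unicodedata


def strip_diacritics(word: str) -> str:
    """Remove combining marks, so that 'répète' becomes 'repete'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def same_name(a: str, b: str, diacritic_sensitive: bool = False) -> bool:
    """Compare two names under the chosen diacritic policy."""
    if diacritic_sensitive:
        # Still normalize, so that composed and decomposed spellings match.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
    return strip_diacritics(a) == strip_diacritics(b)


print(strip_diacritics("répète"))
print(same_name("quantité", "quantite"))
print(same_name("quantité", "quantite", diacritic_sensitive=True))
```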

Non-letter characters

So far, we have looked at letters and numbers, but some languages have other characters that could be supported too. Arabic, for example, has a character called tatweel, which can be used to prolong words, like this:

A tatweel is used to make writing prettier, by aligning words. In a programming language, tatweels could be supported in keywords, and could be used in variable names too, introducing the additional concept of tatweel sensitivity. Is a variable with and without tatweels in it the same variable?
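
If a language chooses to be tatweel insensitive, one straightforward option is to drop the character (U+0640) before names and keywords are compared. A minimal sketch, with a made-up function name:

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL, a purely decorative elongation character


def strip_tatweel(name: str) -> str:
    """Drop tatweels so that elongated and plain spellings compare equal."""
    return name.replace(TATWEEL, "")


plain = "\u0643\u062A\u0627\u0628"                  # a word without tatweels
elongated = "\u0643\u0640\u0640\u062A\u0627\u0628"  # same word with two tatweels
print(strip_tatweel(elongated) == plain)
```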

Alignment

A programming language that supports non-English can be either a translation of an existing programming language, or a freshly designed language. Both have pros and cons: existing languages might have good support for editors, but some existing languages might have decisions that are harder to localize (examples given below).

Keywords

Keywords tend to be in English in many programming languages. This could be localized with a local version of a grammar, and for some languages, like ALGOL, LOGO or BASIC, non-English dialects have been created.

Variable names

Traditionally, many programming languages only supported variable names with a-z and A-Z letters. Python introduced non-ASCII variable names with PEP 3131 in 2007. However, many languages and frameworks still do not support variable names that aren’t a-z.
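
In Python 3, the built-in `str.isidentifier()` can be used to check which names the language accepts. The example names below were picked for illustration only.

```python
# Candidate variable names in French, Arabic and Chinese,
# plus one that is invalid because it starts with a digit.
names = ["quantité", "متغير", "数量", "2fast"]

valid = {name: name.isidentifier() for name in names}
print(valid)
```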

Productions

Many productions in programming languages have an English “shape”, for example for fruit in basket or if temperature == 5. These productions could be made more natural sounding in the natural language. For example, in German one would say wenn x 5 ist, placing the verb at the end, and in Korean one would not say turn left but left turn. This would require extensive changes to the grammar of an existing language.
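
A toy sketch of what a differently shaped production means in practice, using the turn left versus left turn example. English glosses stand in for the real keywords, and the names `TURN_PATTERNS` and `parse_turn` are made up; a real language would express this in its grammar, not in regular expressions.

```python
import re

# Two word orders for the same command; both yield the same parsed result.
TURN_PATTERNS = {
    "verb_first": re.compile(r"turn (?P<direction>left|right)"),
    "verb_last": re.compile(r"(?P<direction>left|right) turn"),
}


def parse_turn(line: str, order: str) -> tuple:
    """Parse a turn command written in the given word order."""
    match = TURN_PATTERNS[order].fullmatch(line)
    if match is None:
        raise ValueError(f"cannot parse {line!r} as a {order} command")
    return ("turn", match["direction"])


print(parse_turn("turn left", "verb_first"))
print(parse_turn("left turn", "verb_last"))
```

Even in this tiny case every production needs a variant per word order, which hints at why the change to an existing grammar is extensive.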

Right to left support

There are multiple issues when it comes to supporting writing and editing text in right to left (RTL) languages. The first issue is alignment: RTL text should start at the right side of the screen and run towards the left. Secondly, there is the direction of letters and words in a sentence. Thirdly, there are ligatures: in some RTL languages (like Arabic), letters have different forms depending on their position in a word, to allow for connecting letters together. Many editors fail to respect this and split the letters, rendering the text very hard or impossible to read. This is an issue more at the level of editors than at the level of programming languages, but we consider it in scope of programming.
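
For the alignment issue, an editor first has to decide which direction a line should have. One common heuristic is to look at the first character with a strong direction, which the standard `unicodedata` module exposes. The sketch below shows only that heuristic, with a made-up function name; it does not address reordering or ligatures.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess the line direction from the first strongly directional character."""
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in ("R", "AL"):  # right-to-left letter, or Arabic letter
            return "rtl"
        if bidi == "L":          # left-to-right letter
            return "ltr"
    return "ltr"  # no strong character found; default to left-to-right


print(base_direction("print 5"))
print(base_direction("\u0627\u0637\u0628\u0639 5"))
```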

Multi-lingual

Another aspect to take into account in designing a truly inclusive programming language experience is the use of multiple languages intermixed. Many people around the world are native speakers of not one but two or even three languages. (The BBC estimates that between 60 and 75% of people are bilingual.) For bilingual people, being constrained to one of their native languages while programming is limiting, as they often switch languages when speaking or writing. This is hard to implement in programming languages, since it might cause ambiguities, especially if different shapes of productions are allowed.
