On Jul 15, 2009, at 5:04 PM, Han-Teng Liao (OII) wrote:
I agree that "Information wants to be digital", and that is why we should start an honest conversation among programmers, IT support, academics and policy makers. I disagree with the notion that the technical support of Unicode source is confusing for programmers. Please refer to the following blog post:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky http://www.joelonsoftware.com/articles/Unicode.html
Yes, I read that many years ago when we first needed to address codepages/Unicode/etc. :-) but I think the article is primarily saying that a programmer's assumption (or ignorance) that a given text string has meaning in ASCII is problematic, and that Unicode is a better general solution for a number of reasons. Using Unicode in the belief that it solves the interoperability issues, and/or that it communicates the programmer's intent, is much the same sin, in my view. My comment wasn't intended to imply that it is confusing to implement your own software in Unicode, but that Unicode is still an encoding, and you have to deal with it, its assumptions, the assumptions of the programmers, and the assumptions of the same set of actors behind any software or system with which you might interact. If one is intentionally trying to be interoperable, it's a similar set of concerns as if you were dealing with traditional codepages; one just uses different approaches to determine whether the content can be displayed accurately by the time it makes its way to the ultimate recipient of the information. However, just using Unicode isn't going to resolve all of the interoperability issues (e.g. reading direction and other unique features of the written form of a particular language).
Ultimately though, what data storage in Unicode does provide almost automatically is the preservation of the appropriate data (unless it gets transformed, of course), and its use could potentially signal the author's intent to enable the coexistence of mixed-language content as a politically friendly gesture. I would agree that character encodings could potentially send a signal about the intent to be good internet citizens, or that the intentional use of something other than Unicode could be seen as a statement of political position (e.g. mainland China's use of jianti (simplified) character sets in a particular codepage vs. a codepage that supported fanti (traditional)).
However, I think programmer intent is often lost in the end product. It would be encouraging to see a movement where programmers stated that their active decision to use Unicode is a deliberate recognition of the multitude of languages, as a 'politically friendly' gesture. I also assume that there are many coders who are using Unicode, but doing so less than deliberately, perhaps even as a side effect of the development environment that they use (e.g. Java's native character/string support), mirroring the use of ASCII in earlier environments. These applications may well support Unicode at the character level, but because the programmer's use of Unicode is a sort of side effect, the end product may not actually interoperate with other languages properly or completely. So while I agree that the use of Unicode is a step forward in interoperability, I'd argue that the work to be done is not so much about the use of Unicode as about the 'publicly' stated intent to be interoperable. Unicode may be one tool that can assist in that goal if used properly, but the use of Unicode alone says little about intent.
Mike
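To make the point about encodings concrete, here is a minimal Python sketch (not anything taken from the thread itself). It shows that stored bytes carry no label saying which encoding they are in, so decoding them under a wrong assumption yields garbage without raising an error, and that reading direction is a per-character property that the application still has to handle. The helper names `decode_with` and `directions` and the sample strings are illustrative assumptions.

```python
import unicodedata


def decode_with(raw, encoding):
    """Decode raw bytes under an assumed encoding (hypothetical helper)."""
    return raw.decode(encoding, errors="replace")


def directions(s):
    """Return the Unicode bidirectional class of each character in s."""
    return [unicodedata.bidirectional(ch) for ch in s]


# Sample text stored as UTF-8 bytes; the bytes themselves do not say "UTF-8".
text = "中文"
raw = text.encode("utf-8")

as_utf8 = decode_with(raw, "utf-8")      # correct assumption: text comes back
as_latin1 = decode_with(raw, "latin-1")  # wrong assumption: mojibake, no error

# The data is preserved as long as nothing transforms it: undoing the wrong
# decode recovers the original bytes, and therefore the original text.
recovered = as_latin1.encode("latin-1").decode("utf-8")

# Latin 'a', Hebrew alef, Arabic alef: one string, three directional classes.
mixed = "a\u05d0\u0627"

print(as_utf8 == text, as_latin1 == text, recovered == text)
print(directions(mixed))
```

Nothing here is specific to any one system; the same mismatch can occur between any two programs that make different assumptions about the bytes they exchange.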
We can debate the technical implementation on and on (but I hope the above link has settled the technical debate). However, it would be better to ask first whether we need, for example, Korean, Japanese and Chinese to *coexist* on the same page, or alternatively, Hebrew and Arabic to *coexist* on the same page. If the social and communicative need across languages is among our priorities in supporting a better Internet environment, then the answer is obvious. Again, the fact that Unicode is supported and maintained by industry and experts, and that Google, YouTube, Facebook and other websites support it, probably comes down to a simple reason: they want to reach other local markets. My short-cut and simplified understanding of the whole software industry movement in i18n (internationalization) and l10n (localization) is as follows. The industry (along with the open source community, which actually excels in i18n and l10n) proposed Unicode by first imagining that there is limitless space (codepoints) for alphabets/scripts/strokes/characters to be assigned. The industry can then compete to implement them and satisfy any potential markets. So I am of the opinion that Unicode is actually market-friendly and potentially programmer-friendly. It takes more effort to make it politically friendly rather than merely politically correct. I hope my starting point is not about multicultural or multilingual correctness, but about the open nature of the Internet....
All languages want to be digital. We have enough space for them.
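As a small illustration of scripts coexisting in one piece of text, the following Python sketch (an assumption-laden example, not part of the original exchange) puts Korean, Japanese, Chinese, Hebrew and Arabic characters in a single string and lists each character's code point, Unicode name and UTF-8 length. The helper `describe` and the sample string `page` are hypothetical. Note that the Unicode code space is large but finite: it runs from U+0000 to U+10FFFF.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, UTF-8 byte length) per character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.name(ch), len(ch.encode("utf-8")))
        for ch in text
        if not ch.isspace()
    ]


# Korean, Japanese, Chinese, Hebrew and Arabic in one ordinary string.
page = "한 あ 中 א ع"

rows = describe(page)
for code_point, name, utf8_len in rows:
    print(code_point, name, utf8_len)

# Total number of code points available: U+0000 through U+10FFFF inclusive.
CODE_SPACE = 0x10FFFF + 1
print(CODE_SPACE)
```

Storing the characters together is the easy part; whether they are rendered, sorted and searched correctly is still up to the software that handles the string.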