Re: [Air-L] Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

17 Jul 2009

      Using Wikipedia as a case to further the discussion

(1) The history of the Wikipedia logo: from English-only to an 
international identity, with some mistakes along the way
http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_logos
http://meta.wikimedia.org/wiki/Wikipedia/Logo

(2) An unsung hero (in my personal view, open to debate): Autrijus Tang 
and his work on Perl internationalization
http://www.perl.com/pub/a/2005/09/08/autrijus-tang.html
Tang is a Taiwanese hacker.

(3) Unicode support in Wikipedia
I have had trouble locating the version-control history that would show 
when Unicode was first supported and when it became fully supported. 
http://meta.wikimedia.org/wiki/Wikipedia_timeline   (Unicode is not 
mentioned there)
However, the "Chinese Wikipedia" entry in the English Wikipedia 
includes the following paragraphs:

==========================

The Chinese Wikipedia was established along with 12 other Wikipedias in 
May 2001. At the beginning, however, the Chinese Wikipedia did not 
support Chinese characters 
<http://en.wikipedia.org/wiki/Chinese_character>, and had no 
encyclopedic content.
It was in October 2002 that the first Chinese-language page was written, 
the Main Page <http://zh.wikipedia.org/wiki/>. The first registered user 
of the Chinese Wikipedia was Mountain. A software update 
<http://en.wikipedia.org/wiki/Software_update> on October 27 
<http://en.wikipedia.org/wiki/October_27>, 2002 
<http://en.wikipedia.org/wiki/2002> allowed Chinese language input. .....
In order to accommodate the orthographic differences between simplified 
Chinese <http://en.wikipedia.org/wiki/Simplified_Chinese> and 
traditional Chinese <http://en.wikipedia.org/wiki/Traditional_Chinese> 
(or Orthodox Chinese), from 2002 to 2003, Chinese Wikipedia community 
gradually decided to combine the two originally separate versions of 
Chinese Wikipedia. The first running automatic conversion between the 
two orthographic representation starts from December 23, 2004, with 
MediaWiki 1.4 release. The needs from Hong Kong and Singapore were taken 
into accounts in MediaWiki 1.4.2 release, which made conversion table 
for zh-sg default to zh-cn, and zh-hk default to zh-tw.^[2] 
<http://en.wikipedia.org/wiki/Chinese_Wikipedia#cite_note-1>

==========================
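
To make the variant fallback in the quoted passage concrete, here is a 
minimal sketch in Python.  It is not MediaWiki's actual code or data: 
the names FALLBACK, TABLES, resolve_table and convert are hypothetical, 
and the two-entry tables stand in for real conversion tables that are 
far larger and not simply one-to-one.  It only shows the idea that 
zh-sg borrows the zh-cn table and zh-hk borrows the zh-tw table.

```python
# Toy illustration only; not MediaWiki's implementation or its tables.
# Variant fallback as described in the quoted passage.
FALLBACK = {"zh-sg": "zh-cn", "zh-hk": "zh-tw"}

# Hypothetical, tiny per-variant character tables (assumed for the example).
TABLES = {
    "zh-cn": {"漢": "汉", "語": "语"},  # traditional -> simplified
    "zh-tw": {"汉": "漢", "语": "語"},  # simplified -> traditional
}


def resolve_table(variant):
    """Return the conversion table for a variant, following fallbacks."""
    seen = set()
    while variant not in TABLES:
        if variant in seen or variant not in FALLBACK:
            raise KeyError(f"no conversion table for {variant!r}")
        seen.add(variant)
        variant = FALLBACK[variant]
    return TABLES[variant]


def convert(text, variant):
    """Convert text character by character; unknown characters pass through."""
    table = resolve_table(variant)
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(convert("漢語", "zh-sg"))  # uses the zh-cn table
    print(convert("汉语", "zh-hk"))  # uses the zh-tw table
```

The point of the sketch is that one stored text can serve several 
reader communities, which is only practical when both scripts live in 
the same character encoding.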

Overall, the evidence above suggests that Wikipedia's 
internationalization was a deliberate effort to adopt the Unicode 
standard, driven mostly by the people who needed Unicode.  It is worth 
pointing out that around 2001 and 2002, the major operating systems 
(such as Microsoft Windows and Mac OS) that most ordinary PC users ran 
do not seem to have offered full Unicode support yet, which makes this 
development in Wikipedia all the more interesting.

Again, coming back to the original question: why does Wikipedia want to 
be Unicode?  Or, why did Wikipedia not choose some other solution to 
deliver interoperability?

-- 
Han-Teng Liao
PhD Candidate
Oxford Internet Institute
http://www.oii.ox.ac.uk/people/students.cfm?id=123

Han-Teng Liao (OII) wrote:
...
At the risk of taking your comments out of context, I have 
listed my responses below.
Mike Stanger wrote:
...
......The use of Unicode believing that it solves the 
interoperability issues and/or is a communication about the intent of 
the programmer is much the same sin, in my view.
I am not sure that "not using" Unicode can solve the interoperability 
issues.  If Unicode is one of the more attractive ways to deliver some 
interoperability (as Google, Wikipedia, YouTube, etc. are trying to 
do), then I do not know whether the two beliefs are "much the same 
sin".
...... However, just using unicode isn't going to resolve all of the 
interoperability issues (eg. reading direction, and other unique 
features of the written form of a particular language, etc.). 
Agreed, using Unicode by itself cannot save the world.  Still, would you 
mind showing me how not using Unicode, or using some alternative, would 
solve the issues better?  If such a solution or vision exists, why have 
Google, Wikipedia, Microsoft, Linux, Mac, etc. adopted Unicode?  I am 
not citing these examples to refute your argument.  I am genuinely 
intrigued to find out why they arrived at this solution and not others 
(including maintaining the status quo by not deploying Unicode to some 
extent).
Ultimately though, what data storage in Unicode does provide almost 
automatically is the preservation of the appropriate data (unless it 
gets transformed of course), and its use /could potentially/ signal 
the intent by the author to enable the coexistence of mixed language 
content as a politically friendly gesture.  I would agree that 
character encodings could potentially send a signal about the 
/intent/ to be good internet citizens, or that the /intentional/ use 
of something other than unicode could be seen as a statement of 
political position (eg. mainland China's use of jianti character sets 
in a particular code page vs. a codepage that supported fanti). 
Agreed, good will matters.  Still, the effort to deliver on that good 
will matters as well.  In another email I will present some evidence 
that in Perl (the language of UseModWiki, the software Wikipedia first 
ran on; MediaWiki itself is written in PHP) and in the logos of 
Wikipedia and Chinese Wikipedia, most of the work was requested and 
done by those who need Unicode support.  So it is not only a picture of 
good will but also some kind of push and pull.
...
However, I think often programmer intent is lost in the end-product. 
 It would be encouraging to see a movement where programmers stated 
that their /active decision/ to use Unicode is a deliberate 
recognition of the multitude of languages as a 'politically friendly' 
gesture.
Calling it "politically friendly" or "politically correct" could be a 
bit patronizing.  I would argue that Wikipedia itself benefits more 
from having other language versions (higher ranking in search results, 
a better webometric position, etc.).
I also assume that there are many coders who are using unicode, but 
doing so less than deliberately, perhaps even as a side-effect of the 
development environment that they use (eg. Java's native 
character/string support), /mirroring the use of ASCII in earlier 
environments/. These applications may well support Unicode at the 
character level, but because the programmer's use of Unicode is a 
sort of side-effect, the end product may not actually interoperate 
with other languages properly or completely.
So while I agree that the use of Unicode is a step forward in 
interoperability, I'd argue that the work to be done is not so much 
about the use of Unicode, but the /publicly stated intent to be 
interoperable/. Unicode may be one tool that can assist in that goal 
if used properly, but the use of Unicode alone says little about intent.
I slightly disagree on the meaning of interoperability.  If 
interoperability means that a given linguistic space can keep using a 
non-Unicode standard, then it may create a linguistic hierarchy.  For 
example, if Chinese sites use GB2312 throughout their user-generated 
websites, then Tibetan and traditional Chinese characters cannot have 
a voice there.  Or imagine that YouTube could not directly accept 
content contributed by Arabic or Persian users, and offered only some 
kind of "interface" that promised interoperability.  To me the issue is 
not full support of Unicode at this moment; it is that the fact that 
Unicode is arguably the most open linguistic infrastructure receives 
so little attention.
The sharp question then is: can Beijing, Washington, London, and Tokyo 
deliver their government services and communicative spaces while 
staying in their linguistic ghettos, without using Unicode or another 
open linguistic architecture?
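
The GB2312 point above can be checked directly.  The following is a 
small sketch in Python, using only the standard "gb2312" and "utf-8" 
codecs; the helper name can_encode and the sample strings are my own 
choices for illustration, not anything from the thread.  It asks 
whether each sample can be represented at all in each encoding.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample strings chosen for illustration (assumptions, not from the thread).
samples = {
    "simplified Chinese": "汉语",
    "traditional Chinese": "漢語",
    "Tibetan": "བོད་ཡིག",
    "Arabic": "العربية",
}

if __name__ == "__main__":
    for label, text in samples.items():
        print(label,
              "| gb2312:", can_encode(text, "gb2312"),
              "| utf-8:", can_encode(text, "utf-8"))
```

Only the simplified Chinese sample fits in GB2312, while all four fit 
in UTF-8.  A site that stores its text in GB2312 therefore has no way 
even to hold the other three, whatever the good will of its operators.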