Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Han-Teng Liao (OII)

14 Jul 2009

7:21 a.m.

Dear all,

Running the risk of trolling and misrepresenting the famous motto "Information wants to be ASCII", I want to raise the question of the difference between "Information wants to be ASCII" and "Information wants to be Unicode" from a multilingual perspective.

It should be pointed out that when Lev Manovich declared "Information wants to be ASCII" while talking about remix and the remixability of information, it was 2005, when Unicode was still in its early adoption period globally. So I do not intend to make a lazy criticism of the America-centric implication inside ASCII, but rather to raise the question of remix and remixability across linguistic boundaries. Why is Unicode not universally deployed yet? How can we measure the remixability that is lost across linguistic boundaries simply because information is not encoded in Unicode? Why are so many user-generated content websites in China using only their simplified-Chinese-only "national standard" (GB2312), even when Hong Kong (which uses traditional Chinese, not included in GB2312) is part of China and Beijing claims Taiwan is part of China? What about Tibetan-written information: does it want to be Unicode or GB 18030-2000? Tibetan-written information cannot be ASCII anyway.

I would really like to hear from you.

Best regards,

-- Han-Teng Liao PhD Candidate Oxford Internet Institute http://www.oii.ox.ac.uk/people/students.cfm?id=123
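
The coverage claims in this message can be checked with a short Python sketch. It assumes that the codecs bundled with CPython (`ascii`, `gb2312`, `big5`, `gb18030`, `utf-8`) are reasonable stand-ins for the standards named above; the sample strings and the helper names `can_encode` and `coverage_table` are illustrative and not from the thread.

```python
# Which encodings can represent which scripts? A rough check using the
# codecs that ship with CPython. The sample strings are illustrative only.

SAMPLES = {
    "simplified Chinese": "汉语",
    "traditional Chinese": "漢語",
    "Tibetan": "བོད་ཡིག",
}

CODECS = ["ascii", "gb2312", "big5", "gb18030", "utf-8"]


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` can be encoded in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def coverage_table() -> dict:
    """Map each sample label to a {codec: encodable?} dictionary."""
    return {
        label: {codec: can_encode(text, codec) for codec in CODECS}
        for label, text in SAMPLES.items()
    }


if __name__ == "__main__":
    for label, row in coverage_table().items():
        supported = [codec for codec, ok in row.items() if ok]
        print(f"{label}: {', '.join(supported)}")
```

Under these assumptions the Tibetan sample should fit only GB 18030 and UTF-8, and the traditional-character sample should not fit GB2312, which is the gap the message describes.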

Mike Stanger

15 Jul

6:37 a.m.

At the risk of sounding like an apologist for a particular linguistic-centricism, English or otherwise, from a programmer's standpoint there are challenges beyond simply the choice to use Unicode or some language-specific codepage.

Just using Unicode doesn't guarantee that the application viewing the content will have the appropriate fonts, for one, even if the proper Unicode character sequences are sent (much as marking your pages as GB2312 doesn't give the end-user's machine the automatic ability to display the content). So it's questionable whether end-user usability will actually improve just by using Unicode, and I would expect that at some levels it makes it more difficult to guarantee interoperability when the incoming stream is arbitrary content in an arbitrary language or set of languages.

When I'm coding, I'm actually much more comfortable knowing that I have a specific codepage to address rather than just knowing that I'll have a Unicode stream, because I'll know exactly what my application should support. Unicode really tells me nothing other than that the content could be any known character, including the famous "snowman" symbol :-) If I'm trying to mash up a site and my code sees that it's in GB2312, I can take appropriate steps to support it, or report back that the feed is incompatible. If I get a Unicode source, I have to be constantly aware that the feed might at some time have requirements that I haven't yet addressed.

Rather than restricting the phrase to linguistic elements and suggesting that "Unicode" is a superior term to "ASCII" in this case, I'd broaden it out and say "Information wants to be Digital" -- I think that's more the heart of the matter, though the term ASCII conveys more meaning about language/etc. and likely helps make the implication of the argument more direct.

YMMV - There are of course libraries of routines to address such issues in code, but I think that actually points to the fact that sometimes Unicode is not the simple, direct answer to a problem that people might expect it to be.

Mike
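
The workflow Mike describes (act on a declared codepage or reject the feed, whereas a Unicode feed needs ongoing checks on what actually arrives) might look roughly like the Python sketch below. Everything in it is hypothetical: the supported-codepage and supported-script sets, the function names, and the shortcut of reading a character's script from the first word of its Unicode name, which is only an approximation.

```python
import unicodedata

# Hypothetical policy for an imaginary mash-up: what it has been built to handle.
SUPPORTED_CODEPAGES = {"gb2312", "iso-8859-1"}
SUPPORTED_SCRIPTS = {"LATIN", "CJK"}


def scripts_in(text: str) -> set:
    """Approximate the scripts in `text` from the first word of each letter's name."""
    found = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def check_feed(raw: bytes, declared: str):
    """Return (status, text) for a feed with a declared character encoding.

    A declared codepage is a closed promise: supported or not. A UTF-8 feed is
    open-ended, so its content is inspected on every call. Decoding errors are
    deliberately left to propagate.
    """
    declared = declared.lower()
    if declared in SUPPORTED_CODEPAGES:
        return "ok", raw.decode(declared)
    if declared != "utf-8":
        return "incompatible", None
    text = raw.decode("utf-8")
    unknown = scripts_in(text) - SUPPORTED_SCRIPTS
    if unknown:
        return "unsupported scripts: " + ", ".join(sorted(unknown)), text
    return "ok", text


if __name__ == "__main__":
    print(check_feed("汉语".encode("gb2312"), "GB2312"))
    print(check_feed("བོད".encode("utf-8"), "utf-8"))
```

The design point is only that the codepage branch is decided once from the label, while the UTF-8 branch has to look at the content each time.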

_______________________________________________ The Air-L@listserv.aoir.org mailing list is provided by the Association of Internet Researchers http://aoir.org Subscribe, change options or unsubscribe at: http://listserv.aoir.org/listinfo.cgi/air-l-aoir.org

Join the Association of Internet Researchers: http://www.aoir.org/

Han-Teng Liao (OII)

16 Jul

12:04 a.m.

I agree that "Information wants to be digital", and that is why we should start an honest conversation among programmers, IT support, academics and policy makers. I disagree with the notion that technical support for a Unicode source is confusing for programmers. Please refer to the following blog post:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky http://www.joelonsoftware.com/articles/Unicode.html

We can debate the technical implementation on and on (but I hope the above link has settled the technical debate). However, it would be better to ask first whether we need, for example, Korean, Japanese and Chinese to *coexist* on the same page, or alternatively, Hebrew and Arabic to *coexist* on the same page. If the social and communicative need across languages is among our priorities in supporting a better Internet environment, then the answer is obvious. Again, the fact that Unicode is supported and maintained by industry and experts, and that Google, YouTube, Facebook and other websites support it, probably has a simple reason: they want to reach other local markets.

My short-cut and simplified understanding of the whole software industry movement in i18n (internationalization) and l10n (localization) is as follows. The industry (along with the open source community, which actually excels in i18n and l10n) proposed Unicode by first imagining that there is limitless space (codepoints) for alphabets/scripts/strokes/characters to be assigned. Then the industry can compete to implement them and satisfy any potential markets. So I am of the opinion that Unicode is actually market-friendly and potentially programmer-friendly. It takes more effort to make it politically friendly rather than merely politically correct. I hope my starting point is not about multicultural or multilingual correctness, but about the open nature of the Internet....

All languages want to be digital. We have enough space for them.

-- Han-Teng Liao PhD Candidate Oxford Internet Institute http://www.oii.ox.ac.uk/people/students.cfm?id=123

Mike Stanger

5:23 p.m.

On Jul 15, 2009, at 5:04 PM, Han-Teng Liao (OII) wrote:

...

I agree that "Information wants to be digital", and that is why we should start a honest conversations among programmers, IT support, academics and policy makers. I disagree that the notion that the technical support of Unicode source is confusing for programmers. Please refer to the following blog post:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky http://www.joelonsoftware.com/articles/Unicode.html

Yes, I read that many years ago when we first needed to address codepages/Unicode/etc. :-) but I think the article is primarily saying that a programmer's assumption (or ignorance) that a given text string has meaning in ASCII is problematic, and that the use of Unicode is a better general solution for a number of reasons. Using Unicode in the belief that it solves the interoperability issues, and/or that it communicates the intent of the programmer, is much the same sin, in my view.

My comment wasn't intended to imply that it was confusing to implement your own software in Unicode, but that Unicode is still an encoding, and you have to deal with it, its assumptions, the assumptions of the programmers, and the assumptions of the same set of actors behind any software or system with which you might interact. If one is intentionally trying to be interoperable, it's a similar set of concerns as when dealing with traditional codepages; one just uses different approaches to determine whether the content can be displayed accurately by the time it makes its way to the ultimate recipient of the information. However, just using Unicode isn't going to resolve all of the interoperability issues (e.g. reading direction and other unique features of the written form of a particular language).

Ultimately though, what data storage in Unicode does provide almost automatically is the preservation of the appropriate data (unless it gets transformed, of course), and its use could potentially signal the author's intent to enable the coexistence of mixed-language content as a politically friendly gesture.

I would agree that character encodings could potentially send a signal about the intent to be good internet citizens, or that the intentional use of something other than Unicode could be seen as a statement of political position (e.g. mainland China's use of jianti character sets in a particular codepage vs. a codepage that supported fanti). However, I think programmer intent is often lost in the end product. It would be encouraging to see a movement where programmers stated that their active decision to use Unicode is a deliberate recognition of the multitude of languages as a 'politically friendly' gesture.

I also assume that there are many coders who are using Unicode, but doing so less than deliberately, perhaps even as a side-effect of the development environment that they use (e.g. Java's native character/string support), mirroring the use of ASCII in earlier environments. These applications may well support Unicode at the character level, but because the programmer's use of Unicode is a sort of side-effect, the end product may not actually interoperate with other languages properly or completely.

So while I agree that the use of Unicode is a step forward in interoperability, I'd argue that the work to be done is not so much about the use of Unicode, but the 'publicly' stated intent to be interoperable. Unicode may be one tool that can assist in that goal if used properly, but the use of Unicode alone says little about intent.

Mike
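
The reading-direction point can be made concrete with a small Python sketch. Unicode records a bidirectional class for each character, which the standard `unicodedata` module exposes, but laying out mixed-direction text is still the job of the rendering application. The helper names below are hypothetical.

```python
import unicodedata

# Bidirectional classes used by right-to-left letters: "R" (e.g. Hebrew)
# and "AL" (Arabic letters). Left-to-right letters carry class "L".
RTL_CLASSES = {"R", "AL"}


def direction_classes(text: str) -> set:
    """Return the set of bidirectional classes of the letters in `text`."""
    return {unicodedata.bidirectional(ch) for ch in text if ch.isalpha()}


def mixes_directions(text: str) -> bool:
    """True if `text` contains both left-to-right and right-to-left letters."""
    classes = direction_classes(text)
    return "L" in classes and bool(classes & RTL_CLASSES)


if __name__ == "__main__":
    for sample in ["hello", "שלום", "سلام", "שלום world"]:
        print(repr(sample), sorted(direction_classes(sample)), mixes_directions(sample))
```

Storing such text in Unicode preserves it, yet a program that assumes a single direction can still display it incorrectly, which is the kind of interoperability gap described above.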

Joseph Reagle

7:03 p.m.

On Thursday 16 July 2009, Mike Stanger wrote:

...

My comment wasn't intended to imply that it was confusing to implement your own software in Unicode, but that Unicode is still an encoding, and you have to deal with it, its assumptions, and the assumptions of

Just as a bit of evidence of how difficult it can be to grok character issues: Unicode is not "an encoding" itself, but a repertoire of characters, their names, and (abstract) code points (i.e., UCS), plus a set of encodings (e.g., UTF-8, UTF-16), extra properties, and algorithms. And I'm sure a Unicode geek could poke some holes in what I've said!

Unfortunately, I've spent lots of time wrapping my head around these issues in XML and Python. The character repertoire is still growing; imagine what that can mean for digital signatures on XML documents. Or how easy it is to trick people into thinking they are going to a URL they know when you can pull off character hijinks with IRIs. Dealing with byte-order marks (BOMs), transcoding, character decomposition, etc. *is* confusing to implement. So it's not just a matter of lazy westerners.

I personally look forward to the day when I can use Python 3.*, as it is only now that we are finally moving into a Unicode world. (Something I was just dealing with today: not even bibtex8 can deal with all of Unicode, for those of us who use LaTeX.)
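
A minimal Python 3 sketch of the distinctions listed above (code point versus encoding, byte-order marks, decomposition, look-alike characters), using only the standard `codecs` and `unicodedata` modules. The variable names are illustrative, and the look-alike pair is just one example of the kind of trick an IRI could exploit.

```python
import codecs
import unicodedata

text = "\u00e9"  # 'é' as one abstract character: code point U+00E9

# 1. One code point, several byte encodings.
utf8_bytes = text.encode("utf-8")         # two bytes: C3 A9
utf16be_bytes = text.encode("utf-16-be")  # two different bytes: 00 E9

# 2. Byte-order marks: the generic "utf-16" codec prepends a BOM so that a
#    reader can tell which byte order the writer used.
utf16_bytes = text.encode("utf-16")
has_bom = utf16_bytes.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))

#    A UTF-8 BOM survives a plain "utf-8" decode as U+FEFF, while the
#    "utf-8-sig" codec strips it.
with_bom = codecs.BOM_UTF8 + b"hi"
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

# 3. Decomposition: the same visible character can be one code point (NFC)
#    or a base letter plus a combining accent (NFD).
decomposed = unicodedata.normalize("NFD", text)
recomposed = unicodedata.normalize("NFC", decomposed)

# 4. Look-alikes: two different code points that many fonts draw identically.
latin_a = "\u0061"
cyrillic_a = "\u0430"

if __name__ == "__main__":
    print(utf8_bytes, utf16be_bytes, has_bom)
    print(repr(kept), repr(stripped))
    print([f"U+{ord(c):04X}" for c in decomposed], recomposed == text)
    print(unicodedata.name(latin_a), "/", unicodedata.name(cyrillic_a))
```

Naive string comparison treats the NFC and NFD forms as different, and treats the two look-alike letters as different even though a reader may not, which is why signatures and URLs are sensitive to these details.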

Mike Stanger

7:14 p.m.

...

Just as a bit of evidence of how difficult it can be to grok character issues: Unicode is not "an encoding" itself, but a repertoire of characters, their names, and (abstract) code points (i.e., UCS), plus a set of encodings (i.e., UTF-8, UTF-16), extra properties, and algorithms. And I'm sure a Unicode geek could pick some wholes in what I've said!

True enough :-) Part of the problem in discussing Unicode (and other things) is that one can speak to it at a 'standards' level or an 'in practice' level, at whatever level of practice the person encounters Unicode. By encoding I wasn't intending to imply that it was like dealing with a codepage equivalent, but that there are assumptions that are part of using Unicode that may not be visible to the people using it.

I'm thinking that the stated intent by a programmer, say in an open source project, that the project is using Unicode for the purposes of being 'politically friendly' and interoperable would have the effect of not only making the statement, but encouraging people to help guide the programmer(s) in actually achieving that goal -- those who have a deeper understanding of the issues informing those who are looking for the practical goal of interoperability.

Mike

Han-Teng Liao (OII)

17 Jul

12:16 a.m.

First of all, I have to reframe the question in a different way. Is it the problem of ASCII or the problem of Unicode that we are talking about? At one extreme, we can argue that it would be nice if all domain names, hyperlinks and URLs stayed in the English alphabet (which leads into the ICANN multilingual issue that I aim to avoid in this discussion); at the other extreme, we can argue that there would be no problems if everyone were using Unicode now (which implies a coercive force imposing it without the usual technology diffusion).

I cannot speak for all the open source contributors out there. I have not even tried to find the regional and linguistic demographics of the open source community. Though I am a big fan of "good will" in Reagle's thesis, I cannot overlook the potential for competition and creative conflict among all branches of open source projects. Then the question would be: who should make this effort? I would argue that the burden falls overwhelmingly on the people who have to use Unicode. In practice, it easily becomes a favor to be asked by those who need Unicode, and extra work to be done by IT support. Then Unicode the solution becomes a problem.

I am not saying there is no problem in Unicode implementation. The reason I raise the problem here on the AoIR mailing list, and not on the Unicode mailing list, is not to reaffirm the perception that adopting Unicode can be difficult, but rather to raise the relevant research issues around it. Imagine that the Wikipedia project had not managed to implement Unicode when it was hard. Imagine that Chinese Wikipedia had not managed to negotiate entry titles and URLs between simplified and traditional Chinese. Wikipedia would never be the same. It is not a favor that we (who need Unicode support) ask. We (internet researchers) need empirical research to see why and how Unicode support is implemented in various projects. It is not merely a matter of providing better support for programmers.

Again, I am not arguing that the transition from non-Unicode to Unicode is easy and could be done overnight, and hence I have no intention of implying that it all comes down to programmers' unwillingness or laziness to finish the mundane jobs. It is the opposite. If we lay out why, how much and how Wikipedia, YouTube, Google, etc. invest in Unicode deployment (exploiting the open nature of the Internet), we can better understand the richer dimensions of techno-linguistic policies.

It is not my intention to play a blame game (the West versus the East, or programmers versus users). It is the opposite. Why does Baidu support only simplified Chinese versions of its services, excluding Tibetans, Hong Kongers and even the Taiwanese whom Beijing tries to represent, while Google and YouTube do a much better job of creating a space where East Asians can fight with each other on the same page? I hope this case shows my intention to make this an interesting issue for multi-disciplinary research rather than to blame any particular group of people.

I hope we are debating "Information wants to be ASCII or Unicode" versus "Information wants to be digital", not "Information moving from ASCII to Unicode is difficult". Then the issue would be clearer. Who decides what digital standards should be selected and deployed? What is the negotiation process? And why? Operating systems, global websites, regional websites, e-government services, citation databases, etc. are all domains where we should ask.

-- Han-Teng Liao PhD Candidate Oxford Internet Institute http://www.oii.ox.ac.uk/people/students.cfm?id=123

Joseph Reagle

12:50 p.m.

On Thursday 16 July 2009, Han-Teng Liao (OII) wrote:

...

ask. We (internet researchers) need empirical research to see why and how the Unicode support is implemented in various projects.

I did not appreciate this point, and it is an interesting one. I haven't followed the literature that takes on standardization as a business or social science concern and so don't know if people have focused on Unicode at all. (I'm thinking of continuations of Cargill's "Open Systems Standardization" and Agre's course "Institutional Aspects of Computing" from the 90s.)

...

Imagine Wikipedia project does not manage to implement the Unicode when it is hard.

As an aside, chapter 6 of Andrew Lih's "The Wikipedia Revolution" addresses the extremely clever way that Wikimedia translates between Chinese writing systems, also: http://meta.wikimedia.org/wiki/Automatic_conversion_between_simplified_and_t...

...

I hope we are debating on "Information wants to be ASCII"

This issue was also a big one at Project Gutenberg, which I touch upon in my dissertation, also: http://osdir.com/ml/culture.literature.e-books.gutenberg.volunteers/2004-01/...

Andrew Russell

2:39 p.m.

On Jul 17, 2009, at 8:50 AM, Joseph Reagle wrote:

...

On Thursday 16 July 2009, Han-Teng Liao (OII) wrote:

...
ask. We (internet researchers) need empirical research to see why and how the Unicode support is implemented in various projects.

I did not appreciate this point, and it is an interesting one. I haven't followed the literature that takes on standardization as a business or social science concern and so don't know if people have focused on Unicode at all. (I'm thinking of continuations of Cargill's "Open Systems Standardization" and Agre's course "Institutional Aspects of Computing" from the 90s.)

It is an interesting point, and I am more comfortable with the issues being framed this way - that is, looking at what people do rather than what information "wants". We had a session on standards at SHOT last year (Session 42 at http://shotlisbon2008.com/program/conferenceschedule13.htm) with one paper on Korean Standard Character Code and Unicode by Dong-oh Park. A PDF of his abstract is available from the link; I don't know if he's on this list, but I can find an email address for him if you want to follow up to get the full paper & references.

Andy

Mike Stanger

6:46 p.m.

I'll combine a response to multiple messages in one, hopefully I don't break the context:

...

First of all, I have to reframe the question in different way. Is the problem of ASCII or the problem of Unicode we are talking about? On the one extreme we can argue that would it be nice that every domain names, hyperlinks and URL should stay in English alphabets (which enters the ICANN multilingual issue which I aim to avoid in this discussion), on the other extreme we can argue that there would be no problems if everyone is using Unicode now (which implies a coercive force to impose that without the usual technology diffusion). [snip] Again, I am not arguing that the transition from non-Unicode to Unicode is easy and could be done overnight, and hence I have no intention to imply that it is all programmers' unwillingness and laziness to finish the mundane jobs. It is the opposite. If we lay out why, how much and how Wikipedia, youtube, Google and etc. invest in Unicode deployment (exploiting the open nature of Internet), we can better understand the richer dimensions of techno- linguistic polices. It is not my intention to play blame game (the west versus east or the programmers versus users). It is the opposite. Why Baidu supports simplified Chinese versions of services, excluding Tibetans, Hong Kongese and even Taiwanese whom Beijing try to represent while Google and Youtube do much better jobs in creating a space where East Asians can fight with each other on the same page. I hope this case shows my intention to make this an interesting research issue for mutli-discinplinary research than blaming any particular groups of people. I hope we are debating on "Information wants to be ASCII or Unicode" versus "Information wants to be digital", not "Information moving from ASCII to Unicode is difficult". Then the issue would be clearer. Who decides what digital standards should be selected and deployed. What is the negotiation process. And why? 
Operating systems, global websites, regional websites, e-government services, citation databases etc are all the domains we should ask.

I think with this reframing of your question I understand the issue you pose better: I was addressing commentary that I often hear in other contexts where Unicode is proposed as a 'solution' to multi-language representation in applications/sites/documentation at a trivial level, i.e. if we use Unicode, we can support any character, ergo we can support any language. That's obviously incorrect, as you mention above.

The case of Baidu would be a very interesting one for seeing what pressures may be at play, given its inception largely as a media search site and its later, apparently official, 'licensing' from Beijing itself in order to add functionality. The effect of that interaction on the company's decision (assuming an active process) to support only jianti characters would be interesting to follow.

...

I cannot speak for all those open source contributors out there. I did not even try to find the regional and linguistic demographics of open source community. Though I am a big fan of "good will" in Reagle's thesis, I cannot overlook the potentials of competitions and creative conflicts among all branches of open source projects. Then the question would be, who should make this efforts? I will argue that the weight is overwhelmingly weighted on people who has to use Unicode. In practice, it easily becomes a favor to be asked from those who need Unicode, and extra work to be done by the IT support. Then Unicode the solution becomes a problem. I am not saying there is no problem in Unicode implementation. The reason why I raise the problem here in the AOIR mailing list, not in the Unicode mailing list is not to reaffirm the perception that adoption of Unicode could be difficult, but rather raise the relevant research issues around it.

There are a number of interesting aspects to follow

i) the resources required to support the use of Unicode with the intent to provide, say, the ability for a site to be read in both jiantizi and fantizi (at least for one scope, given the example of Baidu)

ii) the negotiation of the process of support within the Open Source community - as you say, is the weight of the responsibility on the people who need the support?

iii) the reasons that an institutional entity (say a business or university) might choose to expend the resources to provide Unicode as a piece of the base infrastructure. (eg. market share, goodwill, officially stated requirements)

The variant I would expect (for what that is worth) is the most complex would be:

iv) the reasons that an entity chooses to use an infrastructure that excludes the ability to support, say, jianti and fanti ... eg. Chinese university websites such as Jilin Daxue (just using that as an example because I was there for a couple of semesters in the early 90s, they're not an exception, just an example at the top of my mind) -- The school has students from Taiwan, Japan, Russia and other places around the world, including those who are only experienced with fantizi, but the pages are encoded as GB2312. Is that because they feel their target market may be better supported with GB2312 (eg. some having computers with older versions of operating systems that will support GB2312 but not UTF-8, but systems that support UTF-8 will also support GB2312, ergo they're just addressing the lowest common denominator of their market)? Is there an official edict that universities should only use the national standard character set, regardless of who they might target (which would seem to be counter-productive from a marketing stand-point)? Or was there no active decision at all: was the website created with existing tools and support people who haven't considered the implications and haven't made an active choice?
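To make the jianti/fanti question above concrete, here is a minimal Python sketch (not from the original messages) that checks which sample strings each encoding can represent at all. The sample strings and the helper name `can_encode` are illustrative assumptions; the codec names are the ones shipped with Python's standard library. It says nothing about fonts or input methods, only about whether the characters have a place in the encoding.

```python
# Sample strings chosen for illustration only (not taken from the thread).
SAMPLES = {
    "jianti": "汉字",   # simplified Chinese
    "fanti": "漢字",    # traditional Chinese
    "tibetan": "བོད",   # Tibetan
}

# Codec names as registered in Python's standard library.
CODECS = ["ascii", "gb2312", "gb18030", "utf-8"]


def can_encode(text: str, codec: str) -> bool:
    """Return True if every character of `text` is representable in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def support_table() -> dict:
    """Map each sample name to {codec: representable?}."""
    return {
        name: {codec: can_encode(text, codec) for codec in CODECS}
        for name, text in SAMPLES.items()
    }


if __name__ == "__main__":
    for name, row in support_table().items():
        print(name, row)
```

Under these assumptions, GB2312 accepts only the jianti sample, while GB 18030 and UTF-8 accept all three; a page declared as GB2312 therefore has no way to carry the fanti or Tibetan text, whatever the reader's machine can display.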
Our own University's website is almost entirely in English even though our country is officially French/English bilingual. Supporting both French and English is a simple problem, but what are the reasons that French is not supported? (Being close to the problem, I'd suspect that resources and target market are the primary reasons, as well as the lack of a central web content management system.) We also have a connection with Zhejiang University (a joint degree program), which is seen as a key connection to the internationalization of the university: the page that describes this has one line of Chinese text in jiantizi ( http://www.cs.sfu.ca/undergrad/prospective/ddp/ ) but none in fanti, which in the history of Vancouver and environs has a much greater pool of readers, given that many locals of Chinese ethnicity were schooled in Hong Kong or other fanti-using countries, and most of the Chinese schools here have taught with fantizi as well. As a result, our local media (newspapers, television, etc.) in Chinese are all in fantizi.

...

I am not sure that "not using" Unicode can solve the interoperability issues. If the use of Unicode is one of the more attractive solutions that can deliver some interoperability (as Google, Wikipedia, YouTube, etc. try to do), then I do not know whether the two beliefs are "much the same sin". [snip] Agreed, using Unicode by itself cannot save the world. Still, would you mind showing me how not using Unicode, or using other alternatives, would solve the issues better? If such a solution or vision does exist, why have Google, Wikipedia, Microsoft, Linux, Mac, etc. adopted Unicode? I am not citing these examples to refute your argument. I am genuinely intrigued to find out why they arrived at certain solutions but not others (including maintaining the status quo by not deploying Unicode to some extent).

the "much the same sin" remark was in reference to using Unicode without providing additional layers that support true internationalization. Again, referring to the naïve approach that some take that using Unicode is sufficient to represent information, where the error is made in not understanding that Unicode is only a part of a set of tools that supports internationalization and/or localization. I suppose that I'm reading more into the term "solution" and "vision" than you intend.

...

I slightly disagree on the meaning of interoperability. If interoperability means that a certain linguistic space can still use a non-Unicode standard, then it may create a linguistic hierarchy. For example, Chinese sites can use GB2312 throughout their user-generated websites, and then Tibetan and traditional Chinese characters cannot have a voice. Again, imagine that YouTube could not automatically take content contributed by Arabic or Persian users, but offered only some kind of "interfaces" to promise interoperability. To me it is not about full support of Unicode at this moment; it is about awareness: the fact that Unicode is arguably the most open linguistic infrastructure receives little attention.

Then the sharp question will be, can Beijing, Washington, London, Tokyo deliver their government services and communicative spaces by sticking to their linguistic ghetto without using Unicode or other open linguistic architecture?

...

Agree, good will matters. Still, efforts to deliver that good will matter as well. I will exhibit some evidence in another email that inside Perl (the programming language that supports MediaWiki which makes Wikipedia possible) and the logo of Wikipedia and Chinese Wikipedia, most of the efforts are requested and done by those who need Unicode support. Then it is not only a picture of good will but some kind of push and pull.

MediaWiki/Wikipedia are written in PHP, not Perl (unless historical versions used Perl? If so, I was previously unaware of that - I've only worked with MediaWiki in PHP). The push and pull is an interesting aspect, and in the case of MediaWiki / Wikipedia, it's a good example of the needs of the community being somewhat supported by those who need the functionality. Another example, though is a slightly different variant: Facebook: a commercial entity whose localization efforts seem to be community based (eg. translations are done largely by volunteers on a request by Facebook for participants, presumably as a result of requests from users -- I've not quite figured out who was the group of users who supported the English (Pirate) translation :-) )....

...

Overall, from the above evidence, it could be argued that Wikipedia's internationalization is a clear effort to adopt the Unicode standard, made mostly by the crowd that needs Unicode. It is worth pointing out that around 2001 and 2002, the major operating systems such as Microsoft and Mac that most ordinary PC users ran at that time do not seem to have been Unicode-ready yet, which makes this development in Wikipedia more interesting.

...

Again, coming back to the original question: why does Wikipedia want to be Unicode? Or... why did Wikipedia not choose other solutions to deliver interoperability?

Thinking about MediaWiki and PHP: Looking at the PHP history page ( http://ca3.php.net/manual/en/history.php.php ) and other places, I cannot determine when proper internationalization support was achieved, but do notice that a true Unicode module is still in the internal development phase ( http://www.php.net/manual/en/intro.unicode.php ) .. but I have to wonder how the Wikipedia site would have developed 'internationally' had the development environment been different. In the programming language Java, the default character encoding has always been Unicode (allow me to use the term inaccurately for convenience) as far as I am aware. But given that it was intended initially as a language to support set-top appliances that would likely be sold internationally, was that simply a 'corporate decision'? How might MediaWiki have developed if written in Java initially?

And to follow on to this point in another message:

On Jul 17, 2009, at 7:39:30 AM PDT (CA), Andrew Russell wrote:

...

On Jul 17, 2009, at 8:50 AM, Joseph Reagle wrote:

...
On Thursday 16 July 2009, Han-Teng Liao (OII) wrote:

...
ask. We (internet researchers) need empirical research to see why and how the Unicode support is implemented in various projects.

I did not appreciate this point, and it is an interesting one. I haven't followed the literature that takes on standardization as a business or social science concern and so don't know if people have focused on Unicode at all. (I'm thinking of continuations of Cargill's "Open Systems Standardization" and Agre's course "Institutional Aspects of Computing" from the 90s.)

It is an interesting point, and I am more comfortable with the issues being framed this way - that is, looking at what people do rather than what information "wants".

True, though I think there is an interesting path that could be taken in the sense of what information 'wants' by making a small indirect reference to W.J.T. Mitchell's work (What do Pictures Want): seeing 'want' as both meaning "to lack" (as in being denied a means to participate in a particular forum such as Baidu's sites) and how information/language has no power as an agent alone without another agent to receive and process it. One could make an argument that the development of Unicode itself is the expression of the desire of information to have power and meaning across boundaries. Following that line of thought, the question could be asked: Given that information has no power without the ability to be communicated, what does an entity gain or lose by adopting a standard such as Unicode, (eg. the control of messages, the acquisition of markets, the benefit of intercultural communication for its own sake, etc.) and how does that affect power relations (etc.) Mike

han-teng.liao＠oii

31 Oct 31 Oct

11:32 a.m.

New subject: Information wants to be ASCII or Unicode? Tibetan-written information cannot be ASCII anyway.

Thank you, Mike Stanger, for a detailed and sincere reply. I am happy to correct some of my mistakes and to insist on the issue of Unicode adoption (now that ICANN has announced that non-Latin domain names have been approved, which will possibly complicate the conversation in the near future).

Before we go into the nitty-gritty of the discussion, allow me to reframe the issue in terms of "default culture" and "redundancy". We should be able to agree that, in terms of language and because of historical development, Latin-based English is the default culture of the World Wide Web, which means that English typing is available to almost everyone. Unicode aims to provide a utopia where all languages can be digitally processed and everyone can display and input them. The reality now is somewhere in between. Extra effort seems to be necessary for specific language support. Unicode is merely an architecture for a utopia yet to be implemented. Being Unicode-ready cannot solve the language capacity problem immediately. On the other hand, without a Unicode architecture, languages have little chance to co-exist, because they lack the prior agreement that Unicode provides.

Once we have a clear distinction between architecture / infrastructure (I really don't have a nice metaphor for this) and actual implementation / support, we can see that Unicode is an architectural "solution" for multilingual support whose "actual implementation" is pending. Now, if a small business owner in North America wants to stay in an ASCII or Latin-only environment, that is his or her own choice to make, especially if the market for non-Latin support is small for him or her.
However, if big players like universities, governments, etc. tell me that they cannot support multilingual capacity yet for various reasons (under-investment, lack of expertise, lack of demand, etc.), I would suggest this: (1) Try everything you can to be Unicode-ready, which is not that difficult these days if we are not asking for Unicode-complete or Unicode-to-the-core. (2) Leave room for future implementation.

I support this suggestion with two arguments: (1) the extra cost of moving from a Latin-only architecture to a Unicode-ready architecture is increasingly reduced to the extra *redundancy* of storage space; (2) this extra *redundant* storage space is nothing compared to the demands of multimedia materials.

Metaphorically, big players should build the houses right from the very beginning. It is another issue if the new rooms in the houses (redundant rooms for other languages) are still empty and cannot yet practically accommodate other languages. People in the future, or users out there, can help and work on that. At least the space is available. And I will later argue that it is not really a redundant gesture but a necessary one for an environment which values openness.

Therefore, I am aware, and do agree, that full support for displaying and inputting every language on every single personal computer is not necessary. Still, I am of the strong opinion that any service website made by a big player should start using a Unicode architecture, as most of them already have. The Jilin Daxue example you mentioned is a perfect illustration. For a few years now, Chinese universities and governments have had the choice of a Unicode-compatible national standard called GB 18030 (http://en.wikipedia.org/wiki/GB18030). It is claimed to solve the issue of simplified/orthodox Chinese characters (or jianti and fanti) and even to cover Mongolian scripts <http://en.wikipedia.org/wiki/Mongolian_script> and Tibetan scripts <http://en.wikipedia.org/wiki/Tibetan_script>.
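The storage "redundancy" argument above can be put in rough numbers. The following Python sketch is only an illustration under stated assumptions: it takes UTF-8 as the Unicode encoding (the message does not name one), and the sample strings and the helper name `encoded_size` are hypothetical.

```python
def encoded_size(text: str, codec: str) -> int:
    """Number of bytes `text` occupies when stored in `codec`."""
    return len(text.encode(codec))


# Illustrative samples, not taken from any real site.
LATIN = "Information wants to be Unicode"
JIANTI = "信息"  # "information" in simplified Chinese

if __name__ == "__main__":
    for codec in ("ascii", "utf-8"):
        print("Latin text ", codec, encoded_size(LATIN, codec))
    for codec in ("gb2312", "gb18030", "utf-8"):
        print("jianti text", codec, encoded_size(JIANTI, codec))
```

For Latin-only text the UTF-8 size equals the ASCII size, so being Unicode-ready costs nothing there. For the Chinese sample, UTF-8 uses three bytes per character against two in GB2312 or GB 18030, which is the kind of overhead the message calls small next to multimedia content.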
In addition, since 2006 Beijing has mandated that all software sold in China must support this standard. It is then very peculiar that many websites and webpages in China are still GB2312-only (and thus simplified-Chinese-only) when the software they use should be, as mandated by the authorities in Beijing, GB 18030-ready and thus Unicode-ready. On GB2312-only websites, traditional Chinese characters, Mongolian scripts and Tibetan scripts are denied "existence"; only the Latin alphabet is spared.

So in the eyes of westerners, it may be okay to stay in a Latin-only environment, where other languages being denied "existence" may not seem such a big issue. But what about this? "Name Not on Our List? Change It, China Says" http://www.nytimes.com/2009/04/21/world/asia/21china.html (For the geeky audience: the Chinese character mentioned in the New York Times is supported by both GB 18030 and Unicode, so it may be because the character is a traditional/orthodox one.....)

Therefore, could we agree that full Unicode support may depend on the demand and resources of a given IT project, but that there is no need to stick to a Latin-only architecture or text when the extra redundant storage space is only a low price to pay for future extension, a good gesture, and a statement of language neutrality?

I have just checked SFU's website. I am pleasantly surprised that its pages are already encoded in Unicode. I do not mind that they only have English and Canadian content, which simply reflects the cultural and political context in which the university is situated. However, the fact that they use Unicode as the architecture for web content proves my point. It means that if any members of the University want to contribute in, or mix, the languages of their choice, they are not automatically denied because the fundamental web content architecture does not support those languages.
How would you explain why most Chinese government websites still stick to GB 2312 when the government literally mandates that software support GB 18030, which is more inclusive? The SFU case in Canada is a nice contrast: they adopted Unicode for the website anyway, even though the official languages of Canada can easily be supported by a Latin-only encoding.

Going back to my initial question, "Information wants to be ASCII or Unicode?": I hope I have made the point that information should be Unicode, so as to avoid the situation where some languages are fundamentally denied digital existence at the level of script or character encoding. I am aware that on personal computers, universal support for all languages (including typing and displaying) is up to individual choice. However, I have to insist that information, online or offline, should be Unicode-ready. It is one thing to sponsor everyone to come to an open party. It is another thing to invite everyone to an open party. I am insisting on the latter. Unicode is an open invitation.

Following Dr. Andrew L. Russell's suggestion of re-framing the issue, it would be like this: information wants to be Unicode because people are nice enough to invite every language into the digital world. (Maybe that is not the case for some state players ...... language politics.)

In response to the issue of Project Gutenberg, I respect people's choice between plain text and HTML formats. However, that choice should not be conflated with the choice between Unicode and ASCII. One can perfectly well have a Unicode plain text file.

-----

*Correction:* Indeed, as Dr. Mike Stanger has rightly corrected, "MediaWiki/Wikipedia are written in PHP, not Perl". It is my own mistake and bad memory. Orz

...


PS. As part of a bigger endeavor to trace which open source community members have made Unicode support possible, it is interesting to point out that the Unicode guy inside the PHP community is probably Andrei Zmievski. His Russian heritage and the blog entry "My name is not really Andrei" may be of interest. http://zmievski.org/2006/07/my-name-is-not-really-andrei

Mike Stanger wrote:

...

I'll combine a response to multiple messages in one, hopefully I don't break the context:

[snip]

I think with this reframing of your question I understand the issue you pose better: I was addressing commentary that I often hear in other contexts where Unicode is proposed as a 'solution' to multi-language representation in applications/sites/documentation at a trivial level. ie: if we use Unicode, we can support any character, ergo we can support any language, but that's obviously incorrect as you mention above.

The case of Baidu would be a very interesting one to see what pressures may be at play, given its inception largely as a media search site and its later, apparently official, 'licensing' from Beijing itself in order to add functionality. The effect of that interaction on the decision of the company (assuming an active process) to support only jianti characters would be interesting to follow.

[snip]

There are a number of interesting aspects to follow

i) the resources required to support the use of Unicode with the intent to provide, say, the ability for a site to be read in both jiantizi and fantizi (at least for one scope, given the example of Baidu)

ii) the negotiation of the process of support within the Open Source community - as you say, is the weight of the responsibility on the people who need the support?

iii) the reasons that an institutional entity (say a business or university) might choose to expend the resources to provide Unicode as a piece of the base infrastructure. (eg. market share, goodwill, officially stated requirements)

The variant I would expect (for what that is worth) is the most complex would be:

iv) the reasons that an entity chooses to use an infrastructure that excludes the ability to support, say, jianti and fanti ... eg. Chinese university websites such as Jilin Daxue (just using that as an example because I was there for a couple of semesters in the early 90s, they're not an exception, just an example at the top of my mind) -- The school has students from Taiwan, Japan, Russia and other places around the world, including those who are only experienced with fantizi, but the pages are encoded as GB2312 . Is that because they feel their target market may be better supported with GB2312 (eg. some having computers with older versions of operating systems that will support GB2312 but not UTF-8, but systems that support UTF-8 will also support GB2312, ergo they're just addressing the lowest common denominator of their market)? Is there an official edict that universities should only use the national standard character set, regardless of who they might target (which would seem to be counter-productive from a marketing stand-point)? Or was there no active decision at all: was the website created with existing tools and support people who haven't considered the implications and haven't made an active choice?

Our own University's website is almost entirely in English even though our country is officially French/English bilingual. Supporting both French and English is a simple problem, but what are the reasons that French is not supported? (Being close to the problem, I'd suspect that resources and target market are the primary reasons, as well as the lack of a central web content management system.) We also have a connection with Zhejiang University (a joint degree program), which is seen as a key connection to the internationalization of the university: the page that describes this has one line of Chinese text in jiantizi ( http://www.cs.sfu.ca/undergrad/prospective/ddp/ ) but none in fanti, which in the history of Vancouver and environs has a much greater pool of readers, given that many locals of Chinese ethnicity were schooled in Hong Kong or other fanti-using countries, and most of the Chinese schools here have taught with fantizi as well. As a result, our local media (newspapers, television, etc.) in Chinese are all in fantizi.

[snip]

the "much the same sin" remark was in reference to using Unicode without providing additional layers that support true internationalization. Again, referring to the naïve approach that some take that using Unicode is sufficient to represent information, where the error is made in not understanding that Unicode is only a part of a set of tools that supports internationalization and/or localization. I suppose that I'm reading more into the term "solution" and "vision" than you intend.

[snip]

MediaWiki/Wikipedia are written in PHP, not Perl (unless historical versions used Perl? If so, I was previously unaware of that - I've only worked with MediaWiki in PHP).

The push and pull is an interesting aspect, and in the case of MediaWiki / Wikipedia, it's a good example of the needs of the community being somewhat supported by those who need the functionality. Another example, though is a slightly different variant: Facebook: a commercial entity whose localization efforts seem to be community based (eg. translations are done largely by volunteers on a request by Facebook for participants, presumably as a result of requests from users -- I've not quite figured out who was the group of users who supported the English (Pirate) translation :-) )....

[snip]

Thinking about MediaWiki and PHP: Looking at the PHP history page ( http://ca3.php.net/manual/en/history.php.php ) and other places, I cannot determine when proper internationalization support was achieved, but do notice that a true Unicode module is still in the internal development phase ( http://www.php.net/manual/en/intro.unicode.php ) .. but I have to wonder how the Wikipedia site would have developed 'internationally' had the development environment been different. In the programming language Java, the default character encoding has always been Unicode (allow me to use the term inaccurately for convenience) as far as I am aware. But given that it was intended initially as a language to support set-top appliances that would likely be sold internationally, was that simply a 'corporate decision'? How might MediaWiki have developed if written in Java initially?

And to follow on to this point in another message:

[snip]

True, though I think there is an interesting path that could be taken in the sense of what information 'wants' by making a small indirect reference to W.J.T. Mitchell's work (What do Pictures Want): seeing 'want' as both meaning "to lack" (as in being denied a means to participate in a particular forum such as Baidu's sites) and how information/language has no power as an agent alone without another agent to receive and process it. One could make an argument that the development of Unicode itself is the expression of the desire of information to have power and meaning across boundaries. Following that line of thought, the question could be asked: Given that information has no power without the ability to be communicated, what does an entity gain or lose by adopting a standard such as Unicode, (eg. the control of messages, the acquisition of markets, the benefit of intercultural communication for its own sake, etc.) and how does that affect power relations (etc.)

Mike

Han-Teng Liao (OII)

17 Jul 17 Jul

5:02 a.m.

At the risk of taking your comments out of context, I have listed the following responses.

Mike Stanger wrote:

...

......The use of Unicode believing that it solves the interoperability issues and/or is a communication about the intent of the programmer is much the same sin, in my view.

I am not sure that "not using" Unicode can solve the interoperability issues. If the use of Unicode is one of the more attractive solutions that can deliver some interoperability (as Google, Wikipedia, YouTube, etc. try to do), then I do not know whether the two beliefs are "much the same sin".

...

...... However, just using unicode isn't going to resolve all of the interoperability issues (eg. reading direction, and other unique features of the written form of a particular language, etc.).

Agreed, using Unicode by itself cannot save the world. Still, would you mind showing me how not using Unicode, or using other alternatives, would solve the issues better? If such a solution or vision does exist, why have Google, Wikipedia, Microsoft, Linux, Mac, etc. adopted Unicode? I am not citing these examples to refute your argument. I am genuinely intrigued to find out why they arrived at certain solutions but not others (including maintaining the status quo by not deploying Unicode to some extent).
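The reading-direction point quoted above can be illustrated with a short Python sketch (an illustration only, not anything from the original exchange). The Unicode character database records a bidirectional class for every character, and Python's standard `unicodedata` module exposes it; the sample words and the helper name `bidi_classes` are assumptions made for the example.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Return the Unicode bidirectional class of each character in `text`."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Illustrative samples: Latin, Persian/Arabic "salam", Hebrew "shalom".
SAMPLES = {
    "latin": "abc",
    "persian_arabic": "سلام",
    "hebrew": "שלום",
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, bidi_classes(text))
```

The classes come back as `L` (left-to-right) for the Latin sample, `AL` for the Arabic-script letters and `R` for the Hebrew ones. The encoding therefore carries the information that the text runs right to left, but the strings are stored in logical order and it is still up to the displaying application to lay them out correctly, which is the extra layer of work the quoted remark refers to.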

...

Ultimately though, what data storage in Unicode does provide almost automatically is the preservation of the appropriate data (unless it gets transformed of course), and its use /could potentially/ signal the intent by the author to enable the coexistence of mixed language content as a politically friendly gesture. I would agree that character encodings could potentially send a signal about the /intent/ to be good internet citizens, or that the /intentional/ use of something other than unicode could be seen as a statement of political position (eg. mainland China's use of jianti character sets in a particular code page vs. a codepage that supported fanti).

Agree, good will matters. Still, efforts to deliver that good will matter as well. I will exhibit some evidence in another email that inside Perl (the programming language that supports MediaWiki which makes Wikipedia possible) and the logo of Wikipedia and Chinese Wikipedia, most of the efforts are requested and done by those who need Unicode support. Then it is not only a picture of good will but some kind of push and pull.

...

However, I think often programmer intent is lost in the end-product. It would be encouraging to see a movement where programmers stated that their /active decision/ to use Unicode is a deliberate recognition of the multitude of languages as a 'politically friendly' gesture.

"Politically friendly" or "politically correct" could be a bit patronizing. I would argue instead that Wikipedia itself benefits from its other language versions (ranking higher in search results, a better webometric position, etc.).

...

I also assume that there are many coders who are using unicode, but doing so less than deliberately, perhaps even as a side-effect of the development environment that they use (eg. Java's native character/string support), /mirroring the use of ASCII in earlier environments/. These applications may well support Unicode at the character level, but because the programmer's use of Unicode is a sort of side-effect, the end product may not actually interoperate with other languages properly or completely. So while I agree that the use of Unicode is a step forward in interoperability, I'd argue that the work to be done is not so much about the use of Unicode, but the /publicly stated intent to be interoperable/. Unicode may be one tool that can assist in that goal if used properly, but the use of Unicode alone says little about intent.

I slightly disagree on the meaning of interoperability. If interoperability means that a certain linguistic space can still use a non-Unicode standard, then it may create a linguistic hierarchy. For example, Chinese sites can use GB2312 throughout their user-generated websites, and then Tibetan and traditional Chinese characters cannot have a voice. Again, imagine that YouTube could not automatically take content contributed by Arabic or Persian users, but offered only some kind of "interfaces" to promise interoperability. To me it is not about full support of Unicode at this moment; it is about awareness: the fact that Unicode is arguably the most open linguistic infrastructure receives little attention.

Then the sharp question will be: can Beijing, Washington, London and Tokyo deliver their government services and communicative spaces by sticking to their linguistic ghettos, without using Unicode or another open linguistic architecture?

--
Han-Teng Liao
PhD Candidate
Oxford Internet Institute
http://www.oii.ox.ac.uk/people/students.cfm?id=123

Han-Teng Liao (OII)

5:24 a.m.

Using Wikipedia as a case to further the discussion:

(1) The history of the Wikipedia logo: from English-only to an international identity ..... and some mistakes along the way...
http://en.wikipedia.org/wiki/Wikipedia:Wikipedia_logos
http://meta.wikimedia.org/wiki/Wikipedia/Logo

(2) An unsung hero (in my personal view, open to debate): Autrijus Tang's effort in Perl internationalization
http://www.perl.com/pub/a/2005/09/08/autrijus-tang.html
Tang is a Taiwanese hacker.

(3) Unicode support in Wikipedia
I have trouble locating the version control files that would show when Unicode began to be supported and when it was fully supported.
http://meta.wikimedia.org/wiki/Wikipedia_timeline (Unicode is not mentioned here)
However, the entry for "Chinese Wikipedia" in the English Wikipedia has the following paragraphs:

==========================
The Chinese Wikipedia was established along with 12 other Wikipedias in May 2001. At the beginning, however, the Chinese Wikipedia did not support Chinese characters <http://en.wikipedia.org/wiki/Chinese_character>, and had no encyclopedic content. It was in October 2002 that the first Chinese-language page was written, the Main Page <http://zh.wikipedia.org/wiki/>. The first registered user of the Chinese Wikipedia was Mountain. A software update <http://en.wikipedia.org/wiki/Software_update> on October 27 <http://en.wikipedia.org/wiki/October_27>, 2002 <http://en.wikipedia.org/wiki/2002> allowed Chinese language input.
.....
In order to accommodate the orthographic differences between simplified Chinese <http://en.wikipedia.org/wiki/Simplified_Chinese> and traditional Chinese <http://en.wikipedia.org/wiki/Traditional_Chinese> (or Orthodox Chinese), from 2002 to 2003, Chinese Wikipedia community gradually decided to combine the two originally separate versions of Chinese Wikipedia. The first running automatic conversion between the two orthographic representation starts from December 23, 2004, with MediaWiki 1.4 release. The needs from Hong Kong and Singapore were taken into accounts in MediaWiki 1.4.2 release, which made conversion table for zh-sg default to zh-cn, and zh-hk default to zh-tw. [2] <http://en.wikipedia.org/wiki/Chinese_Wikipedia#cite_note-1>
==========================

Overall, from the above evidence, it could be argued that Wikipedia's internationalization is a clear effort to adopt the Unicode standard, made mostly by the crowd that needs Unicode. It is worth pointing out that around 2001 and 2002, the major operating systems such as Microsoft and Mac that most ordinary PC users ran at that time do not seem to have been Unicode-ready yet, which makes this development in Wikipedia more interesting.

Again, coming back to the original question: why does Wikipedia want to be Unicode? Or... why did Wikipedia not choose other solutions to deliver interoperability?

--
Han-Teng Liao
PhD Candidate
Oxford Internet Institute
http://www.oii.ox.ac.uk/people/students.cfm?id=123

Han-Teng Liao (OII) wrote:

...

At the risk of taking your comments out of context, I have listed the following responses.

Mike Stanger wrote:

...
......The use of Unicode believing that it solves the interoperability issues and/or is a communication about the intent of the programmer is much the same sin, in my view.

I am not sure that "not using" Unicode can solve the interoperability issues. If the use of Unicode is one of the more attractive ways to deliver some interoperability (as Google, Wikipedia, YouTube, etc. try to do), then I do not know whether the two beliefs are "much the same sin".

...... However, just using unicode isn't going to resolve all of the interoperability issues (eg. reading direction, and other unique features of the written form of a particular language, etc.).

Agreed, using Unicode by itself cannot save the world. Still, would you mind showing me how not using Unicode, or using some alternative, would solve the issues better? If such a solution or vision does exist, why have Google, Wikipedia, Microsoft, Linux, Mac, etc. adopted Unicode? I am not citing these examples to refute your argument. I am genuinely intrigued to find out why they arrived at certain solutions but not others (including maintaining the status quo by not deploying Unicode to some extent).

Ultimately though, what data storage in Unicode does provide almost automatically is the preservation of the appropriate data (unless it gets transformed of course), and its use /could potentially/ signal the intent by the author to enable the coexistence of mixed language content as a politically friendly gesture. I would agree that character encodings could potentially send a signal about the /intent/ to be good internet citizens, or that the /intentional/ use of something other than unicode could be seen as a statement of political position (eg. mainland China's use of jianti character sets in a particular code page vs. a codepage that supported fanti).

Agreed, good will matters. Still, the effort to deliver on that good will matters as well.
I will present some evidence in another email that, in Perl (the language of UseModWiki, the software Wikipedia originally ran on; MediaWiki itself is written in PHP) and in the logos of Wikipedia and Chinese Wikipedia, most of the effort was requested and carried out by those who need Unicode support. So it is not only a picture of good will but also some kind of push and pull.

...
However, I think often programmer intent is lost in the end-product. It would be encouraging to see a movement where programmers stated that their /active decision/ to use Unicode is a deliberate recognition of the multitude of languages as a 'politically friendly' gesture.

"Politically friendly" or "politically correct" could sound a bit patronizing. I would argue that Wikipedia itself benefits more from its other language versions (ranking higher in search results, a better webometric position, etc.).

I also assume that there are many coders who are using unicode, but doing so less than deliberately, perhaps even as a side-effect of the development environment that they use (eg. Java's native character/string support), /mirroring the use of ASCII in earlier environments/. These applications may well support Unicode at the character level, but because the programmer's use of Unicode is a sort of side-effect, the end product may not actually interoperate with other languages properly or completely. So while I agree that the use of Unicode is a step forward in interoperability, I'd argue that the work to be done is not so much about the use of Unicode, but the /'publicly' stated intent to be interoperable/. Unicode may be one tool that can assist in that goal if used properly, but the use of Unicode alone says little about intent.

I slightly disagree on the meaning of interoperability. If interoperability means that a certain linguistic space can keep using a non-Unicode standard, then it may create a linguistic hierarchy. For example, if Chinese user-generated websites use GB2312 throughout, then Tibetan and traditional Chinese characters cannot have a voice. Again, imagine if YouTube could not automatically take content contributed by Arabic or Persian users, and offered only some kind of "interfaces" to promise interoperability. To me it is not about full support of Unicode at this moment; it is about awareness. The fact that Unicode is arguably the most open linguistic infrastructure receives little attention.
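
A minimal sketch of the coverage point above, assuming Python 3 and its standard codecs. The sample strings and the helper names can_encode and coverage_table are illustrative assumptions, not taken from the thread. It simply asks, for each sample, whether a given encoding can represent the text at all. Under these assumptions one would expect only the simplified Chinese sample to fit in GB2312, while GB 18030 and UTF-8 accept everything.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Sample strings chosen for illustration only (hypothetical data).
SAMPLES = {
    "simplified Chinese": "汉语",
    "traditional Chinese": "漢語",
    "Tibetan": "བོད་ཡིག",
    "Arabic": "العربية",
}

# Codec names as registered in Python's standard library.
ENCODINGS = ["ascii", "gb2312", "gb18030", "utf-8"]


def coverage_table(samples=SAMPLES, encodings=ENCODINGS):
    """Map each sample name to {encoding: can it be encoded?}."""
    return {
        name: {enc: can_encode(text, enc) for enc in encodings}
        for name, text in samples.items()
    }


if __name__ == "__main__":
    for name, row in coverage_table().items():
        supported = [enc for enc, ok in row.items() if ok]
        print(f"{name}: {', '.join(supported) or 'none'}")
```

A site that stores everything in GB2312 would have to reject or mangle three of the four samples, which is the "linguistic hierarchy" concern in concrete form. Note that this only tests whether the bytes can be stored; it says nothing about fonts, rendering or text direction, which is the other half of the argument.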

The sharp question then is: can Beijing, Washington, London and Tokyo deliver their government services and communicative spaces while sticking to their own linguistic ghettos, without using Unicode or another open linguistic architecture?

