
1 Aug 2010
Big5 Messy Code / Unreadable Text Mapping to UTF8
In a recent project, I encountered a Big5 messy code (亂碼) issue. Messy code, or
unreadable text, usually appears when text is read with the wrong encoding.
For example, if you read Chinese Big5 text using the Western European (Windows)
encoding, you will see "¤j¤-½X¶Ã½X" instead of "大五碼亂碼". On a web page, you can
easily switch to the correct encoding (in the IE menu, just go to
View -> Encoding). For a non-web-based application, you need to write a
custom conversion program.
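
As a minimal sketch of how this kind of messy code arises, the Python snippet below encodes the sample string as Big5 and then reads the same bytes as Windows-1252. Python and its `big5` and `cp1252` codecs are my choice for illustration only; they are not the tools used in the project described here.

```python
# Illustration only: read Big5 bytes with the wrong (Western European) encoding.
original = "大五碼亂碼"

# The bytes as a Big5 system would store them (two bytes per character).
raw = original.encode("big5")

# Reading those bytes as Windows-1252 yields one Latin character per byte.
mojibake = raw.decode("cp1252")
print(repr(mojibake))

# The byte 0xAD decodes to U+00AD (soft hyphen), which many displays show
# as "-" or hide entirely, so the text on screen can differ from the data.
assert "\xad" in mojibake

# When every byte survived intact, the process can be reversed.
restored = mojibake.encode("cp1252").decode("big5")
print(restored)
```

This round trip only works when the stored text still carries every original byte unchanged; it is the simple case, not necessarily the situation in the project below.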
The invention of Unicode has solved most messy code problems, but there are
still many legacy systems that use the old way to represent Chinese, Japanese,
and Korean text.
Here is the scenario that I encountered. My client (actually not my client at
the very beginning) had an application that stored Chinese characters as
varchar (using a Windows-1252 collation) in a relatively old database (MS SQL
Server 2000). Their existing application (programmed in PowerBuilder 6) read
the Chinese text correctly. Unfortunately, this application could only run
correctly on Windows 98, which had to be phased out because of hardware
replacement issues. The original programmer told my client that it was very
difficult to upgrade the application and that he did not want to do the
upgrade for them. There was a pressing need to find a replacement solution.
My client approached me (and other software houses, I think) for advice and
help. I am not competent in PowerBuilder, but I told them that I could easily
use Microsoft .NET to do the conversion. They commissioned me to solve the
problem. At the beginning of the project, I tried various built-in .NET
encoding methods to convert the messy code, but in vain: none of them proved
helpful in this project. Finally, I had to write my own conversion program to
solve the problem. The idea is to map unreadable text to readable Unicode,
e.g. map "¤j" to "大", "¤-" to "五", and so on. A table of about 60,000
characters is sufficient for most of the commonly used Chinese characters.
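
The mapping idea can be sketched roughly as follows. This is not the original .NET program: it is a Python illustration, the names `build_table` and `convert` are hypothetical, and it assumes the table can be generated by enumerating standard Big5 byte pairs, whereas the real table may have been prepared differently.

```python
# Sketch of a lookup-table conversion from messy code to Unicode.
# build_table and convert are hypothetical names used for illustration.

def build_table():
    """Map each two-character messy-code pair to its Chinese character."""
    table = {}
    # Assumed ranges: standard Big5 lead bytes and trail bytes.
    trail_bytes = list(range(0x40, 0x7F)) + list(range(0xA1, 0xFF))
    for lead in range(0xA1, 0xFA):
        for trail in trail_bytes:
            raw = bytes([lead, trail])
            try:
                char = raw.decode("big5")
            except UnicodeDecodeError:
                continue  # byte pair not assigned in Big5
            # The key is how the same two bytes look as Windows-1252 text.
            table[raw.decode("cp1252")] = char
    return table

def convert(text, table):
    """Replace known pairs with Chinese characters; keep other text as-is."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

table = build_table()
# "\xad" is the soft hyphen that usually displays as "-".
print(convert("¤j¤\xad½X¶Ã½X", table))
```

Because the table is a plain dictionary, extra hand-made entries can be added for any pairs that the stored data represents differently from the generated keys.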
Although I spent extra time completing this project, it was an excellent
experience for my future projects.