Since 2004, the Tavultesoft website has had a reasonable degree of Unicode support.  It’s been a bit of a mixed bag, with limitations in technology choices at the time making it hard for us to achieve the level of Unicode support we wanted.  In particular, our database backend (Microsoft SQL Server) was not using Unicode columns, because our front end (PHP) did not have any stable drivers available for SQL Server that supported Unicode.  So we compromised, and stored our data in UTF-8 on what were supposed to be Windows-1252 “ANSI” columns.  This is not ideal.  But it got us over the hurdles and provided a level of Unicode support at the time.

The big risk with storing UTF-8 in a column intended for Windows-1252 codepage is that there are 5 codepoints unassigned in Windows-1252: 0x81, 0x8D, 0x8F, 0x90 and 0x9D – use of these codepoints could result in undefined behaviour.  But our tests showed that SQL Server didn’t care and just stored those bytes anyway.  Good enough for beggars.

Roll forward to today.  Microsoft now support a PHP SQL Server driver which, since version 1.1, has Unicode support via UTF-8 translation at the driver level.  Fantastic!  So how could we take advantage of this?  It turned out to be a lot less painful than we expected!

Step 1. Locate all the non-Unicode text columns

On our test server, we took a recent backup of the live database and started with that.  We analysed our database tables to find all VARCHAR, CHAR and TEXT columns.  Once we found these we deleted the indexes on these columns, including full text indexes.  There are many ways to find all the text columns in your database; probably the easiest is to query sys.columns for types 35, 167 and 175 (TEXT, VARCHAR and CHAR respectively).

Step 2. Convert the columns to Unicode equivalents

At this point we simply altered each column to its UTF-16 equivalent NVARCHAR, NCHAR or NVARCHAR(MAX).  Where a VARCHAR column had a size greater than 4000 characters, we opted to convert these into NVARCHAR(MAX) as 4000 characters is the maximum fixed column size for Unicode columns, as opposed to 8000 characters for non-Unicode columns.  For example, given a column customer.given_name VARCHAR(128) NOT NULL, we issued the following statement:

ALTER TABLE customer ALTER COLUMN given_name NVARCHAR(128) NOT NULL

This took the Windows-1252 data in the column and converted it to Unicode.  SQL Server of course did the job properly, for example, converting Windows-1252 0x80 (€ Euro sign) to the Unicode equivalent U+20AC.  Importantly, for us, with those 5 invalid bytes, SQL Server just zero-extended them and left them there.  These are considered control codes in Unicode, but we didn’t care, because we were about to fix them for good.

We did identify a number of columns that did not need to be widened, such as currency or country identifiers.  These we left as VARCHAR or CHAR columns accordingly.  The conversion process ran for 2 hours on the test server, but ran smoothly.

Step 3. Convert UTF-8 content into UTF-16

Now our data was in NVARCHAR columns, but it wasn’t right yet — it was stored as UTF-8 via Windows-1252 in the UTF-16 native type (do you follow?  I'm not sure I do…).  Each UTF-16 string needed to be converted back to Windows-1252, and from there, read as UTF-8 (without conversion!), and then converting that UTF-8 back into UTF-16.  This was theoretically possible within SQL Server with some clever T-SQL code, but it turned out to be faster and easier to write a small app to do this transformation for each column.  We used a messy LIKE clause (imagine all the characters between 128 and 255 in a row) to reduce the number of records that needed editing. This was not perfect as LIKE does not do binary matches (and hence matched more rows than it needed to) but cut the conversion time dramatically — many records could be excluded because they only contained ASCII characters.

At this point, we recreated the indexes.  Again, all this went smoothly on our test server.  Total time: 2 hours.

Step 4. Update T-SQL

We now needed to convert all our T-SQL data references into Unicode.  We dumped the full set of stored procedures, triggers, functions and more into a single massive text file, and used search and replace to find all instances of VARCHAR, TEXT, CHAR and also CHAR() and ASCII() function calls.  This covered all the cases that we used.  Despite seeming a huge task when we started, in the end it only took a couple of hours of careful replacing and checking to produce a monster T-SQL script.

When we ran this script on the test server, it all worked wonderfully and surprisingly quickly.  Total time: 2 hours of legwork, and 30 seconds of execution.

Step 5. Update PHP client code

OK, so our database was now all in Unicode, but PHP was still reading the data out and assuming it was UTF-8, which it no longer was.  So a change was needed to the sqlsrv_connect call:  Add the “CharacterSet” => “UTF-8” parameter.  That’s all — everything else was already assuming UTF-8.  Total time: negligible.

With that we were done.  Running various tests on the test server showed everything working great!

Step 6. Dry run on live infrastructure

Time to do a dry run on the live server.  Unfortunately this didn’t go so well.  We ran the process on a copy of the live database, on the live infrastructure, as moving the database onto alternative infrastructure was going to mean a lot of downtime.  First problem: we ran out of memory.  Our live server actually has less memory than our test server!  Nothing crashed, but constant swapping meant that the conversion was going to take forever.  Then we started to run out of disk space.  Our 5GB test database blew out to 25GB during the conversion!

Of course, this is why we did a dry run.  During this process, the Tavultesoft website ran slightly slower than normal, but not so much that it was be a problem.  Our database server is normally quite lightly loaded…

It turned out that one particular log table was huge and also did not really need urgent conversion to Unicode.  So we opted to skip that table and ran the dry run again.  This time, things went a bit better and the whole process (steps 2, 3 and 4 really) completed in under 90 minutes.

Step 6. The real thing

Time to run live.  We took a backup of the database, and then switched off the front end of the site with a friendly message telling our users we’d be back soon, and started the process.  80 minutes later, it was all done.  We had to rebuild the full text indexes, and the upgrade was done!

Categories: Web/Tech

0 thoughts on “Upgrading the Tavultesoft website infrastructure to improve its Unicode support”

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Related Posts

Using Keyman Desktop

A simple tool to create key images

This is just a short blog post to describe our in-house tool for creating key images, such as the ones shown below. How to use the key image generator Visit http://keyman.com/support/keytemplate.php Type the key sequence Read more…

Developing Keyboards

Announcing the Launch of KeymanWeb 2.0 – Open Source

We are pleased to announce the launch of KeymanWeb 2.0 as an open source software product, under the Apache License, Version 2.0. What is KeymanWeb? What is KeymanWeb? It’s an open source input method framework Read more…

Developing Keyboards

Creating a Touch Keyboard Layout for Amharic with Keyman Developer 9

A widely used Keyman Desktop phonetic keyboard for Amharic (developed by Geez Frontier Foundation) was also compiled and made available for use with KeymanWeb.  When used with web browsers on desktop computers, the behaviour of Read more…