How Do You Convert an 8-bit Database to Unicode?

Hi Community!

Suppose you have a working system on an 8-bit Caché database and need to transfer the solution to a Unicode database.

How do you manage that?

There is the community solution UConv, but it doesn't cover all the cases.

For some cases, an XML export of all globals on the 8-bit instance

d $system.OBJ.Export("x.gbl","c:\Temp\x.xml")

and an import on the Unicode instance can help, though this doesn't solve the $LB cases.
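The reason $LB data doesn't survive a plain recode is that $LISTBUILD stores each element with a byte-length header, so converting the whole value with $ZCVT corrupts those headers; each element has to be unpacked, converted, and repacked. A minimal sketch, assuming Latin2 source data and flat (non-nested) lists; `ConvertList` is a hypothetical helper, not part of any product:

```objectscript
ConvertList(val) ; hypothetical helper: recode each element of a $LB value
    new out,i
    set out=""
    for i=1:1:$listlength(val) {
        ; preserve omitted elements ($lb() is a one-element list with a missing value)
        if '$listdata(val,i) { set out=out_$listbuild() continue }
        ; convert element by element so the byte-length headers are rebuilt correctly;
        ; nested $LB values would need a recursive $listvalid check here
        set out=out_$listbuild($zconvert($list(val,i),"I","Latin2"))
    }
    quit out
```

Numbers pass through $ZCVT unchanged, but nested lists would still be corrupted, which is exactly the kind of case a universal tool has to detect.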

What works for you?


Answers

I'm sure it is quite difficult to write such a universal solution; every time it depends on your code. Years ago I wrote such a tool for an application with a textual interface, and I had to parse code for pseudographics there and convert it too. So, I would say you need a specific tool, just for your case.

Dmitry is absolutely right. I had to deal with this task quite a long time ago. I gave a presentation on this topic at our M&Caché technology group Czech & Slovak meeting back in 2004.

Just a few points from my presentation:
- we had an 8-bit database with data in Latin2
- we had a minimum of $LISTBUILD data in those days
- if the amount of data is small enough, use simple export/import:
  
  1) export data to an external file on the 8-bit instance
  2) import data from the external file on the Unicode instance
  
- and let the NLS settings do the work (everything will be done by the appropriate NLS settings for files)

We had databases with plenty of data:
- our approach was to do as many "in-place" conversions as possible
- one can dismount the CACHE.DAT on the original 8-bit instance and mount the same CACHE.DAT on the Unicode instance
- we categorized globals into 3 categories

  0 - completely without Czech characters
  1 - Czech characters only in data
  2 - Czech characters in data and subscripts
  
- group 0 : no action needed
- group 1 : wrote $QUERY loop and data conversion in place
- group 2 : global export/import

- group 0 was pretty big : fine, no action = extremely fast :-)
- group 2 was pretty small : fine
- group 1 was the crucial part
  * we used multiple jobs (carefully coordinated)
  * one has to skip subscripts holding binary data (BLOBs, for example)
  * one has to be careful about repeatability in case of a "crash"
    - converting the same data twice loses characters: $ZCVT($ZCVT(input,"I","Latin2"),"I","Latin2") is wrong (or at least it was in 2004)
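The group 1 pass described above can be sketched as a plain $QUERY loop. This is a minimal sketch under the same assumptions as in the story: subscripts contain no national characters, and there is some application-specific way to recognize binary nodes. `IsBinary` is a hypothetical placeholder, and restartability (not re-converting nodes after a crash) is deliberately left out:

```objectscript
ConvertGlobal(gname) ; hypothetical in-place conversion, e.g. d ConvertGlobal("^MyData")
    new ref
    set ref=gname
    for {
        set ref=$query(@ref)  quit:ref=""
        ; BLOBs, hashes etc. must be skipped - this test is application-specific
        continue:$$IsBinary(ref)
        ; overwrite the node with its Unicode form; running this twice loses
        ; characters, so a restart after a crash must track what was converted
        set @ref=$zconvert(@ref,"I","Latin2")
    }
    quit
```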

Hi everybody,

I worked on a similar task for a Cyrillic (CP-1251 encoded) database last fall. My approach was very close to yours, but I did not see any reason to separate Group 1 from Group 2, so all data was converted on the fly, without global export/import.

Thanks, Alexey. Sure. It depends on whether you have Latin1/Latin2 characters in subscripts or not. If not, then everything is much easier and can be done with a simple $QUERY loop.

In case you have

^GLOBAL("Latin1/Latin2 subscript 1", large subnodes) 

^GLOBAL("Latin1/Latin2 subscript 2", large subnodes) 

then it is not trivial to "rename" the whole ^GLOBAL("Latin1/Latin2 subscript 1") node as ^GLOBAL("UTF subscript 1").

1) there is no easy/cheap way to "rename" a node (kill + merge + kill is not the solution)

2) even if some "rename" solution were implemented, the $QUERY loop would be confused

- the new ^GLOBAL("UTF subscript 1") may collate anywhere in the traversal order

I do remember my discussion with Andreas about that. Andreas claimed to me that a "RENAME" command was part of the Caché product :-) and found out later that he had been confused by some internal ISC discussions on this topic.

Thanks, Pavel. I didn't face the "renaming problem" as I was recoding and transferring global nodes to a new (empty, freshly created) database.
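Copying into a fresh target sidesteps the rename problem because the $QUERY traversal runs over the source, whose order never changes, while converted subscripts simply land wherever they collate in the target. A sketch of that idea; both global names are passed in (the target may be an extended reference into another database), and the Latin2 encoding is an assumption:

```objectscript
CopyConverted(src,tgt) ; hypothetical: e.g. d CopyConverted("^GLOBAL","^|""TARGET""|GLOBAL")
    new ref,node,i
    set ref=src
    for {
        set ref=$query(@ref)  quit:ref=""
        ; rebuild the node reference in the target, converting every subscript
        set node=tgt
        for i=1:1:$qlength(ref) {
            set node=$name(@node@($zconvert($qsubscript(ref,i),"I","Latin2")))
        }
        set @node=$zconvert(@ref,"I","Latin2")
    }
    quit
```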

Currently the data in subscripts does not need any translation. Once you move your database to Unicode, the data in subscripts shows up as Unicode data.

Good day. This fall I had to solve this problem. For the conversion, I wrote the ZDev tool (https://github.com/MyasnikovIA/ZDev). Maybe someone will find it useful. The whole project is in %ZDev.xml (the project includes an example of copying a global, %ZDev.Demo.demo5CopyGlobal). Briefly, the principle of operation: a socket server is started on the remote database; the target client connects, executes the ZW ^MyGlobal.tab command, then reads the result and restores it locally, changing the encoding with $ZCVT(...). The same could be done through export to a file, but that mechanism was unacceptable for me, since the database this tool was written for exceeds two terabytes.

Ivan, 

I'm just curious: how long did it take to transfer 2 TB?

IMHO, the main problem in such endeavors is not how to recode the database, as that's always possible somehow, but how to avoid a long interruption of a 24x7 workflow. In my case I'm developing a tool which scans journal files and remotely applies the changes to those globals that have already been (or are being) transferred to the Unicode database. Async mirroring with dejournal filtering could be used instead, although I don't want to use it for several reasons.

For me, this is also a problem, and it remains unsolved. Within 14 hours, about 80 gigabytes were transferred in a single process. To smooth the conversion process, the tables were divided into important and unimportant ones. After that, all the small tables were transferred in 20 threads, and the large tables were transferred by iterating with $Order(^XXX.tab(ind),-1). Again, this is a purely individual solution, and it does not solve the problem of continuous operation. If there are ready-made solutions, I will be very happy.

For all non-Latin1 sites, this is highly individual, and if performed on a long-running system that cannot afford a (long) interruption, a true challenge. One of our customers (a hospital) did this conversion earlier this year; it took several months to prepare everything and fine-tune the data migration (they used the shadow mechanism to offload data to a Unicode server after the initial conversion, while running the two instances in parallel for testing and validation purposes). Once everything looked OK and the shadow server was in sync with the data server, they switched to the shadow (Unicode) machine.

I transfer large globals in several concurrent jobs; the average speed is ~8 MB/s. As to journals, their transfer is several times slower, though I try to compensate using the "economical" approach (with filtering) mentioned above.

My toolkit is still under construction. At the moment it's able to recode a locally or remotely mounted 8-bit database to a Unicode one. All activity occurs on the Unicode system.


This approach was feasible because the only global collation we used in the 8-bit database was Cache Standard. Had we used something else (e.g. Cyrillic2), a more complex solution would have been needed, something like Pavel and Ivan briefly described.

Yes, I would like to list it on Open Exchange. What must I do for that?

IMPORTANT NOTE:

All this discussion applies only to upgrades/conversions of 8-bit instances which use an NLS locale based on a charset other than Latin1 (ISO 8859-1).

Upgrading from 8-bit Caché to Unicode is trivial if the 8-bit instance is using an NLS locale based on the Latin1 charset. In that case the upgrade is just installing the Unicode version over the 8-bit instance. This is because the first 256 characters of Unicode are the same as the Latin1 charset, so no conversion is needed.

Of course, the customer should check that all their interfaces support Unicode and, if needed, configure them properly (e.g. use the correct driver for a Unix ODBC client).

There are a few questions to answer first:

- what is the 8-bit language encoding you use (does your 8-bit encoding of the "important" characters differ from the first 256 positions of Unicode?)

- are the affected characters used only in the data, or are they also hardcoded within routines/methods/class definitions/...?

- are there $LISTBUILD structures used (if you use persistent classes, then the answer is "yes")?

- is it certain that $LB structures were not used in subscripts (oh, a silly idea, isn't it? :-)

- is binary data strictly separated from textual data (I mean not only pictures etc. but also various hashes, signatures etc.)?

- is it a production system?

- how big is the database, what is the time window to provide the conversion, how fast is your hardware?

- plus a bunch of minor or follow-up questions...

The answers and their combinations heavily affect the conversion procedure. For example, if your answer to the first question is "no problem, the character codes do not differ", then no conversion is needed at all.
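Several of these questions can be answered mechanically by scanning the data before planning anything. A rough audit sketch, to be run on the 8-bit instance; the trick is that $ZCVT to UTF8 leaves pure-ASCII strings unchanged, so any difference reveals national characters, while $LISTVALID flags probable $LB values (with occasional false positives). `Audit` and its group numbers follow the classification above and are otherwise hypothetical:

```objectscript
Audit(gname) ; hypothetical helper: classify a global before conversion
    new ref,val,i,cat,haslb
    set ref=gname,cat=0,haslb=0
    for {
        set ref=$query(@ref)  quit:ref=""
        set val=@ref
        ; a plain string can accidentally look like a valid list, so treat this as a hint
        set:$listvalid(val) haslb=1
        ; non-ASCII data puts the global at least into group 1
        set:($zconvert(val,"O","UTF8")'=val)&&(cat<1) cat=1
        ; non-ASCII subscripts force group 2
        for i=1:1:$qlength(ref) {
            set:$zconvert($qsubscript(ref,i),"O","UTF8")'=$qsubscript(ref,i) cat=2
        }
    }
    write gname," group: ",cat,"  $LB suspected: ",haslb,!
    quit
```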

If, on the other hand, you face a transfer of xTB of data on a huge production system with hundreds of concurrent users and an SLA above four nines, consider setting aside some budget and hiring a Maatrix expert (me).