Multi Language Sort

I'm facing a specific sort problem.
There are several thousands of articles sold all over. 
Users expect to get a description in local language sorted by their specific collation.

I have simplified the original class to the core fields for this case.

Class DC.Article Extends %Persistent [ Final ]
{
Property ArtNr As %Integer(MINVAL = 0) [ Required ];
/// default Description
Property 
Description As %String [ Required ];
/// German

Property DescDE As %String;
/// Italian
Property DescIT As %String;
/// English

Property DescEN As %String;
/// French
Property DescFR As %String;
/// Czech
Property DescCZ As %String;
/// Slovakian
Property DescSL As %String;
/// Hungarian
Property DescHU As %String;
/// Bulgarian
Property DescBG As %String;
/// Romanian
Property DescRO As %String;
/// Slovenian
Property DescSI As %String;

Index Idx On ArtNr [ IdKey ];
Index IdxDE On DescDE;
Index IdxHU On DescHU;
Index IdxDE On DescEN;
. . .
}

The whole construct serves as multi language lookup table.
It works based on the defined Indices [here only 3 for this example]

The problem to be solved is to have the correct language-specific collation
Some languages overlap nicely (eg. EN + DE) but others (esp. HU)  don't match
or are totally separated (e.g. BG in Cyrillic notation)
just see examplehttps://en.wikipedia.org/wiki/Hungarian_alphabet

There is actually some hand-coded workaround in place to enforce collation but performance is not thrilling.

I hope you have some brilliant suggestions to get better solutions.

  • + 5
  • 91
  • 2
  • 2

Answers

I will publish the actual workaround here later when there are other solutions or proposals
in order not to influence your creativity.
########################################################################

OK, There wasn't much echo so far.

The workaround was to encode the Description (and limit it for indexing) according to the language used.
The encoding is basically the position in the list of allowed/used characters or character groups.
If we use also numbers and interpunctuation it has to go into our list.

Instead of %String the descriptions are stored in a serial class:

Class DC.ArticleSort Extends (%SerialObject)
{
Property Language As %String [ InitialExpression = ];
Property Description As %String(MAXLEN = "");
Property Sort As %String(COLLATION = "EXACT", MAXLEN = 300) [  SqlComputed,
       ,
SqlComputeCode = { Set {*} = ##class(DC.ArticleSort).Encode({Language},{Description}) }
      ,
SqlComputeOnChange = (Description, Language) ];

ClassMethod Encode( Lang As %String, Desc As %String(MAXLEN="")) As %String(MAXLEN=300)
{ ...replace valid characters by binary postion value ..  }
}

It's obvious that for English and your installed NLS no encoding is required.
For triplets or doublets like in Hungarian, you need several runs  to encode

The Query itself has a minor change

SELECT TOP 10 ArtNr,DescHU_Description FROM DC.Article
      WHERE DescHU_Description %STARTSWITH 'BE'
      ORDER BY DescHU_Sort

The encoding is definitely not fast but change frequency is not an issue here and change happens
by typing in descriptions on the keyboard.

Query performance is supported by this index construct:

     Index IdxHU On DescHU.Sort [ Data = (ArtNr, DescHU.Description) ]; 

So the query plan just iterates over this index.
 

What's the purpose of a normal index over a text field? I assume user searches by a substring. There are two and a half ways to perform a fulltext search:

  1. Define a property of %Text type and use language class. New language classes an be added fairly easily. Documentation. Removed in InterSystems IRIS and therefore not recommended.
  2. Use iFind indices. They  are highly customisable.
  3. Use iFind indices creatively. You can build your own indices based on iFind indices and you can write your own tokenizers, etc. Also it's possible to define your own custom index. Community article on that.

#1) thanks for the Reminder on  IRIS

#1) + #2) concentrate on finding. but that's not the key issues.

As an example: The problem becomes visible when you run   

SELECT TOP 10 ArtNr,DescDE FROM Dc.Article WHERE DescDE %Startswith 'BE' ORDER by DescDE 


The problem is to get control of ORDER which depends on global collation.

Example for German:
- With Standard collation, you sort A,B,C..,O,...,U,...Z,Ä,Ö,Ü...   (classic ANSI sort)
- With collation German3, you sort A,Ä,B,C,....O,Ö,....U,Ü,....Z 

It is mostly the handling of characters with diacritical signs.

Hungarian is my worst case with much more diacriticals AND groups of characters in total  44 "character" tokens

A Á B C Cs D Dz Dzs E É F G Gy H I Í J K L Ly M N Ny O Ó Ö Ő P Q R S Sz T Ty U Ú Ü Ű V W X Y Z Zs

If you do it correctly then Cx sorts before Cs and Gz should sort before Gy and so on.

Now for ORDER BY you typically get your default collation derived from NLS settings.

#3) custom index seems to offer the most promising features.