Multi Language Sort
I'm facing a specific sort problem.
There are several thousands of articles sold all over.
Users expect to get a description in local language sorted by their specific collation.
I have simplified the original class to the core fields for this case.
{
Property ArtNr As %Integer(MINVAL = 0) [ Required ];
/// default Description
Property Description As %String [ Required ];
/// German
Property DescDE As %String;
/// Italian
Property DescIT As %String;
/// English
Property DescEN As %String;
/// French
Property DescFR As %String;
/// Czech
Property DescCZ As %String;
/// Slovakian
Property DescSL As %String;
/// Hungarian
Property DescHU As %String;
/// Bulgarian
Property DescBG As %String;
/// Romanian
Property DescRO As %String;
/// Slovenian
Property DescSI As %String;
Index Idx On ArtNr [ IdKey ];
Index IdxDE On DescDE;
Index IdxHU On DescHU;
Index IdxDE On DescEN;
. . .
}
The whole construct serves as multi language lookup table.
It works based on the defined Indices [here only 3 for this example]
The problem to be solved is to have the correct language-specific collation
Some languages overlap nicely (eg. EN + DE) but others (esp. HU) don't match
or are totally separated (e.g. BG in Cyrillic notation)
just see example: https://en.wikipedia.org/wiki/Hungarian_alphabet
There is actually some hand-coded workaround in place to enforce collation but performance is not thrilling.
I hope you have some brilliant suggestions to get better solutions.
I will publish the actual workaround here later when there are other solutions or proposals
in order not to influence your creativity.
########################################################################
OK, There wasn't much echo so far.
The workaround was to encode the Description (and limit it for indexing) according to the language used.
The encoding is basically the position in the list of allowed/used characters or character groups.
If we use also numbers and interpunctuation it has to go into our list.
Instead of %String the descriptions are stored in a serial class:
{
Property Language As %String [ InitialExpression = 0 ];
Property Description As %String(MAXLEN = "");
Property Sort As %String(COLLATION = "EXACT", MAXLEN = 300) [ SqlComputed,
,SqlComputeCode = { Set {*} = ##class(DC.ArticleSort).Encode({Language},{Description}) }
, SqlComputeOnChange = (Description, Language) ];
ClassMethod Encode( Lang As %String, Desc As %String(MAXLEN="")) As %String(MAXLEN=300)
{ ...replace valid characters by binary postion value .. }
}
It's obvious that for English and your installed NLS no encoding is required.
For triplets or doublets like in Hungarian, you need several runs to encode
The Query itself has a minor change
WHERE DescHU_Description %STARTSWITH 'BE'
ORDER BY DescHU_Sort
The encoding is definitely not fast but change frequency is not an issue here and change happens
by typing in descriptions on the keyboard.
Query performance is supported by this index construct:
Index IdxHU On DescHU.Sort [ Data = (ArtNr, DescHU.Description) ];
So the query plan just iterates over this index.
#1) thanks for the Reminder on IRIS
#1) + #2) concentrate on finding. but that's not the key issues.
As an example: The problem becomes visible when you run
The problem is to get control of ORDER which depends on global collation.
Example for German:
- With Standard collation, you sort A,B,C..,O,...,U,...Z,Ä,Ö,Ü... (classic ANSI sort)
- With collation German3, you sort A,Ä,B,C,....O,Ö,....U,Ü,....Z
It is mostly the handling of characters with diacritical signs.
Hungarian is my worst case with much more diacriticals AND groups of characters in total 44 "character" tokens
A Á B C Cs D Dz Dzs E É F G Gy H I Í J K L Ly M N Ny O Ó Ö Ő P Q R S Sz T Ty U Ú Ü Ű V W X Y Z Zs
If you do it correctly then Cx sorts before Cs and Gz should sort before Gy and so on.
Now for ORDER BY you typically get your default collation derived from NLS settings.
#3) custom index seems to offer the most promising features.
Vitaliy,
That's excellent. I had the vague feeling that it should be somewhere, somehow, ... ass these are all supported collations.
Now I have the correct solution!
Many thanks !
What's the purpose of a normal index over a text field? I assume user searches by a substring. There are two and a half ways to perform a fulltext search:
Useful links:
This issue dares relatively simply:
Check their list and status:
Here are two small examples:
Result:
Result:
Important: if you remove "As EXACT" in the index, the examples will not work correctly for the German language, because: Capital ẞ