Multi Language Sort

I'm facing a specific sort problem.
There are several thousands of articles sold all over. 
Users expect to get a description in local language sorted by their specific collation.

I have simplified the original class to the core fields for this case.

Class DC.Article Extends %Persistent [ Final ]
{
Property ArtNr As %Integer(MINVAL = 0) [ Required ];
/// default Description
Property 
Description As %String [ Required ];
/// German

Property DescDE As %String;
/// Italian
Property DescIT As %String;
/// English

Property DescEN As %String;
/// French
Property DescFR As %String;
/// Czech
Property DescCZ As %String;
/// Slovakian
Property DescSL As %String;
/// Hungarian
Property DescHU As %String;
/// Bulgarian
Property DescBG As %String;
/// Romanian
Property DescRO As %String;
/// Slovenian
Property DescSI As %String;

Index Idx On ArtNr [ IdKey ];
Index IdxDE On DescDE;
Index IdxHU On DescHU;
Index IdxDE On DescEN;
. . .
}

The whole construct serves as multi language lookup table.
It works based on the defined Indices [here only 3 for this example]

The problem to be solved is to have the correct language-specific collation
Some languages overlap nicely (eg. EN + DE) but others (esp. HU)  don't match
or are totally separated (e.g. BG in Cyrillic notation)
just see examplehttps://en.wikipedia.org/wiki/Hungarian_alphabet

There is actually some hand-coded workaround in place to enforce collation but performance is not thrilling.

I hope you have some brilliant suggestions to get better solutions.

  • + 5
  • 0
  • 209
  • 3
  • 3

Answers

Useful links:

  1. SQL and NLS Collations
  2. Collations in Caché

This issue dares relatively simply:

  • Create the required collation tables. Fortunately, for some languages, they are already included and can be taken as a basis. In complex cases, you can contact to WRC.
    Check their list and status:

    USER>zn "%SYS"
     
    %SYS>^COLLATE
     
    Status       Number   Abbrev   Name
    ----------   ------   ------   ----------------------
    Built-in        0     OANS     ISM Pre-6.2
    Built-in        1     ANSI     ISM 6.2->6.4
    Built-in        2     COBR     Ipsum/Cobra
    Built-in        3     DTMC     DTM-compatible
    Built-in        4     CBR2     Ipsum/Cobra-2
    Built-in        5     UNIC     Cache standard
    Not loaded     10     GER1     German1
    Not loaded     11     POR1     Portuguese1
    Not loaded     12     POL1     Polish1
    Not loaded     13     GER2     German2
    Not loaded     14     SPA1     Spanish1
    Not loaded     15     DAN1     Danish1
    Available      16     CYR1     Cyrillic1
    Not loaded     17     GRE1     Greek1
    Not loaded     18     CZE1     Czech1
    Not loaded     19     CZE2     Czech2
    Not loaded     20     POR2     Portuguese2
    Not loaded     21     FIN1     Finnish1
    Not loaded     22     JAP1     Japanese1
    Not loaded     24     POL2     Polish2
    Not loaded     27     FRE1     French1
    Not loaded     28     FIN2     Finnish2
    Available      29     HUN1     Hungarian1
    Available      30     GER3     German3
    Not loaded     31     POL3     Polish3
    Not loaded     32     SPA2     Spanish2
    Not loaded     33     DAN2     Danish2
    Not loaded     34     GRE2     Greek2
    Not loaded     35     FIN3     Finnish3
    Not loaded     36     LIT1     Lithuanian1
    Available      37     CYR3     Cyrillic3
    Not loaded     38     SLO1     Slovenian1
    Not loaded     39     SLO2     Slovenian2
    Not loaded     40     TUR1     Turkish1
    Not loaded     41     DAN3     Danish3
    Available      42     UKR1     Ukrainian1
    Available      43     CYR4     Cyrillic4
    Not loaded     44     CZE3     Czech3
    Not loaded     46     MAL1     Maltese1
    Not loaded     48     MAL2     Maltese2
    Not loaded     49     SPA4     Spanish4
    Not loaded     50     SLO1     Slovak1
    Not loaded     51     SPA5     Spanish5
    Not loaded     52     FIN4     Finnish4
    Not loaded     53     CZE4     Czech4
    Built-in      128     OSTR     ISM Pre-6.2 string
    Built-in      129     NSTR     ISM 6.2->6.4 string
    Built-in      133     USTR     Cache standard string
    Available     144     CYR1S    Cyrillic1 string
    Available     157     HUN1S    Hungarian1 string
    Available     158     GER3S    German3 string
    Available     165     CYR3S    Cyrillic3 string
    Available     170     UKR1S    Ukrainian1 string
    Available     171     CYR4S    Cyrillic4 string
    
  • load the new collation tables to your current locale
  • for Storage make mapping your indexes in separate globals with own collation
  • Profit!

Here are two small examples:

  1. Class dc.test1 Extends %Persistent
    {
    
    Index iDE On DescDE As EXACT;
    
    Index iHU On DescHU As EXACT;
    
    Index iRU On DescRU As EXACT;
    
    Property DescDE As %String;
    
    Property DescHU As %String;
    
    Property DescRU As %String;
    
    ClassMethod LoadColTables() [ Private ]
    {
      r=$$$NO
      
      try {
        $$$AddAllRoleTemporaryInTry
        n $namespace
      
        s $namespace="%SYS"
        
        $$$ThrowOnError(##class(Config.NLS.Tables).Load("deuw","COL","German3"))
        $$$ThrowOnError(##class(Config.NLS.Tables).Load("hunw","COL","Hungarian1"))
        $$$ThrowOnError(##class(Config.NLS.Tables).Load("rusw","COL","Cyrillic4"))
        
        ;d ^COLLATE ; for check only
    
        r=$$$YES
      }catch(ex{
        ex.DisplayString(),!
      }
    
      r
    }
    
    /// d ##class(dc.test1).Test()
    ClassMethod Test()
    {
      q:'..LoadColTables()
      
      ..%KillExtent()
      
      &sql(insert into dc.test1(DescDE,DescHU,DescRU)
      select 'Straßer','csak','Прохор' union all
      select 'Straster','comb','Павел' union all
      select 'Strasser','cukor','Пётр')
      
      zw ^dc.test1D
      !,"---- Before (for all indexes by default = 'Cache standard') --------",!!
      print
      
      !,"---- After (each index has its own collation) --------",!!
      ..%PurgeIndices()
      
      d $system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test1.iDE",##class(%Collate).DisplayToLogical("German3")))
       ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test1.iHU",##class(%Collate).DisplayToLogical("Hungarian1")))
       ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test1.iRU",##class(%Collate).DisplayToLogical("Cyrillic4")))
       ,..%BuildIndices(,$$$NO,$$$YES,$$$NO)
       ,$system.SQL.TuneTable($classname(),$$$YES)
       ,$system.OBJ.Compile($classname(),"cu-d")
       ,print
    
      q
    
    print  
      zw ^dc.test1.iDE,^dc.test1.iHU,^dc.test1.iRU
      !
      lng="DE","HU","RU" ##class(%SQL.Statement).%ExecDirect(,$$$FormatText("select %ID,Desc%1 from dc.test1 order by %exact(Desc%1)",lng)).%Display() !!
      q
    }
    
    Storage Default
    {
    <Data name="test1DefaultData">
    <Value name="1">
    <Value>%%CLASSNAME</Value>
    </Value>
    <Value name="2">
    <Value>DescDE</Value>
    </Value>
    <Value name="3">
    <Value>DescHU</Value>
    </Value>
    <Value name="4">
    <Value>DescRU</Value>
    </Value>
    </Data>
    <DataLocation>^dc.test1D</DataLocation>
    <DefaultData>test1DefaultData</DefaultData>
    <IdLocation>^dc.test1D</IdLocation>
    <Index name="iDE">
    <Location>^dc.test1.iDE</Location>
    </Index>
    <Index name="iHU">
    <Location>^dc.test1.iHU</Location>
    </Index>
    <Index name="iRU">
    <Location>^dc.test1.iRU</Location>
    </Index>
    <IndexLocation>^dc.test1I</IndexLocation>
    <StreamLocation>^dc.test1S</StreamLocation>
    <Type>%Library.CacheStorage</Type>
    }
    
    }

    Result:

    USER>##class(dc.test1).Test()
    ^dc.test1D=3
    ^dc.test1D(1)=$lb("","Straßer","csak","Прохор")
    ^dc.test1D(2)=$lb("","Straster","comb","Павел")
    ^dc.test1D(3)=$lb("","Strasser","cukor","Пётр")
     
    ---- Before (for all indexes by default = 'Cache standard') --------
     
    ^dc.test1.iDE("Strasser",3)=""
    ^dc.test1.iDE("Straster",2)=""
    ^dc.test1.iDE("Straßer",1)=""
    ^dc.test1.iHU("comb",2)=""
    ^dc.test1.iHU("csak",1)=""
    ^dc.test1.iHU("cukor",3)=""
    ^dc.test1.iRU("Павел",2)=""
    ^dc.test1.iRU("Прохор",1)=""
    ^dc.test1.iRU("Пётр",3)=""
     
    ID      DescDE
    3       Strasser
    2       Straster
    1       Straßer
     
    3 Rows(s) Affected
     
    ID      DescHU
    2       comb
    1       csak
    3       cukor
     
    3 Rows(s) Affected
     
    ID      DescRU
    2       Павел
    1       Прохор
    3       Пётр
     
    3 Rows(s) Affected
     
     
    ---- After (each index has its own collation) --------
     
    ^dc.test1.iDE("Strasser",3)=""
    ^dc.test1.iDE("Straßer",1)=""
    ^dc.test1.iDE("Straster",2)=""
    ^dc.test1.iHU("comb",2)=""
    ^dc.test1.iHU("cukor",3)=""
    ^dc.test1.iHU("csak",1)=""
    ^dc.test1.iRU("Павел",2)=""
    ^dc.test1.iRU("Пётр",3)=""
    ^dc.test1.iRU("Прохор",1)=""
     
    ID      DescDE
    3       Strasser
    1       Straßer
    2       Straster
     
    3 Rows(s) Affected
     
    ID      DescHU
    2       comb
    3       cukor
    1       csak
     
    3 Rows(s) Affected
     
    ID      DescRU
    2       Павел
    3       Пётр
    1       Прохор
     
    3 Rows(s) Affected
  2. Class dc.test2 Extends %Persistent
    {
    
    Index iDE On (Lng, Descr As EXACT);
    
    Index iHU On (Lng, Descr As EXACT);
    
    Index iRU On (Lng, Descr As EXACT);
    
    Property Lng As %String;
    
    Property Descr As %String;
    
    /// d ##class(dc.test2).Test()
    ClassMethod Test()
    {
      ..%KillExtent()
       ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test2.iDE",##class(%Collate).DisplayToLogical("German3")))
       ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test2.iHU",##class(%Collate).DisplayToLogical("Hungarian1")))
       ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test2.iRU",##class(%Collate).DisplayToLogical("Cyrillic4")))
    
      &sql(insert into dc.test2(Lng,Descr)
      select 'DE''Straßer' union
      select 'DE''Straster' union
      select 'DE''Strasser' union
      select 'HU''csak' union
      select 'HU''comb' union
      select 'HU''cukor' union
      select 'RU''Прохор' union
      select 'RU''Павел' union
      select 'RU''Пётр')
      
      zw ^dc.test2D,^dc.test2.iDE,^dc.test2.iHU,^dc.test2.iRU
      
      d $system.SQL.TuneTable($classname(),$$$YES)
       ,$system.OBJ.Compile($classname(),"cu-d")
       ,print
    
      q
    
    print  
      !
      lng="DE","HU","RU" {
        idx=$replace("iDE,iHU,iRU","i"_lng,"_i"_lng)
        ##class(%SQL.Statement).%ExecDirect(,$$$FormatText("select * from %IGNOREINDEX %1 dc.test2 where Lng=? order by %exact(Descr)",idx),lng).%Display()
        !!
      }
      q
    }
    
    Storage Default
    {
    <Data name="test2DefaultData">
    <Value name="1">
    <Value>%%CLASSNAME</Value>
    </Value>
    <Value name="2">
    <Value>Lng</Value>
    </Value>
    <Value name="3">
    <Value>Descr</Value>
    </Value>
    </Data>
    <DataLocation>^dc.test2D</DataLocation>
    <DefaultData>test2DefaultData</DefaultData>
    <IdLocation>^dc.test2D</IdLocation>
    <Index name="iDE">
    <Location>^dc.test2.iDE</Location>
    </Index>
    <Index name="iHU">
    <Location>^dc.test2.iHU</Location>
    </Index>
    <Index name="iRU">
    <Location>^dc.test2.iRU</Location>
    </Index>
    <IndexLocation>^dc.test2I</IndexLocation>
    <StreamLocation>^dc.test2S</StreamLocation>
    <Type>%Library.CacheStorage</Type>
    }
    
    }

    Result:

    USER>##class(dc.test2).Test()
    ^dc.test2D=9
    ^dc.test2D(1)=$lb("","DE","Straßer")
    ^dc.test2D(2)=$lb("","DE","Straster")
    ^dc.test2D(3)=$lb("","DE","Strasser")
    ^dc.test2D(4)=$lb("","HU","csak")
    ^dc.test2D(5)=$lb("","HU","comb")
    ^dc.test2D(6)=$lb("","HU","cukor")
    ^dc.test2D(7)=$lb("","RU","Прохор")
    ^dc.test2D(8)=$lb("","RU","Павел")
    ^dc.test2D(9)=$lb("","RU","Пётр")
    ^dc.test2.iDE(" DE","Strasser",3)=""
    ^dc.test2.iDE(" DE","Straßer",1)=""
    ^dc.test2.iDE(" DE","Straster",2)=""
    ^dc.test2.iDE(" HU","comb",5)=""
    ^dc.test2.iDE(" HU","csak",4)=""
    ^dc.test2.iDE(" HU","cukor",6)=""
    ^dc.test2.iDE(" RU","Павел",8)=""
    ^dc.test2.iDE(" RU","Прохор",7)=""
    ^dc.test2.iDE(" RU","Пётр",9)=""
    ^dc.test2.iHU(" DE","Strasser",3)=""
    ^dc.test2.iHU(" DE","Straster",2)=""
    ^dc.test2.iHU(" DE","Straßer",1)=""
    ^dc.test2.iHU(" HU","comb",5)=""
    ^dc.test2.iHU(" HU","cukor",6)=""
    ^dc.test2.iHU(" HU","csak",4)=""
    ^dc.test2.iHU(" RU","Павел",8)=""
    ^dc.test2.iHU(" RU","Прохор",7)=""
    ^dc.test2.iHU(" RU","Пётр",9)=""
    ^dc.test2.iRU(" DE","Strasser",3)=""
    ^dc.test2.iRU(" DE","Straster",2)=""
    ^dc.test2.iRU(" DE","Straßer",1)=""
    ^dc.test2.iRU(" HU","comb",5)=""
    ^dc.test2.iRU(" HU","csak",4)=""
    ^dc.test2.iRU(" HU","cukor",6)=""
    ^dc.test2.iRU(" RU","Павел",8)=""
    ^dc.test2.iRU(" RU","Пётр",9)=""
    ^dc.test2.iRU(" RU","Прохор",7)=""
     
    ID      Descr   Lng
    3       Strasser        DE
    1       Straßer DE
    2       Straster        DE
     
    3 Rows(s) Affected
     
    ID      Descr   Lng
    5       comb    HU
    6       cukor   HU
    4       csak    HU
     
    3 Rows(s) Affected
     
    ID      Descr   Lng
    8       Павел   RU
    9       Пётр    RU
    7       Прохор  RU
     
    3 Rows(s) Affected
    

Important: if you remove "As EXACT" in the index, the examples will not work correctly for the German language, because: Capital ẞ

Vitaliy,

That's excellent. I had the vague feeling that it should be somewhere, somehow, ... ass these are all supported collations.

Now I have the correct solution!

Many thanks !

yesyesyesyesyes

I will publish the actual workaround here later when there are other solutions or proposals
in order not to influence your creativity.
########################################################################

OK, There wasn't much echo so far.

The workaround was to encode the Description (and limit it for indexing) according to the language used.
The encoding is basically the position in the list of allowed/used characters or character groups.
If we use also numbers and interpunctuation it has to go into our list.

Instead of %String the descriptions are stored in a serial class:

Class DC.ArticleSort Extends (%SerialObject)
{
Property Language As %String [ InitialExpression = ];
Property Description As %String(MAXLEN = "");
Property Sort As %String(COLLATION = "EXACT", MAXLEN = 300) [  SqlComputed,
       ,
SqlComputeCode = { Set {*} = ##class(DC.ArticleSort).Encode({Language},{Description}) }
      ,
SqlComputeOnChange = (Description, Language) ];

ClassMethod Encode( Lang As %String, Desc As %String(MAXLEN="")) As %String(MAXLEN=300)
{ ...replace valid characters by binary postion value ..  }
}

It's obvious that for English and your installed NLS no encoding is required.
For triplets or doublets like in Hungarian, you need several runs  to encode

The Query itself has a minor change

SELECT TOP 10 ArtNr,DescHU_Description FROM DC.Article
      WHERE DescHU_Description %STARTSWITH 'BE'
      ORDER BY DescHU_Sort

The encoding is definitely not fast but change frequency is not an issue here and change happens
by typing in descriptions on the keyboard.

Query performance is supported by this index construct:

     Index IdxHU On DescHU.Sort [ Data = (ArtNr, DescHU.Description) ]; 

So the query plan just iterates over this index.
 

What's the purpose of a normal index over a text field? I assume user searches by a substring. There are two and a half ways to perform a fulltext search:

  1. Define a property of %Text type and use language class. New language classes an be added fairly easily. Documentation. Removed in InterSystems IRIS and therefore not recommended.
  2. Use iFind indices. They  are highly customisable.
  3. Use iFind indices creatively. You can build your own indices based on iFind indices and you can write your own tokenizers, etc. Also it's possible to define your own custom index. Community article on that.

#1) thanks for the Reminder on  IRIS

#1) + #2) concentrate on finding. but that's not the key issues.

As an example: The problem becomes visible when you run   

SELECT TOP 10 ArtNr,DescDE FROM Dc.Article WHERE DescDE %Startswith 'BE' ORDER by DescDE 


The problem is to get control of ORDER which depends on global collation.

Example for German:
- With Standard collation, you sort A,B,C..,O,...,U,...Z,Ä,Ö,Ü...   (classic ANSI sort)
- With collation German3, you sort A,Ä,B,C,....O,Ö,....U,Ü,....Z 

It is mostly the handling of characters with diacritical signs.

Hungarian is my worst case with much more diacriticals AND groups of characters in total  44 "character" tokens

A Á B C Cs D Dz Dzs E É F G Gy H I Í J K L Ly M N Ny O Ó Ö Ő P Q R S Sz T Ty U Ú Ü Ű V W X Y Z Zs

If you do it correctly then Cx sorts before Cs and Gz should sort before Gy and so on.

Now for ORDER BY you typically get your default collation derived from NLS settings.

#3) custom index seems to offer the most promising features.