Multi Language Sort

Question

Question

Robert Cemper · Jun 10, 2018

#Indexing #SQL #Caché #InterSystems IRIS

I'm facing a specific sort problem.
There are several thousands of articles sold all over.
Users expect to get a description in local language sorted by their specific collation.

I have simplified the original class to the core fields for this case.

Class DC.Article Extends %Persistent [ Final ] { Property ArtNr As %Integer(MINVAL = 0) [ Required ]; /// default Description Property Description As %String [ Required ]; /// German Property DescDE As %String; /// Italian Property DescIT As %String; /// English Property DescEN As %String; /// French Property DescFR As %String; /// Czech Property DescCZ As %String; /// Slovakian Property DescSL As %String; /// Hungarian Property DescHU As %String; /// Bulgarian Property DescBG As %String; /// Romanian Property DescRO As %String; /// Slovenian Property DescSI As %String; Index Idx On ArtNr [ IdKey ]; Index IdxDE On DescDE; Index IdxHU On DescHU; Index IdxDE On DescEN; . . . }

The whole construct serves as multi language lookup table.
It works based on the defined Indices [here only 3 for this example]

The problem to be solved is to have the correct language-specific collation
Some languages overlap nicely (eg. EN + DE) but others (esp. HU) don't match
or are totally separated (e.g. BG in Cyrillic notation)
just see example: https://en.wikipedia.org/wiki/Hungarian_alphabet

There is actually some hand-coded workaround in place to enforce collation but performance is not thrilling.

I hope you have some brilliant suggestions to get better solutions.

Discussion (6)0

Log in or sign up to continue

Robert Cemper · Jun 14, 2018

OK, There wasn't much echo so far.

The workaround was to encode the Description (and limit it for indexing) according to the language used.
The encoding is basically the position in the list of allowed/used characters or character groups.
If we use also numbers and interpunctuation it has to go into our list.

Instead of %String the descriptions are stored in a serial class:

Class DC.ArticleSort Extends (%SerialObject) { Property Language As %String [ InitialExpression = 0 ]; Property Description As %String(MAXLEN = ""); Property Sort As %String(COLLATION = "EXACT", MAXLEN = 300) [ SqlComputed, ,SqlComputeCode = { Set {*} = ##class(DC.ArticleSort).Encode({Language},{Description}) } , SqlComputeOnChange = (Description, Language) ]; ClassMethod Encode( Lang As %String, Desc As %String(MAXLEN="")) As %String(MAXLEN=300) { ...replace valid characters by binary postion value .. } }

It's obvious that for English and your installed NLS no encoding is required.
For triplets or doublets like in Hungarian, you need several runs to encode

The Query itself has a minor change

SELECT TOP 10 ArtNr,DescHU_Description FROM DC.Article WHERE DescHU_Description %STARTSWITH 'BE' ORDER BY DescHU_Sort

The encoding is definitely not fast but change frequency is not an issue here and change happens
by typing in descriptions on the keyboard.

Query performance is supported by this index construct:

Index IdxHU On DescHU.Sort [ Data = (ArtNr, DescHU.Description) ];

So the query plan just iterates over this index.

0 0

Robert Cemper · Jun 11, 2018

#1) thanks for the Reminder on IRIS

#1) + #2) concentrate on finding. but that's not the key issues.

As an example: The problem becomes visible when you run

SELECT TOP 10 ArtNr,DescDE FROM Dc.Article WHERE DescDE %Startswith 'BE' ORDER by DescDE

The problem is to get control of ORDER which depends on global collation.

Example for German:
- With Standard collation, you sort A,B,C..,O,...,U,...Z,Ä,Ö,Ü... (classic ANSI sort)
- With collation German3, you sort A,Ä,B,C,....O,Ö,....U,Ü,....Z

It is mostly the handling of characters with diacritical signs.

Hungarian is my worst case with much more diacriticals AND groups of characters in total 44 "character" tokens

A Á B C Cs D Dz Dzs E É F G Gy H I Í J K L Ly M N Ny O Ó Ö Ő P Q R S Sz T Ty U Ú Ü Ű V W X Y Z Zs

If you do it correctly then Cx sorts before Cs and Gz should sort before Gy and so on.

Now for ORDER BY you typically get your default collation derived from NLS settings.

#3) custom index seems to offer the most promising features.

1 0

Vitaliy Serdtsev · Jun 22, 2018

Useful links:

This issue dares relatively simply:

Create the required collation tables. Fortunately, for some languages, they are already included and can be taken as a basis. In complex cases, you can contact to WRC.
Check their list and status:

USER>zn "%SYS"
 
%SYS>d ^COLLATE
 
Status       Number   Abbrev   Name
----------   ------   ------   ----------------------
Built-in        0     OANS     ISM Pre-6.2
Built-in        1     ANSI     ISM 6.2->6.4
Built-in        2     COBR     Ipsum/Cobra
Built-in        3     DTMC     DTM-compatible
Built-in        4     CBR2     Ipsum/Cobra-2
Built-in        5     UNIC     Cache standard
Not loaded     10     GER1     German1
Not loaded     11     POR1     Portuguese1
Not loaded     12     POL1     Polish1
Not loaded     13     GER2     German2
Not loaded     14     SPA1     Spanish1
Not loaded     15     DAN1     Danish1
Available      16     CYR1     Cyrillic1
Not loaded     17     GRE1     Greek1
Not loaded     18     CZE1     Czech1
Not loaded     19     CZE2     Czech2
Not loaded     20     POR2     Portuguese2
Not loaded     21     FIN1     Finnish1
Not loaded     22     JAP1     Japanese1
Not loaded     24     POL2     Polish2
Not loaded     27     FRE1     French1
Not loaded     28     FIN2     Finnish2
Available      29     HUN1     Hungarian1
Available      30     GER3     German3
Not loaded     31     POL3     Polish3
Not loaded     32     SPA2     Spanish2
Not loaded     33     DAN2     Danish2
Not loaded     34     GRE2     Greek2
Not loaded     35     FIN3     Finnish3
Not loaded     36     LIT1     Lithuanian1
Available      37     CYR3     Cyrillic3
Not loaded     38     SLO1     Slovenian1
Not loaded     39     SLO2     Slovenian2
Not loaded     40     TUR1     Turkish1
Not loaded     41     DAN3     Danish3
Available      42     UKR1     Ukrainian1
Available      43     CYR4     Cyrillic4
Not loaded     44     CZE3     Czech3
Not loaded     46     MAL1     Maltese1
Not loaded     48     MAL2     Maltese2
Not loaded     49     SPA4     Spanish4
Not loaded     50     SLO1     Slovak1
Not loaded     51     SPA5     Spanish5
Not loaded     52     FIN4     Finnish4
Not loaded     53     CZE4     Czech4
Built-in      128     OSTR     ISM Pre-6.2 string
Built-in      129     NSTR     ISM 6.2->6.4 string
Built-in      133     USTR     Cache standard string
Available     144     CYR1S    Cyrillic1 string
Available     157     HUN1S    Hungarian1 string
Available     158     GER3S    German3 string
Available     165     CYR3S    Cyrillic3 string
Available     170     UKR1S    Ukrainian1 string
Available     171     CYR4S    Cyrillic4 string

load the new collation tables to your current locale
for Storage make mapping your indexes in separate globals with own collation
Profit!

Here are two small examples:

Class dc.test1 Extends %Persistent
{

Index iDE On DescDE As EXACT;

Index iHU On DescHU As EXACT;

Index iRU On DescRU As EXACT;

Property DescDE As %String;

Property DescHU As %String;

Property DescRU As %String;

ClassMethod LoadColTables() [ Private ]
{
  s r=$$$NO
  
  try {
    $$$AddAllRoleTemporaryInTry
    n $namespace
  
    s $namespace="%SYS"
    
    $$$ThrowOnError(##class(Config.NLS.Tables).Load("deuw","COL","German3"))
    $$$ThrowOnError(##class(Config.NLS.Tables).Load("hunw","COL","Hungarian1"))
    $$$ThrowOnError(##class(Config.NLS.Tables).Load("rusw","COL","Cyrillic4"))
    
    ;d ^COLLATE ; for check only

    s r=$$$YES
  }catch(ex) {
    w ex.DisplayString(),!
  }

  q r
}

/// d ##class(dc.test1).Test()
ClassMethod Test()
{
  q:'..LoadColTables()
  
  d ..%KillExtent()
  
  &sql(insert into dc.test1(DescDE,DescHU,DescRU)
  select 'Straßer','csak','Прохор' union all
  select 'Straster','comb','Павел' union all
  select 'Strasser','cukor','Пётр')
  
  zw ^dc.test1D
  w !,"---- Before (for all indexes by default = 'Cache standard') --------",!!
  d print
  
  w !,"---- After (each index has its own collation) --------",!!
  d ..%PurgeIndices()
  
  d $system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test1.iDE",##class(%Collate).DisplayToLogical("German3")))
   ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test1.iHU",##class(%Collate).DisplayToLogical("Hungarian1")))
   ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test1.iRU",##class(%Collate).DisplayToLogical("Cyrillic4")))
   ,..%BuildIndices(,$$$NO,$$$YES,$$$NO)
   ,$system.SQL.TuneTable($classname(),$$$YES)
   ,$system.OBJ.Compile($classname(),"cu-d")
   ,print

  q

print  
  zw ^dc.test1.iDE,^dc.test1.iHU,^dc.test1.iRU
  w !
  f lng="DE","HU","RU" d ##class(%SQL.Statement).%ExecDirect(,$$$FormatText("select %ID,Desc%1 from dc.test1 order by %exact(Desc%1)",lng)).%Display() w !!
  q
}

Storage Default
{
<Data name="test1DefaultData">
<Value name="1">
<Value>%%CLASSNAME</Value>
</Value>
<Value name="2">
<Value>DescDE</Value>
</Value>
<Value name="3">
<Value>DescHU</Value>
</Value>
<Value name="4">
<Value>DescRU</Value>
</Value>
</Data>
<DataLocation>^dc.test1D</DataLocation>
<DefaultData>test1DefaultData</DefaultData>
<IdLocation>^dc.test1D</IdLocation>
<Index name="iDE">
<Location>^dc.test1.iDE</Location>
</Index>
<Index name="iHU">
<Location>^dc.test1.iHU</Location>
</Index>
<Index name="iRU">
<Location>^dc.test1.iRU</Location>
</Index>
<IndexLocation>^dc.test1I</IndexLocation>
<StreamLocation>^dc.test1S</StreamLocation>
<Type>%Library.CacheStorage</Type>
}

}

Result:

USER>d ##class(dc.test1).Test()
^dc.test1D=3
^dc.test1D(1)=$lb("","Straßer","csak","Прохор")
^dc.test1D(2)=$lb("","Straster","comb","Павел")
^dc.test1D(3)=$lb("","Strasser","cukor","Пётр")
 
---- Before (for all indexes by default = 'Cache standard') --------
 
^dc.test1.iDE("Strasser",3)=""
^dc.test1.iDE("Straster",2)=""
^dc.test1.iDE("Straßer",1)=""
^dc.test1.iHU("comb",2)=""
^dc.test1.iHU("csak",1)=""
^dc.test1.iHU("cukor",3)=""
^dc.test1.iRU("Павел",2)=""
^dc.test1.iRU("Прохор",1)=""
^dc.test1.iRU("Пётр",3)=""
 
ID      DescDE
3       Strasser
2       Straster
1       Straßer
 
3 Rows(s) Affected
 
ID      DescHU
2       comb
1       csak
3       cukor
 
3 Rows(s) Affected
 
ID      DescRU
2       Павел
1       Прохор
3       Пётр
 
3 Rows(s) Affected
 
 
---- After (each index has its own collation) --------
 
^dc.test1.iDE("Strasser",3)=""
^dc.test1.iDE("Straßer",1)=""
^dc.test1.iDE("Straster",2)=""
^dc.test1.iHU("comb",2)=""
^dc.test1.iHU("cukor",3)=""
^dc.test1.iHU("csak",1)=""
^dc.test1.iRU("Павел",2)=""
^dc.test1.iRU("Пётр",3)=""
^dc.test1.iRU("Прохор",1)=""
 
ID      DescDE
3       Strasser
1       Straßer
2       Straster
 
3 Rows(s) Affected
 
ID      DescHU
2       comb
3       cukor
1       csak
 
3 Rows(s) Affected
 
ID      DescRU
2       Павел
3       Пётр
1       Прохор
 
3 Rows(s) Affected

Class dc.test2 Extends %Persistent
{

Index iDE On (Lng, Descr As EXACT);

Index iHU On (Lng, Descr As EXACT);

Index iRU On (Lng, Descr As EXACT);

Property Lng As %String;

Property Descr As %String;

/// d ##class(dc.test2).Test()
ClassMethod Test()
{
  d ..%KillExtent()
   ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test2.iDE",##class(%Collate).DisplayToLogical("German3")))
   ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test2.iHU",##class(%Collate).DisplayToLogical("Hungarian1")))
   ,$system.OBJ.DisplayError(##class(%GlobalEdit).Create(,"dc.test2.iRU",##class(%Collate).DisplayToLogical("Cyrillic4")))

  &sql(insert into dc.test2(Lng,Descr)
  select 'DE', 'Straßer' union
  select 'DE', 'Straster' union
  select 'DE', 'Strasser' union
  select 'HU', 'csak' union
  select 'HU', 'comb' union
  select 'HU', 'cukor' union
  select 'RU', 'Прохор' union
  select 'RU', 'Павел' union
  select 'RU', 'Пётр')
  
  zw ^dc.test2D,^dc.test2.iDE,^dc.test2.iHU,^dc.test2.iRU
  
  d $system.SQL.TuneTable($classname(),$$$YES)
   ,$system.OBJ.Compile($classname(),"cu-d")
   ,print

  q

print  
  w !
  f lng="DE","HU","RU" {
    s idx=$replace("iDE,iHU,iRU","i"_lng,"_i"_lng)
    d ##class(%SQL.Statement).%ExecDirect(,$$$FormatText("select * from %IGNOREINDEX %1 dc.test2 where Lng=? order by %exact(Descr)",idx),lng).%Display()
    w !!
  }
  q
}

Storage Default
{
<Data name="test2DefaultData">
<Value name="1">
<Value>%%CLASSNAME</Value>
</Value>
<Value name="2">
<Value>Lng</Value>
</Value>
<Value name="3">
<Value>Descr</Value>
</Value>
</Data>
<DataLocation>^dc.test2D</DataLocation>
<DefaultData>test2DefaultData</DefaultData>
<IdLocation>^dc.test2D</IdLocation>
<Index name="iDE">
<Location>^dc.test2.iDE</Location>
</Index>
<Index name="iHU">
<Location>^dc.test2.iHU</Location>
</Index>
<Index name="iRU">
<Location>^dc.test2.iRU</Location>
</Index>
<IndexLocation>^dc.test2I</IndexLocation>
<StreamLocation>^dc.test2S</StreamLocation>
<Type>%Library.CacheStorage</Type>
}

}

Result:

USER>d ##class(dc.test2).Test()
^dc.test2D=9
^dc.test2D(1)=$lb("","DE","Straßer")
^dc.test2D(2)=$lb("","DE","Straster")
^dc.test2D(3)=$lb("","DE","Strasser")
^dc.test2D(4)=$lb("","HU","csak")
^dc.test2D(5)=$lb("","HU","comb")
^dc.test2D(6)=$lb("","HU","cukor")
^dc.test2D(7)=$lb("","RU","Прохор")
^dc.test2D(8)=$lb("","RU","Павел")
^dc.test2D(9)=$lb("","RU","Пётр")
^dc.test2.iDE(" DE","Strasser",3)=""
^dc.test2.iDE(" DE","Straßer",1)=""
^dc.test2.iDE(" DE","Straster",2)=""
^dc.test2.iDE(" HU","comb",5)=""
^dc.test2.iDE(" HU","csak",4)=""
^dc.test2.iDE(" HU","cukor",6)=""
^dc.test2.iDE(" RU","Павел",8)=""
^dc.test2.iDE(" RU","Прохор",7)=""
^dc.test2.iDE(" RU","Пётр",9)=""
^dc.test2.iHU(" DE","Strasser",3)=""
^dc.test2.iHU(" DE","Straster",2)=""
^dc.test2.iHU(" DE","Straßer",1)=""
^dc.test2.iHU(" HU","comb",5)=""
^dc.test2.iHU(" HU","cukor",6)=""
^dc.test2.iHU(" HU","csak",4)=""
^dc.test2.iHU(" RU","Павел",8)=""
^dc.test2.iHU(" RU","Прохор",7)=""
^dc.test2.iHU(" RU","Пётр",9)=""
^dc.test2.iRU(" DE","Strasser",3)=""
^dc.test2.iRU(" DE","Straster",2)=""
^dc.test2.iRU(" DE","Straßer",1)=""
^dc.test2.iRU(" HU","comb",5)=""
^dc.test2.iRU(" HU","csak",4)=""
^dc.test2.iRU(" HU","cukor",6)=""
^dc.test2.iRU(" RU","Павел",8)=""
^dc.test2.iRU(" RU","Пётр",9)=""
^dc.test2.iRU(" RU","Прохор",7)=""
 
ID      Descr   Lng
3       Strasser        DE
1       Straßer DE
2       Straster        DE
 
3 Rows(s) Affected
 
ID      Descr   Lng
5       comb    HU
6       cukor   HU
4       csak    HU
 
3 Rows(s) Affected
 
ID      Descr   Lng
8       Павел   RU
9       Пётр    RU
7       Прохор  RU
 
3 Rows(s) Affected

Important: if you remove "As EXACT" in the index, the examples will not work correctly for the German language, because: Capital ẞ

4 0

score 0 · Answer 1 · 2018-06-11T00:00:07-04:00

I will publish the actual workaround here later when there are other solutions or proposals
in order not to influence your creativity.
########################################################################

score 0 · Answer 2 · 2018-06-22T09:28:00-04:00

Vitaliy,

That's excellent. I had the vague feeling that it should be somewhere, somehow, ... ass these are all supported collations.

Now I have the correct solution!

Many thanks !

score 1 · Answer 3 · 2018-06-11T03:35:17-04:00

What's the purpose of a normal index over a text field? I assume user searches by a substring. There are two and a half ways to perform a fulltext search:

Define a property of %Text type and use language class. New language classes an be added fairly easily. Documentation. Removed in InterSystems IRIS and therefore not recommended.
Use iFind indices. They are highly customisable.
Use iFind indices creatively. You can build your own indices based on iFind indices and you can write your own tokenizers, etc. Also it's possible to define your own custom index. Community article on that.