Discussion
Benjamin De Boe · Nov 9, 2020

Accent sensitivity and other internationalization issues

Hi,

InterSystems IRIS has long supported the obvious translation functions required to for converting to upper or lowercase to enforce case-insensitive string comparison (e.g. in ObjectScript with $zconvert) and sorting (e.g. with SQL collation functions, not to be confused with NLS collation). Customers in international contexts have at times used custom workarounds to also treat accent insensitivity or even more advanced normalization duct tape. We’re looking to address such use cases at the system and SQL level to increase convenience for this international audience, which is well represented on the Developer Community.

No rocket science here, but we hope to make this comprehensive yet at the same time keep it practical to build in and easy to use for customers (many of which will only need the basics):

  • Extend $zconvert() with an "A" flag (next to "U" and "L") for stripping accents based on the recommended ICU transformation. This will be implemented through an internal translation table which can be customized if need by (as has always been the case in IRIS)
  • Add a new %SQLGENERIC( transformation-spec [, truncate-length] ) collation function for use in DDL (defining columns or indices) or SQL queries,  using parameters to control the primary "transformations" our existing collation functions offer (uppercasing, trimming whitespace, ...), plus the new accent stripping feature.

To illustrate how this would look, here's how you could express the existing collation functions using %SQLGENERIC:

Current function Using %SQLGENERIC()
%EXACT %SQLGENERIC
%SQLSTRING(n) %SQLGENERIC('prepend-space, strip-trailing-whitespace', n)
%SQLUPPER(n) %SQLGENERIC('prepend-space, strip-trailing-whitespace, mode=U', n)
%TRUNCATE(n) %SQLGENERIC('', n)
%MINUS %SQLGENERIC('mode=-')
%PLUS %SQLGENERIC('mode=+')

If you have faced such requirements and/or implemented any of those creative workarounds in the past, please use the poll below or chime in how you've approached this in the past and how you think the above addresses this.

PLEASE NOTE: this is a proposal shared here for soliciting your valued feedback. We are not committing to this specific approach nor a particular target release at this point. We are awaiting your feedback, after all :-)

 

Thanks,
benjamin

Would this help address your Internationalization challenges?

Register or login to poll

Results
66.67%
(2)
I've never run into this issue, but thanks!
33.33%
(1)
Yes, this would cut my custom hacker code by a fair bit!
0%
(0)
Looks OK, but I'll stick with my custom solution, which is to ...
0%
(0)
No, this does not appear to solve the following challenge for me ...
10
0 0 5 102
Log in or sign up to continue

Replies

I've never faced with accents, so, not sure about this case. But I see many useful use cases for using slugify in SQL. But this feature looks more complex in realization. There are many realizations in many languages, but no standard at all.

For the info, slug, slugify, translates string in any language to ASCII, URL compatible string.

For instance, it would help to get a cost-effective, language-independent index, but with a quite correct order.

The ICU does suggest standardization (or at least standardized nomenclature) for script transliteration, which I believe is the more boring word for slugification :-)

What are you referring to with "cost-effective"? Just the avoidance of wide characters or something else?

There is my related question Multi Language Sort  and the new extension may cover some very simple cases.

@Vitaliy Serdtsev provided an excellent solution as reply to the problem
including a detailed example of how to cover solve the issues I found
with basic features already available in Caché since almost ever(?).

Thanks for the reference. That's indeed a very good approach to solve the international-exact-sort question through NLS collations (see also this note). The new SQL collation described above is meant to provide an easy way to have an international-broad-brush transformation to accommodate the non-exact cases, such as using a LIKE operator that doesn't trip over a single-accent difference.