Written by

InterSystems Corporation

STAFF

Discussion Benjamin De Boe · Nov 9, 2020

Accent sensitivity and other internationalization issues

Hi,

InterSystems IRIS has long supported the obvious translation functions required to for converting to upper or lowercase to enforce case-insensitive string comparison (e.g. in ObjectScript with $zconvert) and sorting (e.g. with SQL collation functions, not to be confused with NLS collation). Customers in international contexts have at times used custom workarounds to also treat accent insensitivity or even more advanced normalization duct tape. We’re looking to address such use cases at the system and SQL level to increase convenience for this international audience, which is well represented on the Developer Community.

No rocket science here, but we hope to make this comprehensive yet at the same time keep it practical to build in and easy to use for customers (many of which will only need the basics):

Extend $zconvert() with an "A" flag (next to "U" and "L") for stripping accents based on the recommended ICU transformation. This will be implemented through an internal translation table which can be customized if need by (as has always been the case in IRIS)
Add a new %SQLGENERIC( transformation-spec [, truncate-length] ) collation function for use in DDL (defining columns or indices) or SQL queries, using parameters to control the primary "transformations" our existing collation functions offer (uppercasing, trimming whitespace, ...), plus the new accent stripping feature.

To illustrate how this would look, here's how you could express the existing collation functions using %SQLGENERIC:

Current function	Using %SQLGENERIC()
`%EXACT`	`%SQLGENERIC`
`%SQLSTRING(n)`	`%SQLGENERIC('prepend-space, strip-trailing-whitespace', n)`
`%SQLUPPER(n)`	`%SQLGENERIC('prepend-space, strip-trailing-whitespace, mode=U', n)`
`%TRUNCATE(n)`	`%SQLGENERIC('', n)`
`%MINUS`	`%SQLGENERIC('mode=-')`
`%PLUS`	`%SQLGENERIC('mode=+')`

If you have faced such requirements and/or implemented any of those creative workarounds in the past, please use the poll below or chime in how you've approached this in the past and how you think the above addresses this.

PLEASE NOTE: this is a proposal shared here for soliciting your valued feedback. We are not committing to this specific approach nor a particular target release at this point. We are awaiting your feedback, after all :-)

Thanks,
benjamin

Would this help address your Internationalization challenges?

Discussion (5)0

Add reply

Comments

Dmitry Maslennikov · Nov 9, 2020

I've never faced with accents, so, not sure about this case. But I see many useful use cases for using slugify in SQL. But this feature looks more complex in realization. There are many realizations in many languages, but no standard at all.

For the info, slug, slugify, translates string in any language to ASCII, URL compatible string.

For instance, it would help to get a cost-effective, language-independent index, but with a quite correct order.

0 0

Benjamin De Boe Nov 9, 2020 to Dmitry Maslennikov

The ICU does suggest standardization (or at least standardized nomenclature) for script transliteration, which I believe is the more boring word for slugification :-)

What are you referring to with "cost-effective"? Just the avoidance of wide characters or something else?

0 0

Dmitry Maslennikov Nov 9, 2020 to Benjamin De Boe

Yes, just store fewer bytes in the database.

0 0

Robert Cemper · Nov 10, 2020

There is my related question Multi Language Sort and the new extension may cover some very simple cases.

@Vitaliy Serdtsev provided an excellent solution as reply to the problem
including a detailed example of how to cover solve the issues I found
with basic features already available in Caché since almost ever(?).

0 0

Benjamin De Boe Nov 10, 2020 to Robert Cemper

Thanks for the reference. That's indeed a very good approach to solve the international-exact-sort question through NLS collations (see also this note). The new SQL collation described above is meant to provide an easy way to have an international-broad-brush transformation to accommodate the non-exact cases, such as using a LIKE operator that doesn't trip over a single-accent difference.

0 0