Managing UTF-8 characters on the database with a REST application
Cache for UNIX (Red Hat Enterprise Linux for x86-64) 2016.2 (Build 736U) Fri Sep 30 2016 12:25:56 EDT
The server is on RedHat Linux while the browser is running on Windows (Firefox).
I am helping a partner build a complex modern HTML5 web application that talks with Caché using REST calls. They have been using %CSP.REST very successfully. Security works great as well.
Some time ago, I had defined the parameter CONVERTINPUTSTREAM = 1 on their %CSP.REST classes and that seemed to work. Their pages were sending UTF-8 strings and because of this parameter we were storing them on our database as ASCII (I guess). At least, when looking at the global and doing SELECTs on tables I could see the data there.
Here is the description of CONVERTINPUTSTREAM parameter from %CSP.Page (from where %CSP.REST inherits from):
/// original character set on input. By default (0) we do not modify these and receive them
/// as a binary stream which may need to be converted manually later. If 1 then if there
/// is a 'charset' value in the request Content-Type or mime section we will convert from this
/// charset when the input data is text based. For either json or xml data with no charset
/// this will convert from utf-8 or honor the BOM if one is present.
Parameter CONVERTINPUTSTREAM = 0;
What I didn't realize at the time was that when the same page that submitted the UTF-8 data requested the data back for displaying, we wouldn't re-convert the data back to UTF-8. We would send whatever was in the database and the browser would display it fine, unless there were special characters such as áéíóúçÁÉÍÓÚ etc... Those would be shown on the page as a weird symbol.
When I realized that, I removed the CONVERTINPUTSTREAM from my %CSP.REST page. Now, the data that is being sent as UTF-8 gets stored on the database as UTF-8. When the data is sent back to the application, it is sent also as UTF-8 and the application seems to work fine. But when looking at the global or at the tables, what I see now is something like:
As you can notice, line 9 is unreadable because it uses UTF-8 encoding while the database uses something else (ASCII?). That should read "áéíóú".
So, I am now unable, for instance, to query for descriptions that %STARTSWITH "á". At least not from the Caché Management Portal. I CAN do the search through the Web Application without problems. I can also do the search from the terminal as long as I enter an UTF-8 search string. For instance:
USER>s oRS=##class(%ResultSet).%New() USER>w oRS.Prepare("select ID from Task where Description %StartsWith ? ") 1 USER>Set tSearchString=$ZConvert("á","O","UTF8") USER>Write oRS.Execute(tSearchString) 1 USER>Write oRS.Next() 1 USER>Write oRS.GetData(1) 9
Conclusion: It should work. But I am not sure if that is the best approach.
I was thinking about go back and set CONVERTINPUTSTREAM back to 1. But then, every time someone wants to send data to the browser, we should convert it with $ZConvert(string,"O","UTF8"). That doesn't look very elegant to me.
So, my question is: Is there another way of making this work more transparently for the developer? The client is using espw locale, that is pretty much the same as enuw except for number separators, currency symbols, date formats and default collation (Spanish 5 instead of Caché Standard). I have also set the following global during the troubleshooting process:
Parameter UseSession = 1; Parameter CHARSET = "utf-8"; Parameter CONTENTTYPE = "application/json";
And the CSP page that uses this REST service is starts like this:
<!DOCTYPE html> <html> <head> <CSP:PARAMETER name="SECURITYRESOURCE" value="AUPOL_Task"/> <CSP:PARAMETER Name="CHARSET" Value="utf-8"> <title>My tasks</title> <meta charset="utf-8" /> <csp:include page="includes/libraries.csp"/> </head>