Article
· May 6 3m read

How to remove special characters (Unicode Characters) from text

This method helps remove special characters, such as non-UTF-8 characters (either control characters or Unicode characters), from text when they are not printable or can't be parsed by downstream systems.

The condition also includes $C(32) (space). Sometimes a non-breaking space (NBSP) appears in the text; the TIE does not recognize it, and downstream it displays as "?".

To avoid the NBSP issue, each character that matches the IF condition is replaced with a space, which prevents the error.
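As a minimal sketch of that substitution, assuming the only goal is to turn an NBSP into a regular space, $TRANSLATE can map $CHAR(160) to a space directly. The sample string and the variable names are hypothetical and are not part of the original method.

```objectscript
 // Hypothetical sample: "Hello", an NBSP ($CHAR(160)), then "World"
 Set text = "Hello"_$CHAR(160)_"World"
 // $TRANSLATE swaps every NBSP for a regular space ($CHAR(32))
 Set clean = $TRANSLATE(text, $CHAR(160), " ")
 // Position 6 held the NBSP; its character code is now 32
 Write $ASCII(clean, 6),!
```

This only handles the NBSP case; the methods below apply the same replace-with-space idea to every character that matches their condition.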

Remove Unicode characters only:

Class Test.Utility.FunctionSet Extends %RegisteredObject
{ 
ClassMethod ConvertTextToAscii(text As %String) As %String
{
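    Set (str,char) = ""

    FOR i=1:1:$LENGTH(text)
    {
      Set char = $E(text, i)
      Set ascii = $ASCII(char)

      // ascii'<33,ascii>126 reads "ascii >= 33 AND ascii > 126", so it matches codes above 126.
      // NBSP ($C(160)) is above 126 and is therefore caught here as well.
      IF ascii'<33,ascii>126 {
          // Report the character, then replace it with a space
          Write "Removing ASCII code of "_ascii_" character is: "_char,!
          Set char = " "
      }
      Set str = str_char
    }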
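    // Collapse double spaces left by the replacements (assumed intent:
    // replacing " " with "" would strip every space, unlike the output shown)
    Set str = $Replace(str,"  "," ")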
      
 QUIT str
}

}

Output:

W ##class(Test.Utility.FunctionSet).ConvertTextToAscii( "This”text is€example for × special € unicode — characters.•!!£;: with in the text? @downstream systems <removing (-)them> from {+adding some $%^'#utf-8'-_¦¬test@hotmail.com!"),!
Removing ASCII code of 8221 character is: ”
Removing ASCII code of 8364 character is: €
Removing ASCII code of 160 character is:  
Removing ASCII code of 215 character is: ×
Removing ASCII code of 8364 character is: €
Removing ASCII code of 8212 character is: —
Removing ASCII code of 8226 character is: •
Removing ASCII code of 163 character is: £
Removing ASCII code of 166 character is: ¦
Removing ASCII code of 172 character is: ¬
This text is example for special unicode characters. !! ;: with in the text? @downstream systems <removing (-)them> from {+adding some $%^'#utf-8'-_test@hotmail.com!

 

Strip control characters as well as Unicode characters from text:

ClassMethod ConvertTextToAscii(text As %String = "Testing unicode”charactersand€control characters×removing unicode—andcharacters.•and!£;:?$%^'#utf-8'-_¦¬test@hotmail.com!") As %String
{
    Set (str,char) = ""
    FOR i=1:1:$LENGTH(text)
    {
      Set char = $E(text, i)
      Set ascii = $ASCII(char)
      //W "Ascii of char "_char_" is"_ascii,!
      // Matches control characters and space (codes below 33) and anything above 126
      IF ascii<33!(ascii>126) {
          // ascii 32 is included in the condition; every match becomes a space
          Write "Removing ASCII code of "_ascii_" character is: "_char,!
          Set char = " "
      }
      Set str = str_char
    }
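    // Collapse double spaces left by the replacements (assumed intent:
    // removing every space would join the words together)
    Set str = $Replace(str,"  "," ")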
      
 QUIT str
}
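The methods above finish with a $Replace call to tidy the spaces left by the replacements. As a hedged alternative, and not part of the original article, $ZSTRIP with the "<=>W" mask strips leading and trailing whitespace and reduces repeated whitespace characters to a single occurrence. The sample string is hypothetical.

```objectscript
 // Hypothetical sample with leading, trailing, and repeated spaces
 Set str = "  This   text  has   extra spaces  "
 // "<" strips leading, ">" strips trailing, "=" collapses repeats; "W" selects whitespace
 Set tidy = $ZSTRIP(str, "<=>W")
 // Brackets make any remaining leading or trailing spaces visible
 Write "["_tidy_"]",!
```

Unlike a single $Replace pass over double spaces, this handles runs of any length, but it also trims the ends of the string, which may or may not be wanted.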


 

Discussion

If you receive "special characters, such as non-utf-8 characters either control characters or unicode characters" it means that somewhere upstream character set conversion is not properly configured/handled.

I'm not sure that removing characters from a text is a proper solution. Instead, I'd fix the problem at the source by identifying where the character set conversion is not properly configured/handled and correcting it.

Surely, after removing some characters, the text can be printed and parsed by downstream systems, but it's going to be a different text, potentially with a different meaning!