Translating Macintosh escape sequences to the nearest equivalent, e.g. %u201C and %u201D are actually quote marks

JavaScript, Caché

The boss started using his new Mac laptop, and now we are getting Macintosh's own HTML escape characters stored in the database, especially in our text blocks. When we then print the text, we see things like don%u2019t instead of don't.

So, for example, instead of the apostrophe we see %u2019 being stored. We've already noticed %u2018, %u2019, %u201C, %u201D and %u2026.

I'd like to replace the %u2019 with the nearest equivalent (especially when he starts to use left and right quote marks).

so how can I efficiently start to replace the
%u2019 with the apostrophe
%u2026 with the "..."

And there are bound to be a lot of others that I haven't yet noticed (although on the boss's laptop, it would be haven%u2019t).

I've noticed the data in one specific class at the moment, but I'm assuming that if it's in one class's data, it will end up in others, so I'd like to find a "cross-class" solution.

We currently DO NOT use Unicode (never had a need to so far), but even then, the printing just prints don%u2019t instead of don't.

We have a normal basic Caché installation; we don't use Ensemble, DeepSee, etc., so I'm looking for answers that can be used anywhere.

Any ideas, please?

kevin

Answers

These escape sequences are an escaped form of Unicode characters (the %uXXXX format is what JavaScript's escape() function produces for characters outside the 8-bit range). The general principle is always that you store the text in the database as characters, i.e. not escaped at all, and you apply any escaping needed when serving this content to a client. So it appears you need to convert these escaped characters into something you can store in your 8-bit database.

In general I would suggest using Unicode, in which case you just make sure the data being sent to you is correctly converted into Unicode characters and then store those characters in the database. This would work with any characters, not just the few you are having problems with. However, it sounds like you do not want to move from 8-bit to Unicode. If that is the case, anything you do will be something of a hack, but you can use $replace on the incoming data to convert, say, "%u2019" to "'" before you store it in the database, to 'normalize' the input. This solution will only cover the few characters for which you can find a suitable replacement, but it may be enough to get by in the short term while you investigate moving to Unicode as a permanent solution.
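To make the replacement idea concrete, here is a minimal sketch in JavaScript (one of the thread's tags) rather than ObjectScript. The table `REPLACEMENTS` and the function `normalizeEscapes` are hypothetical names, and the table only holds the five sequences mentioned in the question; the same lookup could be done with a series of $replace calls on the server.

```javascript
// Hypothetical lookup table: escaped sequence -> nearest 8-bit equivalent.
// Only the sequences mentioned in the question are listed; extend as needed.
const REPLACEMENTS = {
  "%u2018": "'",   // left single quote
  "%u2019": "'",   // right single quote / apostrophe
  "%u201C": '"',   // left double quote
  "%u201D": '"',   // right double quote
  "%u2026": "...", // horizontal ellipsis
};

// Replace every known %uXXXX sequence in one pass.
// Sequences that are not in the table are left untouched so they stay visible.
function normalizeEscapes(text) {
  return text.replace(/%u[0-9A-Fa-f]{4}/g, (seq) => {
    const key = "%u" + seq.slice(2).toUpperCase();
    return Object.prototype.hasOwnProperty.call(REPLACEMENTS, key)
      ? REPLACEMENTS[key]
      : seq;
  });
}
```

Leaving unknown sequences as they are is a deliberate choice in this sketch: anything the table does not cover remains easy to spot in the stored text, so the table can be extended as new cases turn up.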

 

If you have a web application, you can enforce the code page on the client side. There are several advantages to this:

  • Conversions do not add load to the server
  • JS client libraries are well equipped to deal with various OS/browser combinations

The best approach would be to move to Unicode.
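One possible reading of the client-side suggestion, as a rough sketch only: swap the typographic characters for plain equivalents in the browser before the text is escaped and sent, so the %uXXXX sequences never reach the server. The names `PLAIN_EQUIVALENTS` and `toPlainText` are made up for illustration, and the list of characters is just the ones from the question.

```javascript
// Hypothetical map of typographic characters to plain 8-bit equivalents.
const PLAIN_EQUIVALENTS = {
  "\u2018": "'",   // left single quote
  "\u2019": "'",   // right single quote / apostrophe
  "\u201C": '"',   // left double quote
  "\u201D": '"',   // right double quote
  "\u2026": "...", // horizontal ellipsis
};

// Replace the listed characters; everything else passes through unchanged.
function toPlainText(text) {
  return text.replace(
    /[\u2018\u2019\u201C\u201D\u2026]/g,
    (ch) => PLAIN_EQUIVALENTS[ch]
  );
}

// Possible use in a page, assuming a text field held in a variable `field`:
//   field.value = toPlainText(field.value);
```

This would run on the raw characters the user typed, before any escaping is applied to the form data.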