Question
· Jun 17, 2019

Regular expressions - Any tools for caché?

Hi guys,

I have never used regular expressions in Caché ObjectScript before. I've read through the documentation, but I'm still struggling with a few things and would like some help.

I need to identify specific HTML tags and transform them into something else. Example:

I need to find spans like these inside a larger string:
"<span style=\""font-family: '';font-weight: bold;\"" >BOLD TEXT</span>"
"<span font-style:italic;\" >ITALIC TEXT</span>"

And then they'll become:
"<b>BOLD TEXT</b>"
"<i>ITALIC TEXT</i>"

I started off with the regex pattern from this post: https://community.intersystems.com/post/regular-expression-strip-html-tags

I played around a bit with the code, wrote some stuff further and:

1st thing is, I can't get the "BANG-MATCH" text written. Why?

Set tRegEx = "<[^>]*>"
Set htmlSnippet = "<h1>Hello1</h1><h1>Hello2</h1>"
Set regex=##class(%Regex.Matcher).%New(tRegEx)
If regex.Match(htmlSnippet) {
    WRITE "BANG-MATCH"
}

2nd: I also can't figure out how to loop through the matches. It doesn't even reach the "BANG-LOCATE" line either.

while regex.Locate() {
    WRITE "BANG-LOCATE"
    for i=1:1:regex.GroupCount {
        WRITE $lb(regex.Group(i))
    }
}

3rd thing is: is there a web tool for building regular expressions that can output the code in Caché ObjectScript? I know a few online tools, but they usually generate code only for more widespread languages such as Java, .NET, or JavaScript. Such a tool would help me work out the best expression for picking up what I need.

Thanks in advance!

Discussion (4)

You should use Locate:

Set tRegEx = "<[^>]*>"
Set htmlSnippet = "<h1>Hello1</h1><h1>Hello2</h1>"
Set regex=##class(%Regex.Matcher).%New(tRegEx)
set regex.Text = htmlSnippet
while regex.Locate() {
    write "Found ",regex.Group," at position ",regex.Start,!
}     

Also, it's not possible to parse generic HTML with regular expressions (https://stackoverflow.com/a/1732454/82675). A limited subset of HTML -- maybe.

Hey Murillo,

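As for the 1st thing, the Match() method only returns true if the whole string matches the regular expression. That is not the case here: [^>]* cannot cross a > character, so the pattern can match a single tag such as <h1>, but never the whole snippet with its several tags.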
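
To make the whole-string behaviour concrete, here is a minimal sketch using the same pattern. The variable name matcher is just for illustration, and the lines are meant to be pasted into a method body or run in the terminal:

```objectscript
// Same pattern as in the question: one tag, from < to the next >
Set matcher = ##class(%Regex.Matcher).%New("<[^>]*>")

// The whole input is a single tag, so Match() succeeds
Write matcher.Match("<h1>"),!                              // 1

// The input holds several tags plus text, so the whole string does not match
Write matcher.Match("<h1>Hello1</h1><h1>Hello2</h1>"),!    // 0
```

Match() answers "does the entire text fit the pattern?", while Locate() answers "is there a next piece of the text that fits?", which is why Locate() is the one to use for scanning.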

Groups are defined in the regular expression with ( ). They capture parts of a single match and are not a way to iterate over matches; iterating is what repeated Locate() calls are for. In the example below, I extract every <ABC>DEF</ABC> element:

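// Groups: 1 = opening tag name, 2 = text content, 3 = closing tag name
Set tRegEx = "<([^>]*)>([^<]*?)</([^>]*)>"
Set htmlSnippet = "<h1>Hello1</h1><h1>Hello2</h1>"
Set regex=##class(%Regex.Matcher).%New(tRegEx,htmlSnippet)

// Each Locate() call advances to the next match in htmlSnippet
while regex.Locate() {
    WRITE "BANG MATCH",!
    // GroupCount is the number of ( ) groups in the pattern, 3 here
    for i=1:1:regex.GroupCount {
        WRITE i_" - "_regex.Group(i),!
    }
}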
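
For the original goal of turning the styled spans into <b> and <i>, one possible approach is ReplaceAll() with a capture group. This is only a sketch under assumptions: the two patterns, the sample string, and the variable names html, bold and italic are made up for illustration, the spans are assumed to contain plain text with no nested tags, and (as noted above) this will not hold up for arbitrary HTML.

```objectscript
// Sample input (assumed shape); double quotes are doubled inside ObjectScript strings
Set html = "<span style=""font-weight: bold;"">BOLD TEXT</span> and <span style=""font-style:italic;"">ITALIC TEXT</span>"

// Group 1 captures the text between the tags; $1 re-inserts it in the replacement
Set bold = ##class(%Regex.Matcher).%New("<span[^>]*font-weight:\s*bold[^>]*>([^<]*)</span>", html)
Set html = bold.ReplaceAll("<b>$1</b>")

// Second pass over the already-rewritten string for the italic spans
Set italic = ##class(%Regex.Matcher).%New("<span[^>]*font-style:\s*italic[^>]*>([^<]*)</span>", html)
Set html = italic.ReplaceAll("<i>$1</i>")

Write html,!    // <b>BOLD TEXT</b> and <i>ITALIC TEXT</i>
```

Running one pass per style keeps each pattern simple. A span that carries both bold and italic styles would only be picked up by the first pass, so that case would need its own handling.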