How to find multiple substrings in the string optimally?

Example: I have a list of tags that I have to find, and a string with these and other tags separated by commas.
How to find the desired tags in the string optimally?

  • 0
  • 1
  • 142
  • 6
  • 3

Answers

There are a few ways to test the existence of a sub-string in a string...

$find(string,tag)

string[tag

$match(string,tagPatternMatcher)

$lf(list,tag)

Given...

set string="a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z"

and testing for "x"

within a tight loop with a million tests

$find  = .004609

[      = .004813

$match = .008951

$lf    = .023201

$find is marginally quicker than the contains operator with a million operations in .004609 seconds, whilst the other two methods are relatively slower. I say relative because they are still quick in their own rights.

A wider question might not be which is the fastest, but how it's implemented with multiple tag tests. For instance a single variable creation is almost as expensive as the string test itself, and if you wrapped it all in a re-usable method, the method call would be 10x as expensive.

If you are dealing with two strings in the first instance, then a tight loop on $find is going to be your most optimum option. The other functions are probably just as performent in the right context, but if you have to dissect the strings to implement them, then the cost is not the test, but in the other operations that you have to do first. 

tags=$lb("a","b","c"),
  str="c,a1,d,b,f"
   
l=$lfs(str),ptr=0
while $listnext(tags,ptr,tag{
  w:$lf(l,tagtag,!
}

Result

b
c

Run this code and check what's the best for you:

CHK ;
set string="a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z"
set lfs=$listfromstring(string)
set tags="x,d,blabla,m",tagl=$listfromstring(tags)
set stringd=","_string_","
kill arrstr for i=1:1:$length(string,",") set arrstr($piece(string,",",i))=i
kill arrtag for i=1:1:$length(tags,",") set arrtag($piece(tags,",",i))=i
;
//$order the worst
set hh=$zhorolog for i=1:1:1000000 for j=1:1:$listlength(tagl) set tag=$list(tagl,j) set hhh=$zhorolog
write !,"$ll&$li: ",$justify($fnumber(hhh-hh,"",6),10)
set hh=$zhorolog for i=1:1:1000000 for j=1:1:$length(tags,",") set tag=$piece(tags,",",j) set hhh=$zhorolog
write !,"$l&$p: ",$justify($fnumber(hhh-hh,"",6),10)
set hh=$zhorolog for i=1:1:1000000 set tag="" for  set tag=$order(arrtag(tag)) quit:tag=""  set hhh=$zhorolog
write !,"$o: ",$justify($fnumber(hhh-hh,"",6),10)
//
write !!!!
set hh=$zhorolog for j=1:1:$length(tags,",") set tag=$piece(tags,",",j) for i=1:1:1000000 if $data(arrstr(tag)) set hhh=$zhorolog
write !,"$d: ",$justify($fnumber(hhh-hh,"",6),10)
set hh=$zhorolog for j=1:1:$length(tags,",") set tag=$piece(tags,",",j),tagd=","_tag_"," for i=1:1:1000000 if stringd[(tagd) set hhh=$zhorolog
write !,"[: ",$justify($fnumber(hhh-hh,"",6),10)
set hh=$zhorolog for j=1:1:$length(tags,",") set tag=$piece(tags,",",j),tagd=","_tag_"," for i=1:1:1000000 if $find(stringd,tagd) set hhh=$zhorolog
write !,"$f: ",$justify($fnumber(hhh-hh,"",6),10)
set hh=$zhorolog for j=1:1:$length(tags,",") set tag=$piece(tags,",",j) for i=1:1:1000000 if $listfind(lfs,tag) set hhh=$zhorolog
write !,"$lf: ",$justify($fnumber(hhh-hh,"",6),10)
set hh=$zhorolog for j=1:1:$length(tags,",") set tag=$piece(tags,",",j),tagd=","_tag_"," for i=1:1:1000000 if $match(stringd,".*"_tagd_"{1}.*") set hhh=$zhorolog
write !,"$match: ",$justify($fnumber(hhh-hh,"",6),10)
quit

Actually, the case is very practical.

We need to know, which of the tags in DC relate to InterSystems Products and Services.

Every post on DC has a Tags field, which is a comma delimited string,  consists of  any of 150+ tags.

We need to form a Big Tag field, which is filtered Tags field with only the following values:  Caché, Ensemble, HealthShare, Intersystems IRIS, DeepSee, iKnow, Atelier, Online Learning, Documentation, WRC.

E.g. this particular post has Tags: Beginner, Caché

Big Tags will be: Caché

My variant is the following. General function of extracting subtag from tagsring which contains certain tags (intag):

ClassMethod SubTag(intag, tag, dlm As %String) As %String

{



for i=1:1:$L(intag,dlm) set intag($p(intag,dlm,i))=""

for i=1:1:$L(tag,dlm) {

set t=$p(tag,dlm,i)

if $D(intag(t)) set subtag($Seq(subtag))=t

}


while $d(subtag($Seq(l))) {

set $p(subtag,dlm,l) = subtag(l)

}

return $G(subtag)

}

And the calling function:

ClassMethod BigTag(tag As %String) As %String

{

set intag="Caché,Ensemble,HealthShare,Intersystems IRIS,DeepSee,iKnow,Atelier,Online Learning,Documentation,WRC"

return ..SubTag(intag,tag,",")

}

Usage:

USER>w ##class(Utils.Strings).BigTag("Beginner,Caché,Ensemble")
Caché,Ensemble
USER>w ##class(Utils.Strings).BigTag("Beginner,Caché")
Caché
USER>

As a code optimisation exercise, there is a lot going on here, two methods will create two extra stack levels, lots of unnecessary variables and extraneous operations such as $l and $p.

If you created a macro such as this...

#define isISCPS(%arg1) (%arg1["Caché")||(%arg1["Ensemble")||(%arg1["Intersystems IRIS")||(%arg1["DeepSee")||(%arg1["iKnow")||(%arg1["Atelier")||(%arg1["Online Learning")||(%arg1["Documentation")||(%arg1["WRC")

then you will get around 25x more operations per second for negative tests and upwards of 100x for short circuit positive tests, placing highest frequency tags on the left.

But on a long distance (very, very long intag string) my approach with $D will start win! (maybe :)