Question
· Apr 13, 2023

Breaking a string into words

Is there any ObjectScript function, or other basic function, that takes a string and splits it into an array or list of words plus punctuation/spaces? I'd rather not reinvent the wheel. Say, process "It is a test, after all" into "It", space, "is", space, "a", space, "test", ", ", "after", space, "all". Or something to that effect.

Discussion (9)

Depending on the fidelity you need, something like this would work:

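set str = "abc def!  xyz"
// Characters to treat as punctuation (backslash is a literal character in ObjectScript strings)
set punctuation = "!""#$%&'()*+,-./:;<=>?@[\]^_`{|}~"
// Replace every punctuation character with a space
set strNoPunctuation = $tr(str, punctuation, $j("", $l(punctuation)))
// Trim leading and trailing spaces and collapse repeated spaces
set strDedupeWhitespaces = $zstrip(strNoPunctuation,"<=>P")
// Split on single spaces into a list
set out = $lfs(strDedupeWhitespaces, " ")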

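Another approach, simpler and likely faster. Note that it merges words across punctuation that is not followed by whitespace (for example, "end.Next" becomes "endNext"):

set str = "abc def!  xyz"
// Remove all punctuation but keep spaces
set strNoPunctuation = $zstrip(str,"*P",," ")
// Trim leading and trailing spaces and collapse repeated spaces
set strDedupeWhitespaces = $zstrip(strNoPunctuation,"<=>P")
set out = $lfs(strDedupeWhitespaces, " ")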

Check $translate, $zstrip.

If you want more fidelity/features, check the %iKnow.Stemming package.

Plus one for the $LFS approach. (I wrote this before noticing it was in Eduard's more complete solution.)
If you don't need to keep the spaces, because they're the delimiter, I find that $LFS works as a quick and dirty solution. It depends what you want the punctuation for. You can put the spaces back in when you do something with the list, if you need them.
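
A minimal sketch of that quick split, assuming words are separated by single spaces (punctuation stays attached to the neighbouring word, so "test," comes back as one element):

set str = "It is a test, after all"
set list = $lfs(str, " ")
// Walk the list one element at a time
for i=1:1:$ll(list) { write $li(list, i), ! }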
 

You could also use a regular expression with the %Regex.Matcher class

set regex = ##class(%Regex.Matcher).%New("(\w)*")

The "\w" refers to any word character (alphabetic, numeric, and connecting characters). It is wrapped in a grouping expression "()", and finally the "*" says to match 0 or more occurrences.

After calling Locate() or Match(), you can then examine the GroupCount property and the multidimensional Group property to see the results.
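
To get both the words and the separators out of the matcher, one option is to loop with Locate(). This is only a sketch under two assumptions: it swaps in a different pattern, "\w+|\W+" (my substitution, matching either a run of word characters or a run of non-word characters), and it assumes that Group with no subscript returns the whole current match:

set matcher = ##class(%Regex.Matcher).%New("\w+|\W+")
set matcher.Text = "It is a test, after all"
// Each Locate() call advances to the next match; Group holds the matched text
while matcher.Locate() { write "[", matcher.Group, "]", ! }

If that holds, the words and the separator runs should come out alternately, in their original order.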

That's the code I ended up with. Thanks for your help, everybody!

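 ; str is parsed into two arrays: wordAr (words) and sepAr (separators: spaces and punctuation)
 ; sepAr(n) holds the separator that follows wordAr(n); sepAr(0) holds any leading separator
 ; Trim leading and trailing spaces here if needed
 S L=$L(str),(currWord,currSep)="",cnt=0
 F i=1:1:L {
   S currChar=$E(str,i,i)
   I $MATCH(currChar,"\w") {
     ; Word character: extend the current word and store any pending separator
     S currWord=currWord_currChar
     I currSep'="" {
       S sepAr(cnt)=currSep,currSep=""
     }
   }
   ELSE {
     ; Non-word character: extend the current separator and store any pending word
     S currSep=currSep_currChar
     I currWord'="" {
       S cnt=cnt+1,wordAr(cnt)=currWord,currWord=""
     }
   }
 }
 ; Store whatever is still pending at the end of the string
 I currWord'="" S cnt=cnt+1,wordAr(cnt)=currWord
 I currSep'="" S sepAr(cnt)=currSep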
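
To inspect the result, something like this should work, assuming the loop above has already populated wordAr, sepAr and cnt (the $G guards against a word that has no separator stored after it):

 ; Print each word with the separator that follows it, shown in quotes
 F n=1:1:cnt W n,": ",wordAr(n)," | """,$G(sepAr(n)),"""",!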