Portuguese on Pattern Match Workbench and other improvements

Alex Woodhead · Jul 1 3m read

#Artificial Intelligence (AI) #Generative AI (GenAI) #Large Language Model (LLM) #Machine Learning (ML) #InterSystems IRIS

Thank you to the community for translating an earlier article into Portuguese.
I am returning the favor with a new release of the Pattern Match Workbench demo app.

Added support for Portuguese.

The labels, buttons, feedback messages and help text of the user interface have been updated.

Pattern Descriptions can be requested for the new language.

The single AI model that transforms a user prompt into pattern match code has been fully retrained.

Values to Pattern Code Model also retrained

The separate AI model for generating Pattern match code from a sample list of values has been retrained.
The previous model release lost some of its generalized pattern identification behavior as a side effect of adding support for new delimiter-style patterns.
The new training approach was evolved to better suit both types of challenge.

Post-training benchmarking of the model stages was used to quantify which candidate best preserves generalized pattern solving.
A generated pattern is counted as 100% successful only when it matches all of its respective candidate sample records.

Overview benchmark report:

Total benchmark tests used	3895
Mean success across all matches	91.75%
Complete pattern match success	81.98%
Partial pattern match success	15.74%
Unsuccessful match records	2.28%
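
As a rough illustration of how figures like those in the overview report can be derived, the sketch below scores each generated pattern by the fraction of its sample records it matched and then aggregates. It assumes that "complete" means every record matched, "unsuccessful" means none matched, and the mean is the average of the per-pattern fractions. The function name `summarize` and the input data are hypothetical, not taken from the actual benchmark.

```python
def summarize(results):
    """Aggregate per-pattern results given as (rows_matched, sample_size) pairs."""
    total = len(results)
    # Complete: the pattern matched every one of its sample records.
    complete = sum(1 for matched, size in results if matched == size)
    # Unsuccessful: the pattern matched none of its sample records.
    unsuccessful = sum(1 for matched, _ in results if matched == 0)
    partial = total - complete - unsuccessful
    mean_fraction = sum(matched / size for matched, size in results) / total
    return {
        "tests": total,
        "mean_success_pct": round(100 * mean_fraction, 2),
        "complete_pct": round(100 * complete / total, 2),
        "partial_pct": round(100 * partial / total, 2),
        "unsuccessful_pct": round(100 * unsuccessful / total, 2),
    }


# Hypothetical results for four generated patterns.
example = [(31, 31), (31, 31), (28, 31), (0, 31)]
print(summarize(example))
```

With this definition the complete, partial and unsuccessful percentages always sum to 100, as they do in the report above.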

The following table gives examples from benchmark candidates demonstrating partial success.
It shows the percentage of sample records successfully matched to generated pattern code.

Item	Sample Size	Context window	Rows matched	% Match	Actual Generated Pattern	Pattern Template
1	31	31	28	90.3	4UN5AN2.3(4UNP2"/"1UP3.6LP1UP,2"/"1"Ç")5.11N	4UN5AN2(4PUN2"/"1PU3.6LP1PU,2"/"1"Ç")5.12N
2	31	31	17	54.8	5N5ANP3"7Æ6N1ŃS8"1(2"{{{",4"Û4"3"ĪĪċ")3P	5.6N5NPA3"7Æ6N1ŃS8"1(3"{{",4"Û4"3"ĪĪċ")3P
3	31	24	15	62.5	5.8"6Ã02"1.2LNP1.2(1"K"4.6"7499".2NP1"û5Ł¸3"4.7N)	5.8"6Ã02"3(4"51704833"3"ĹĐRÞÇ",1PN,1LP1"K"4.6"7499")1"û5Ł¸3"4.7N
4	31	19	28	90.3	3P5UN5NP5UN5UN3AN3"JŁY"1.4"5"2U1LN	3P5NU5NP4UN5NU4NA3"JŁY"1.4"5"2U1NL
5	31	10	7	22.6	4LNP3.6AN4"×_"5"¤īĩĵ®ü"5AN3AN3LNP2UN3"ċ"5LN	4NPL2.5NA4"×_"5"¤īĩĵ®ü"3.5NA3AN3PLN2UN3"ċ"5NL
6	30	9	26	86.7	5.6"ŀ¬¦"5"!pp%"1UNP5"Ù6"4.8AP3(3N,4"8¸",5UNP)5.11P	5.6"ŀ¬¦"5"!pp%"1PNU5"Ù6"4.7PA3(3N,3UP,4"8¸")4P4.5P
7	25	12	5	20.0	2"AAHH"5"30"3(4.7UP,4L)5"ĥèßłĉ,,,"4.5N	2"AAHH"5"30"5(3UP,4L,4UP,1PL)5"ĥèßłĉ,,,"4.5N
8	31	14	30	96.8	5LP4")))¦¦§÷"4.10LN5.10UNP2"gx"1UN	5PL4")))¦¦§÷"4.11NL5.10PUN2"gx"1UN

Legend

Sample Size - The number of records available in the test sample. Some patterns describe fewer than 31 possible exact matches, hence the lower values
Context Window - The maximum number of sample records that fit into the model input when generating a new pattern
Rows matched - The number of sample records that were matched by the generated pattern

Note that items 4, 6 and 8 achieve a match count greater than the context window.
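
The note above can be checked with a little arithmetic. The sketch below recomputes % Match from the table columns for items 4, 5, 6 and 8 and derives a lower bound on how many matched records the model cannot have seen. It assumes the context window rows are a subset of the sample; the function name `generalization` is made up for illustration.

```python
# (item, sample_size, context_window, rows_matched) copied from the table above.
ROWS = [
    (4, 31, 19, 28),
    (5, 31, 10, 7),
    (6, 30, 9, 26),
    (8, 31, 14, 30),
]


def generalization(item, sample_size, context_window, rows_matched):
    """Return % Match and a lower bound on matches outside the context window."""
    return {
        "item": item,
        "pct_match": round(100 * rows_matched / sample_size, 1),
        # If more rows matched than were shown to the model, at least this
        # many matched rows were never part of the prompt.
        "min_unseen_matches": max(0, rows_matched - context_window),
    }


for row in ROWS:
    print(generalization(*row))
```

For item 4, for example, at least 9 of the 28 matched records lay outside the 19-record context window, which suggests the generated pattern generalized beyond the rows it was given.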

When patterns are generated from values, the context window used and the proportion of records matched are returned.

For challenging data samples, it can be productive to retry the "Pattern from Values" button for improved results.

Showing the context window by row is useful because it gives the opportunity to move specific records up into the prompt, so that the model takes them into consideration when generating.

Benchmark inferencing was conducted via Hugging Face Spaces to scale up the available GPU capacity for timely execution.

Reporting is useful for analysing how close the LLM's proposals are, qualitatively, in terms of:
* Match type (String, Number, Alphanumeric, Uppercase-Alpha)
* Choice of quantity range (2-to-3 of ...)
* Option lists ("123" or "456")
* Structured elements using brackets

This opens up opportunities for training to be improved.
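
One way to approach this kind of closeness analysis is to split both patterns into atoms and compare them position by position. The following is a minimal sketch, not the tooling used for the benchmark. It only handles flat patterns (no bracketed option lists), and it assumes that the order of pattern codes within an atom is irrelevant, so `5UN` and `5NU` are treated as the same match type. The names `parse_atoms` and `compare` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# One atom = repeat count + (quoted literal | run of pattern-code letters).
ATOM = re.compile(r'(\d*)(\.?)(\d*)("(?:[^"]|"")*"|[A-Za-z]+)')


class Atom(NamedTuple):
    lo: int
    hi: Optional[int]  # None means no upper bound
    value: object      # str for a literal, frozenset of letters for codes


def parse_atoms(pattern):
    """Split a flat pattern into atoms; raise ValueError on unsupported syntax."""
    atoms, pos = [], 0
    while pos < len(pattern):
        m = ATOM.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported syntax at offset {pos}")
        lo, dot, hi, body = m.groups()
        if dot:  # range form such as 1.4, .4 or 1.
            low, high = int(lo or 0), (int(hi) if hi else None)
        elif lo:  # exact count such as 5
            low = high = int(lo)
        else:
            raise ValueError(f"missing repeat count at offset {pos}")
        value = body if body.startswith('"') else frozenset(body.upper())
        atoms.append(Atom(low, high, value))
        pos = m.end()
    return atoms


def compare(generated, template):
    """List (position, kind) differences; positions are 1-based."""
    diffs = []
    pairs = zip(parse_atoms(generated), parse_atoms(template))
    for i, (g, t) in enumerate(pairs, start=1):
        if g.value != t.value:
            diffs.append((i, "match type"))
        elif (g.lo, g.hi) != (t.lo, t.hi):
            diffs.append((i, "quantity"))
    return diffs


# Item 4 from the table above: generated pattern versus template.
generated = '3P5UN5NP5UN5UN3AN3"JŁY"1.4"5"2U1LN'
template = '3P5NU5NP4UN5NU4NA3"JŁY"1.4"5"2U1NL'
print(compare(generated, template))
```

For item 4 this reports quantity differences at the fourth and sixth atoms only, which indicates the model chose the right match types and literals but was off by one on two repeat counts. Note that `zip` stops at the shorter pattern, so a difference in the number of atoms would need a separate check.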

Please feel free to give feedback on improvements for:

* Language translations
* Pattern match behaviors