
Innovating for Generative Elegance

Audience

Those curious about exploring new generative AI use cases.

This article shares thoughts and rationale from training generative AI for pattern matching.

Challenge 1 - Simple but no simpler

A developer aspires to conceive an elegant solution to requirements.
Pattern-matching problems (like regular expressions) can be solved in many ways. Which is the better code solution?
Can an AI postulate an elegant pattern-match solution for a range of simple-to-complex data samples?

Consider the three string values:

  • "AA"
  • "BB"
  • "CC"

The expression "2 Alphabetic characters" matches all these values, and other intuitively similar values, in a general, flexible way.

Alternatively, the expression "AA" or "BB" or "CC" would be a very specific way to match only these values.

Another way to solve it would be: "A" or "B" or "C", twice over.
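As a minimal sketch, the three approaches can be written as Python regular expressions (the article's own notation is tool-specific; these patterns are illustrative analogues):

```python
import re

values = ["AA", "BB", "CC"]

# General: "2 Alphabetic characters"
general = re.compile(r"^[A-Za-z]{2}$")
# Specific: only the exact values seen
specific = re.compile(r"^(AA|BB|CC)$")
# Middle ground: "A" or "B" or "C", twice over
middle = re.compile(r"^[ABC]{2}$")

for v in values:
    assert general.match(v) and specific.match(v) and middle.match(v)

# Only the more general patterns accept intuitively similar unseen values
print(bool(general.match("ZZ")), bool(specific.match("ZZ")))
```

All three accept the sample, but they generalize very differently, which is exactly the elegance trade-off the challenge describes.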

Challenge 2 - Incomplete sample

Pattern problems rarely have every example specified.
A performant AI needs to accept a limited, incomplete sample of data rows and postulate a reasonable pattern-match expression.
A Turing goal would be to meet parity with human inference of a pattern for representative but incomplete data.

Better quality of sample processing is a higher priority than expanding the token window for larger sample sizes.
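One hedged sketch of such inference, assuming a hand-ranked candidate list (the candidates and specificity scores here are illustrative, not the trained model's actual method):

```python
import re

# Illustrative candidate patterns, ranked by specificity (higher = more specific)
candidates = [
    (r"[ABC]{2}", 3),     # "A", "B" or "C", twice
    (r"[A-Z]{2}", 2),     # 2 uppercase letters
    (r"[A-Za-z]{2}", 1),  # 2 alphabetic characters
]

def infer(sample):
    """Return the most specific candidate matching every observed row."""
    for pattern, _score in sorted(candidates, key=lambda c: -c[1]):
        if all(re.fullmatch(pattern, row) for row in sample):
            return pattern
    return None

# An incomplete sample ("CC" was never shown) still yields a reasonable pattern
print(infer(["AA", "BB"]))
```

The heuristic mirrors human inference: prefer the tightest pattern that still explains every row seen so far.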

 

Challenge 3 - Leverage repeating sequences

Extending the previous example to also include single-character values:

  • "A"
  • "B"
  • "C"

This seems more elegant than specifying ALL the possible values long-hand.

 if test?1(1"A",1"B",1"C",1"AA",1"BB",1"CC")
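A regular-expression analogue of this idea, as a hedged sketch (note the compact form also admits mixed pairs such as "AB", which the long-hand alternation does not):

```python
import re

values = ["A", "B", "C", "AA", "BB", "CC"]

# Long-hand: enumerate every accepted value explicitly
long_hand = re.compile(r"^(A|B|C|AA|BB|CC)$")
# Compact: "A", "B" or "C", repeated once or twice
compact = re.compile(r"^[ABC]{1,2}$")

assert all(long_hand.match(v) for v in values)
assert all(compact.match(v) for v in values)
print("both match all", len(values), "values")
```

The repeat count `{1,2}` captures the repeating sequence once, instead of spelling out every value.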

 

Challenge 4 - Delimited data bias

A common need beyond generalized patterns is to solve for delimited data, for example a random phone-number format:

213-5729-5798

This could be solved by the expression:
3 numeric, dash, 4 numeric, dash, 4 numeric

 if test?3N1"-"4N1"-"4N

This can be normalized with a repeat sequence to:

 if test?3N2(1"-"4N)

Essentially this means preferring to specify a delimiter explicitly (for example "-") rather than generalizing delimiters as punctuation characters. Generated output should therefore avoid over-generalization, for example:

 if test?3N1P4N1P4N
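In Python regular-expression terms (a sketch; the character classes are assumptions chosen to parallel the article's notation):

```python
import re

phone = "213-5729-5798"  # 3 numeric, dash, 4 numeric, dash, 4 numeric

# Explicit delimiter: the "-" is named in the pattern
explicit = re.compile(r"^\d{3}-\d{4}-\d{4}$")
# Normalized with a repeated sequence: (dash + 4 numeric) twice
normalized = re.compile(r"^\d{3}(?:-\d{4}){2}$")
# Over-generalized: any punctuation accepted as a delimiter (avoid this)
over_general = re.compile(r"^\d{3}[^\w\s]\d{4}[^\w\s]\d{4}$")

assert explicit.match(phone)
assert normalized.match(phone)
# The over-general pattern also accepts unintended delimiters
print(bool(over_general.match("213.5729.5798")))
```

The over-general form silently accepts "213.5729.5798", which the explicit and normalized forms correctly reject.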

 

Challenge 5 - Repeating sequences

Consider formatted numbers with common prefix codes.

The AI model detects three common sequences across the values and biases the solution to reflect an interest in this feature:

On this occasion the AI has decided to generate a superfluous "13" string match.

However, as indicated by the tool, the pattern will match all the values provided.

The pattern can easily be adjusted in the free text description and regenerated.

Inference Speed

Workbench AI assistance, even with qualified partial success, can accelerate implementation.
Above a complexity threshold an AI assistant can deduce proposals faster than manual analysis.
Consider the following AI inference attempt with qualified partial success:

The AI assistant uses as many rows of data as it can fit into its token context window for processing, skipping excess data rows.
The number of rows is quantified in the generated output showing how data was truncated for inference.
This can be useful to elevate preferred data rows back into the context window for refined reprocessing.
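A minimal sketch of that truncation step, assuming a crude whitespace token estimate (the budget and helper names are illustrative, not the product's internals):

```python
# Hypothetical token budget and row-fitting helper for illustration
TOKEN_BUDGET = 16

def fit_rows(rows, budget=TOKEN_BUDGET):
    """Keep as many leading rows as fit the budget; report rows skipped."""
    kept, used = [], 0
    for row in rows:
        cost = len(row.split())  # crude per-row token estimate
        if used + cost > budget:
            break  # excess rows are skipped
        kept.append(row)
        used += cost
    return kept, len(rows) - len(kept)

rows = ["213-5729-5798"] * 20
kept, skipped = fit_rows(rows)
# Quantify the truncation so preferred rows can be elevated and reprocessed
print(f"processed {len(kept)} rows, skipped {skipped}")
```

Reporting the skipped count is what lets a user move preferred rows to the front of the sample for a refined pass.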

 

Training Effort

Targeting an Nvidia CUDA A10 GPU on Huggingface.
Supervised model training.

Stage                     Continuous GPU training
Prototype base dataset    4 days
Main dataset              13 days
Second refined dataset    2 days

 

Conclusion

Single-shot generative inference with a constrained token size can usefully approach the elegance of a discrete code solution, even without chain-of-thought processing, by curating subject-expert bias into the base training data.
AI assistants can participate in iterative solution workflows.

 

Explore more

Get hands-on and explore the technology demo currently hosted via Huggingface.
The cog symbol on buttons in the demo indicates where AI generation is being employed.

The demo is written for English, French, and Spanish audiences.
