Article
· Feb 27 4m read

Insights from unstructured data using SQL Text Search

What is Unstructured Data?
Unstructured data refers to information lacking a predefined data model or organization. In contrast to structured data found in databases with clear structures (e.g., tables and fields), unstructured data lacks a fixed schema. This type of data includes text, images, videos, audio files, social media posts, emails, and more.

Why Are Insights from Unstructured Data Important?
According to an IDC (International Data Corporation) report, 80% of worldwide data is projected to be unstructured by 2025, posing a significant concern for 95% of businesses. Forbes Article
 

How Is the AI World Addressing This Issue?
Within the realm of AI, Generative AI is playing a crucial role in providing solutions for unstructured data. It excels in tasks such as extracting valuable information from text/images/videos, text summarization, and engaging with documents.

Intersystems' Solution for Unstructured Data
Intersystems IRIS presents an exceptional solution known as "SQL Text Search" for searching unstructured data. This feature facilitates semantic context searches on unstructured text data in multiple languages.

What is the advantages when using SQL Text Search ?

Rapid Search: InterSystems IRIS SQL Search quickly navigates extensive data volumes by utilizing an optimized index generation, avoiding sequential searches through the data itself.

Word-Aware Search: Unlike a basic string search, SQL Search relies on semantic structures within the text, with the word being the fundamental unit. This approach minimizes false positives arising from embedded strings or strings spanning two words.

Entity-Aware Search: SQL Search considers semantic relationships to group multiple words into entities. It can search for specified word sequences, words in proximity (regardless of sequence), and words at the beginning or end of an entity, allowing for precise contextual searches.

Language-Aware Search: Recognizing semantic relationships is language-specific, and SQL Search incorporates language models for ten natural languages. It also supports other languages without the need for dictionary or ontology creation.

Pattern Matching: SQL Search offers both wildcard and regular expression (RegEx) matching for character patterns.

Fuzzy Matching: SQL Search includes fuzzy search capabilities for near-matches, considering a calculated degree of variation from the search string. This facilitates matching even in cases of spelling errors.

Derived Matching: SQL Search employs decompounding to match root and component words, and it utilizes synonym tables to match synonymous words and phrases.

How to Implement SQL Text Search with Unstructured Data: A Practical Example with MIND

To demonstrate the implementation of SQL Text Search with unstructured data, we'll use the Microsoft News Dataset (MIND) as an example, focusing on a small training set.

Note: The examples provided use the User namespace, but you can substitute it with any suitable namespace of your choice, ensuring all operations occur within the same namespace.

Step 1: Download and Extract the Training Data
Download the MIND training dataset MIND (msnews.github.io) from MIND and extract the news.tsv file from the archive. The file is tab-delimited, and we will primarily focus on the "category," "title," and "description" fields.

Step 2: Create a Persistent Class
Create a Persistent Class named "Test.JSON" with properties "Title" (String), "Category" (String), and "Description" (String). Create an %IFind.Basic index specifically for the "Description" property to enhance text search performance. Once the index is created, compile the class.

Step 3: Load the MIND Dataset
Load the MIND dataset into the newly created class. You can populate the table using either class and object methods or SQL. The following code demonstrates loading the dataset using class and object methods.

Step 4: Verify Dataset Loading
After loading the dataset, query the table to ensure it accurately contains the expected rows. You can use either the SQL Management Portal or SQL Shell from the Terminal for this verification.

Step 5: Utilize the Index for Queries
Now, leverage the specific index to craft queries and extract insights. For instance, you can use the following queries:

Retrieve Microsoft company news:

SELECT %iFind.Highlight(Description,'"microsoft*"') FROM Test.JSON WHERE %ID %FIND search_index(DescriptionIdx,'"microsoft*"',0,'en')


Retrieve Google company news:

SELECT %iFind.Highlight(Description,'"google*"') FROM Test.JSON WHERE %ID %FIND search_index(DescriptionIdx,'"google*"',0,'en')


Retrieve combined news for Google and Microsoft:

SELECT %iFind.Highlight(Description,'"microsoft*" "google*"') FROM Test.JSON WHERE %ID %FIND search_index(DescriptionIdx,'"microsoft*" "google*"',0,'en')


Disclaimer:
It's crucial to be aware that Semantic and Analytic SQL Search indexes use InterSystems IRIS Natural Language Processing (NLP), which InterSystems has deprecated. As of the latest community edition version (2023.3.0), this feature is still functional, but it may be removed in future versions. Use SQL Text Search at your own risk.

I will write the continuation of this article with alternative option for deprecated SQL Text Search. If you like my article, please vote and provide your valuable suggestions below. Thank you!

Explanation Video

Discussion (3)2
Log in or sign up to continue

Hi Veerarajan,

Your video is available on InterSystems Developers YouTube:

⏯️Easily Extracting Insights from Unstructured Data using SQL Search

https://www.youtube.com/embed/lPJZdgpyRkE?si=BcmR3gi5XiVaAby1
[This is an embedded link, but you cannot view embedded content directly on the site because you have declined the cookies necessary to access it. To view embedded content, you would need to accept all cookies in your Cookies Settings]