Article
Niyaz Khafizov · Oct 8, 2018

Record linkage using InterSystems IRIS, Apache Zeppelin, and Apache Spark

Hi all. We are going to find duplicates in a dataset using Apache Spark machine learning algorithms. Note: I have done the following on Ubuntu 18.04, Python 3.6.5, Zeppelin 0.8.0, Spark 2.1.1.

Introduction
In previous articles we have done the following: The way to launch Jupyter Notebook + Apache Spark + InterSystems IRIS, Load a ML model into InterSystems IRIS, K-Means clustering of the Iris Dataset, The way to launch Apache Spark + Apache Zeppelin + InterSystems IRIS. In this series of articles, we explore machine learning and record linkage. Imagine that we merged the databases of neighboring shops. Most probably there will be records that are very similar to each other. Some records will refer to the same person, and we call them duplicates. Our purpose is to find those duplicates. Why is this necessary? First of all, to combine data from many different operational source systems into one logical data model, which can then be fed into a business intelligence system for reporting and analytics. Secondly, to reduce data storage costs. There are additional use cases as well.

Approach
What data do we have? Each row contains different anonymized information about one person: family names, given names, middle names, dates of birth, several documents, etc. The first step is to look at the number of records, because we are going to make pairs. The number of pairs equals n*(n-1)/2, so with 5,000 records the number of pairs is already 12,497,500. That is still manageable, so we can pair every record with every other. But if you have 50,000, 100,000 or more records, the number of pairs is more than a billion. This number of pairs is hard to store and work with, so if you have a lot of records it is a good idea to reduce it. We will do that by selecting potential duplicates. A potential duplicate is a pair that might be a duplicate. We will detect them based on several simple conditions. A specific condition might look like: (record1.familyName == record2.familyName) & (record1.givenName == record2.givenName) & (record1.dateOfBirth == record2.dateOfBirth), but keep in mind that overly strict conditions can make you miss duplicates. I think the optimal solution is to choose the important features and combine no more than two of them with the & operator. You should also normalize each feature to one shape beforehand. For example, there are several ways to store dates (1985-10-10, 10/10/1985, etc.); convert them all to a single format such as 10-10-1985 (month-day-year). The next step is to label part of the dataset. We will randomly choose, for example, 5,000-10,000 pairs (or more, if you are sure you can label all of them), save them to IRIS, and label these pairs in Jupyter. (Unfortunately, I didn't find an easy and convenient way to do it; you can also label them in a PySpark console or wherever you want.) After that, we will build a feature vector for each pair. During the labeling process you probably noticed which features are important and what values they take, so test different approaches to creating feature vectors. Test different machine learning models as well. I chose a random forest model based on its test results (accuracy/precision/recall/etc.), but you can also try decision trees, Naive Bayes, or other classification models and choose the one that works best. Test the result; if you are not satisfied, try changing the feature vectors or the ML model. Finally, feed all pairs to the model and look at the result.
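Before diving into the implementation, here is a minimal PySpark sketch (not from the original article) of the two preparation ideas above: the n*(n-1)/2 growth of candidate pairs and normalizing dates to a single shape before pairing. The sample data and column names are invented, and the format argument to to_date assumes Spark 2.2 or later.

from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce, to_date, date_format, col

spark = SparkSession.builder.appName("linkage-prep").getOrCreate()

# Toy data: the same person stored with two different date layouts
df = spark.createDataFrame(
    [("Ivanov", "1985-10-10"), ("Ivanov", "10/10/1985"), ("Petrov", "1990-01-02")],
    ["familyName", "dob"])

# Without blocking, the number of unordered pairs grows as n*(n-1)/2
n = df.count()
print("candidate pairs without blocking:", n * (n - 1) // 2)

# Normalize both date layouts to one month-day-year shape before building pairs
normalized = df.withColumn(
    "dob_norm",
    date_format(coalesce(to_date(col("dob"), "yyyy-MM-dd"),
                         to_date(col("dob"), "MM/dd/yyyy")),
                "MM-dd-yyyy"))
normalized.show()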
Implementation Load a dataset: %pysparkdataFrame=spark.read.format("com.intersystems.spark").option("url", "IRIS://localhost:51773/******").option("user", "*******").option("password", "*********************").option("dbtable", "**************").load() Clean the dataset. For example, null (check every row) or useless columns: %pysparkcolumns_to_drop = ['allIdentityDocuments', 'birthCertificate_docSource', 'birthCertificate_expirationDate', 'identityDocument_expirationDate', 'fullName']droppedDF = dataFrame.drop(*columns_to_drop) Prepare the dataset for making pairs: %pysparkfrom pyspark.sql.functions import col# rename columns namesreplacements1 = {c : c + '1' for c in droppedDF.columns}df1 = droppedDF.select([col(c).alias(replacements1.get(c, c)) for c in droppedDF.columns])replacements2 = {c : c + '2' for c in droppedDF.columns}df2 = droppedDF.select([col(c).alias(replacements2.get(c, c)) for c in droppedDF.columns]) To make pairs we will use join function with several conditions. %pysparktestTable = (df1.join(df2, (df1.ID1 < df2.ID2) & ( (df1.familyName1 == df2.familyName2) & (df1.givenName1 == df2.givenName2) | (df1.familyName1 == df2.familyName2) & (df1.middleName1 == df2.middleName2) | (df1.familyName1 == df2.familyName2) & (df1.dob1 == df2.dob2) | (df1.familyName1 == df2.familyName2) & (df1.snils1 == df2.snils2) | (df1.familyName1 == df2.familyName2) & (df1.addr_addressLine1 == df2.addr_addressLine2) | (df1.familyName1 == df2.familyName2) & (df1.addr_okato1 == df2.addr_okato2) | (df1.givenName1 == df2.givenName2) & (df1.middleName1 == df2.middleName2) | (df1.givenName1 == df2.givenName2) & (df1.dob1 == df2.dob2) | (df1.givenName1 == df2.givenName2) & (df1.snils1 == df2.snils2) | (df1.givenName1 == df2.givenName2) & (df1.addr_addressLine1 == df2.addr_addressLine2) | (df1.givenName1 == df2.givenName2) & (df1.addr_okato1 == df2.addr_okato2) | (df1.middleName1 == df2.middleName2) & (df1.dob1 == df2.dob2) | (df1.middleName1 == df2.middleName2) & (df1.snils1 == df2.snils2) | (df1.middleName1 == df2.middleName2) & (df1.addr_addressLine1 == df2.addr_addressLine2) | (df1.middleName1 == df2.middleName2) & (df1.addr_okato1 == df2.addr_okato2) | (df1.dob1 == df2.dob2) & (df1.snils1 == df2.snils2) | (df1.dob1 == df2.dob2) & (df1.addr_addressLine1 == df2.addr_addressLine2) | (df1.dob1 == df2.dob2) & (df1.addr_okato1 == df2.addr_okato2) | (df1.snils1 == df2.snils2) & (df1.addr_addressLine1 == df2.addr_addressLine2) | (df1.snils1 == df2.snils2) & (df1.addr_okato1 == df2.addr_okato2) | (df1.addr_addressLine1 == df2.addr_addressLine2) & (df1.addr_okato1 == df2.addr_okato2) ))) Check the size of returned dataframe: %pysparkdroppedColumns = ['prevIdentityDocuments1', 'birthCertificate_docDate1', 'birthCertificate_docNum1', 'birthCertificate_docSer1', 'birthCertificate_docType1', 'identityDocument_docDate1', 'identityDocument_docNum1', 'identityDocument_docSer1', 'identityDocument_docSource1', 'identityDocument_docType1', 'prevIdentityDocuments2', 'birthCertificate_docDate2', 'birthCertificate_docNum2', 'birthCertificate_docSer2', 'birthCertificate_docType2', 'identityDocument_docDate2', 'identityDocument_docNum2', 'identityDocument_docSer2', 'identityDocument_docSource2', 'identityDocument_docType2'] print(testTable.count())testTable.drop(*droppedColumns).show() # I dropped several columns just for show() function Randomly take a part of the dataframe: %pysparkrandomDF = testTable.sample(False, 0.33, 0)randomDF.write.format("com.intersystems.spark").\option("url", 
"IRIS://localhost:51773/DEDUPL").\option("user", "*****").option("password", "***********").\option("dbtable", "deduplication.unlabeledData").save() Label pairs in Jupyter Run the following (it will widen the cells). from IPython.core.display import display, HTMLdisplay(HTML("<style>.container { width:100% !important; border-left-width: 1px !important; resize: vertical}</style>")) Load dataframe: unlabeledDF = spark.read.format("com.intersystems.spark").option("url", "IRIS://localhost:51773/DEDUPL").option("user", "********").option("password", "**************").option("dbtable", "deduplication.unlabeledData").load() Return all the elements of the dataset as a list: rows = labelledDF.collect() The convenient way to display pairs: from IPython.display import clear_outputfrom prettytable import PrettyTablefrom collections import OrderedDict def printTable(row): row = OrderedDict((k, row.asDict()[k]) for k in newColumns) table = PrettyTable() column_names = ['Person1', 'Person2'] column1 = [] column2 = [] i = 0 for key, value in row.items(): if key != 'ID1' and key != 'ID2' and key != "prevIdentityDocuments1" and key != 'prevIdentityDocuments2' and key != "features": if (i < 20): column1.append(value) else: column2.append(value) i += 1 table.add_column(column_names[0], column1) table.add_column(column_names[1], column2) print(table) List where we will store rows: listDF = [] The labeling process: from pyspark.sql import Rowfrom IPython.display import clear_outputimport time# 3000 - 4020for number in range(3000 + len(listDF), len(rows)): row = rows[number] if (len(listDF) % 10) == 0: print(3000 + len(listDF)) printTable(row) result = 0 label = 123 while True: result = input('duplicate? y|n|stop') if (result == 'stop'): break elif result == 'y': label = 1.0 break elif result == 'n': label = 0.0 break else: print('only y|n|stop') continue if result == 'stop': break tmp = row.asDict() tmp['label'] = label newRow = Row(**tmp) listDF.append(newRow) time.sleep(0.2) clear_output() Create a dataframe again: newColumns.append('label')labelledDF = spark.createDataFrame(listDF).select(*newColumns) Save it to IRIS: labeledDF.write.format("com.intersystems.spark").\option("url", "IRIS://localhost:51773/DEDUPL").\option("user", "***********").option("password", "**********").\option("dbtable", "deduplication.labeledData").save() Feature vector and ML model Load a dataframe into Zeppelin: %pysparklabeledDF = spark.read.format("com.intersystems.spark").option("url", "IRIS://localhost:51773/DEDUPL").option("user", "********").option("password", "***********").option("dbtable", "deduplication.labeledData").load() Feature vector generation: %pysparkfrom pyspark.sql.functions import udf, structimport stringdistfrom pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType, ArrayType, FloatType, DoubleType, LongType, NullTypefrom pyspark.ml.linalg import Vectors, VectorUDTimport roman translateMap = {'A' : 'А', 'B' : 'В', 'C' : 'С', 'E' : 'Е', 'H' : 'Н', 'K' : 'К', 'M' : 'М', 'O' : 'О', 'P' : 'Р', 'T' : 'Т', 'X' : 'Х', 'Y' : 'У'} column_names = testTable.drop('ID1').drop('ID2').columnscolumnsSize = len(column_names)//2 def isRoman(numeral): numeral = numeral.upper() validRomanNumerals = ["M", "D", "C", "L", "X", "V", "I", "(", ")"] for letters in numeral: if letters not in validRomanNumerals: return False return True def differenceVector(params): differVector = [] for i in range(0, 3): if params[i] == None or params[columnsSize + i] == None: differVector.append(0.0) elif params[i] == 
'НЕТ' or params[columnsSize + i] == 'НЕТ': differVector.append(0.0) elif params[i][:params[columnsSize + i].find('-')] == params[columnsSize + i][:params[columnsSize + i].find('-')] or params[i][:params[i].find('-')] == params[columnsSize + i][:params[i].find('-')]: differVector.append(0.0) else: differVector.append(stringdist.levenshtein(params[i], params[columnsSize+i])) for i in range(3, columnsSize): # snils if i == 5 or i == columnsSize + 5: if params[i] == None or params[columnsSize + i] == None or params[i].find('123-456-789') != -1 or params[i].find('111-111-111') != -1 \ or params[columnsSize + i].find('123-456-789') != -1 or params[columnsSize + i].find('111-111-111') != -1: differVector.append(0.0) else: differVector.append(float(params[i] != params[columnsSize + i])) # birthCertificate_docNum elif i == 10 or i == columnsSize + 10: if params[i] == None or params[columnsSize + i] == None or params[i].find('000000') != -1 or params[i].find('000000') != -1 \ or params[columnsSize + i].find('000000') != -1 or params[columnsSize + i].find('000000') != -1: differVector.append(0.0) else: differVector.append(float(params[i] != params[columnsSize + i])) # birthCertificate_docSer elif i == 11 or i == columnsSize + 11: if params[i] == None or params[columnsSize + i] == None: differVector.append(0.0) # check if roman or not, then convert if roman else: docSer1 = params[i] docSer2 = params[columnsSize + i] if isRoman(params[i][:params[i].index('-')]): docSer1 = str(roman.fromRoman(params[i][:params[i].index('-')])) secPart1 = '-' for elem in params[i][params[i].index('-') + 1:]: if 65 <= ord(elem) <= 90: secPart1 += translateMap[elem] else: secPart1 = params[i][params[i].index('-'):] docSer1 += secPart1 if isRoman(params[columnsSize + i][:params[columnsSize + i].index('-')]): docSer2 = str(roman.fromRoman(params[columnsSize + i][:params[columnsSize + i].index('-')])) secPart2 = '-' for elem in params[columnsSize + i][params[columnsSize + i].index('-') + 1:]: if 65 <= ord(elem) <= 90: secPart2 += translateMap[elem] else: secPart2 = params[columnsSize + i][params[columnsSize + i].index('-'):] break docSer2 += secPart2 differVector.append(float(docSer1 != docSer2)) elif params[i] == 0 or params[columnsSize + i] == 0: differVector.append(0.0) elif params[i] == None or params[columnsSize + i] == None: differVector.append(0.0) else: differVector.append(float(params[i] != params[columnsSize + i])) return differVector featuresGenerator = udf(lambda input: Vectors.dense(differenceVector(input)), VectorUDT()) %pysparknewTestTable = testTable.withColumn('features', featuresGenerator(struct(*column_names))) # all pairsdf = df.withColumn('features', featuresGenerator(struct(*column_names))) # labeled pairs Split labeled dataframe into training and test dataframes: %pysparkfrom pyspark.ml import Pipelinefrom pyspark.ml.classification import RandomForestClassifierfrom pyspark.ml.feature import IndexToString, StringIndexer, VectorIndexerfrom pyspark.ml.evaluation import MulticlassClassificationEvaluator # split labelled data into two sets(trainingData, testData) = df.randomSplit([0.7, 0.3]) Train a RF model: %pysparkfrom pyspark.ml.classification import RandomForestClassifier rf = RandomForestClassifier(labelCol='label', featuresCol='features') pipeline = Pipeline(stages=[rf]) model = pipeline.fit(trainingData) # Make predictions.predictions = model.transform(testData)# predictions.select("predictedLabel", "label", "features").show(5) Test the RF model: %pysparkTP = int(predictions.select("label", 
"prediction").where((col("label") == 1) & (col('prediction') == 1)).count())TN = int(predictions.select("label", "prediction").where((col("label") == 0) & (col('prediction') == 0)).count())FP = int(predictions.select("label", "prediction").where((col("label") == 0) & (col('prediction') == 1)).count())FN = int(predictions.select("label", "prediction").where((col("label") == 1) & (col('prediction') == 0)).count())total = int(predictions.select("label").count()) print("accuracy = %f" % ((TP + TN) / total))print("precision = %f" % (TP/ (TP + FP))print("recall = %f" % (TP / (TP + FN)) How it looks: Use the RF model on all the pairs: %pysparkallData = model.transform(newTestTable) Check how many duplicates are found: %pysparkallData.where(col('prediction') == 1).count() Or look at the dataframe: Conclusion This approach is not ideal. You can make it better by experimenting with feature vectors, a model or increasing the size of labeled dataset. Also, you can do the same to find duplicates, for example, in shops database, historical research, etc... Links Apache Zeppelin Jupyter Notebook Apache Spark Record Linkage ML models The way to launch Jupyter Notebook + Apache Spark + InterSystems IRIS Load a ML model into InterSystems IRIS K-Means clustering of the Iris Dataset The way to launch Apache Spark + Apache Zeppelin + InterSystems IRIS GitHub
Article
Eduard Lebedyuk · May 14, 2018

Continuous Delivery of your InterSystems solution using GitLab - Index

In this series of articles, I'd like to present and discuss several possible approaches to software development with InterSystems technologies and GitLab. I will cover such topics as:
First article: Git basics, why a high-level understanding of Git concepts is important for modern software development, and how Git can be used to develop software (Git flows)
Second article: GitLab Workflow - a complete software life cycle process from idea to user feedback; Continuous Delivery - a software engineering approach in which teams produce software in short cycles, ensuring that the software can be reliably released at any time, aiming to build, test, and release software faster and more frequently
Third article: GitLab installation and configuration; connecting your environments to GitLab
Fourth article: Continuous delivery configuration
Fifth article: Containers and how (and why) they can be used
Sixth article: Main components for a continuous delivery pipeline with containers and how they all work together
Seventh article: Continuous delivery configuration with containers
Eighth article: Continuous delivery configuration with InterSystems Cloud Manager
Ninth article: Container architecture
Tenth article: CI/CD for Configuration and Data
Eleventh article: Interoperability and CI/CD
Twelfth article: Dynamic Inactivity Timeouts
In this series I covered general approaches to Continuous Delivery. It is an extremely broad topic, and this series should be seen as a collection of recipes rather than something definitive. If you want to automate the building, testing, and delivery of your application, Continuous Delivery in general and GitLab in particular are the way to go. Continuous Delivery and containers allow you to customize your workflow as you need it.
Announcement
Anastasia Dyubaylo · Sep 13, 2019

New Video: JSON and XML persistent data serialization in InterSystems IRIS

Hi Everyone! A new video, recorded by @Stefan.Wittmann, is already on InterSystems Developers YouTube: JSON and XML persistent data serialization in InterSystems IRIS. Need to work with JSON or XML data? InterSystems IRIS supports multiple inheritance and provides several built-in tools to easily convert between XML, JSON, and objects as you go. Learn more about the multi-model development capabilities of InterSystems IRIS on the Learning Services sites. Enjoy watching the video! Can confirm that the %JSON.Adaptor tool is extremely useful! This was such a great addition to the product. In Application Services, we've used it to build a framework which allows us to not only expose our persistent classes via REST but also authorize different levels of access for different representations of each class (for example, all the properties vs. just the Name and the Id). The "Mappings and Parameters" feature is especially useful: https://irisdocs.intersystems.com/irislatest/csp/docbook/DocBook.UI.Page.cls?KEY=GJSON_adaptor Also, @Stefan, are you writing backwards while you talk? That's impressive. Anyone who is doubting multiple inheritance is insane. Although calling this kind of inheritance 'mixin classes' helps, I've noticed - mixing in additional features. https://hackaday.com/tag/see-through-whiteboard/
Question
Scott Roth · Oct 14, 2019

Integrating InterSystems IRIS with Source Control Systems (GitHub/Azure)

I am currently evaluating Source Control systems that we can use for MS SQL, MS Visual Studio, and InterSystems IRIS. For both MS SQL and MS Visual Studio we have the option of either Azure or GitHub. I understand that when we upgrade to IRIS 2019.1 we have options for Source Control, and at previous Global Summits I have heard GitHub discussed. So why can't I use GitHub for both MS SQL/MS Visual Studio and IRIS? A couple of questions come to mind when starting to think about Source Control: When integrating Source Control in an IRIS environment, is that source control just used for code written in Studio, or can it be used for DTLs, Business Processes, the Schema Editor, etc.? Has anyone integrated IRIS with GitHub? Can you please provide examples? How secure is the Source Control if you integrate it with GitHub? Just trying to figure out the better route and whether we can kill two birds with one stone. Thanks, Scott Roth
There is one exception though: if you are using some IRIS UI tools, e.g. to develop productions, you need to manage exporting/importing these artefacts into files so they can be committed to GitHub, preferably automatically (e.g. on each Save operation).
Scott, could you please specify whether you are going to use Visual Studio or Visual Studio Code? Both are from Microsoft, but they are completely different products. I know nothing about MS SQL. Visual Studio (not Code) is not officially supported for IRIS; there was one project, but it is already closed. Visual Studio Code (VSCode) itself supports GitHub out of the box very easily: just edit files locally, commit, and push changes. VSCode-ObjectScript is my extension for VSCode; it adds support for Caché, Ensemble, and IRIS, any version from 2016.2 (where Atelier support was added). It perfectly supports classes and routines. As for "DTL, Business Process, Schema Editor, etc.", it does not have native support for them, but all of that is based on classes, so you can edit them as classes. IRIS itself does not need to have support for GitHub; this is a task for an editor. How secure is the Source Control if you integrate it with GitHub? What do you mean here? Developers should manually choose what to commit to source control; it is not an automated task.
Hi Scott! There is no need to integrate IRIS with GitHub. It's more about how the IDE you are using to develop IRIS solutions is integrated with GitHub. And the majority of modern IDEs are integrated with GitHub already: VSCode comes with Git/GitHub integration out of the box, and I believe Visual Studio does too (since GitHub is part of Microsoft now). If the question is how you can develop IRIS solutions with the code managed in GitHub, there are a lot of approaches. You can check these videos made by myself which illustrate: how to create an IRIS application in GitHub, develop it in VSCode, and commit changes into the repo; and how to collaborate and make changes to an already existing project in a GitHub repo using GitHub Flow. And: Atelier can be integrated with Git; Studio also has integration with Git.
We own our code and cannot allow it to be on another party's site (GitHub), so our tech stack is much more interesting to work with. It's not a problem at all. You can use on-premises versions of GitHub, GitLab, Bitbucket, or anything else, depending on your budget and needs. It is a problem if you maintain full ownership of your code. To make sure you maintain full ownership of your code you would have to use your own in-house repo, is what I meant.
Not my policy, it's my company's. So, company policy forces you to keep all the source code only in Caché? You can install your own source control server, even GitHub. It will be completely your own server, hosted anywhere you decide, with no ability to connect from outside if you need that. So, yes, I am still sure it is not a problem at all. I have worked at a company with two network contours: one for development with no internet access, completely isolated, and another network for the outside world. We had to use two PCs for our work, and we were still able to use source control.
Announcement
Anastasia Dyubaylo · Sep 4, 2019

[September 18, 2019] Upcoming Webinar: InterSystems MLToolkit: AI Robotization

Hey Developers! We are pleased to invite you to the upcoming webinar "InterSystems MLToolkit: AI Robotization" on the 18th of September at 10:00 (GMT+3)! The Machine Learning (ML) Toolkit is a set of extensions to implement machine learning and artificial intelligence on the InterSystems IRIS Data Platform. As part of this webinar, InterSystems Sales Engineers @Sergey Lukyanchikov and @Eduard Lebedyuk plan to present an approach to the robotization of these tasks, i.e. to ensure their autonomous, adaptive execution within the parameters and rules you specify. Self-learning neural networks, self-monitoring analytical processes, and agency of analytical processes are the main subjects of this webinar. The webinar is aimed both at experts in Data Science, Data Engineering, and Robotic Process Automation, and at those who are just discovering the world of artificial intelligence and machine learning. We are waiting for you at our event! Date: 18 September, 10:00 – 11:00 (GMT+3). Note: The language of the webinar is Russian. Register for FREE today!
Announcement
Olga Zavrazhnova · Nov 8, 2019

Review InterSystems IRIS on G2 - new challenge on Global Masters

Hi Developers, New challenge on Global Masters: Review InterSystems IRIS on G2 and get 3000 points! The page for InterSystems IRIS is new on G2, and we very much need your voices and experience to be shared with the worldwide audience! Write a review on G2 using this link, complete the challenge on Global Masters, and get 3000 points after your review is published! Please check the additional information about Global Masters: How to join InterSystems Global Masters Global Masters Badges Descriptions Global Masters Levels Descriptions Changes in Global Masters Program How to earn points on Global Masters If you have not joined the InterSystems Global Masters Advocacy Hub yet, let's get started right now! Feel free to ask your questions in the comments to this post.
Article
sween · Nov 7, 2019

Export InterSystems IRIS Data to BigQuery on Google Cloud Platform

Loading your IRIS data to your Google Cloud BigQuery data warehouse and keeping it current can be a hassle with bulky commercial third-party off-the-shelf ETL platforms, but it is made dead simple using the iris2bq utility. Let's say IRIS is contributing to the workload for a hospital system, routing DICOM images, ingesting HL7 messages, posting FHIR resources, or pushing CCDAs to the next provider in a transition of care. Natively, IRIS persists these objects in various stages of the pipeline via the nature of the business processes and anything you included along the way. Let's send that up to Google BigQuery to augment and complement the rest of our data warehouse data and ETL (Extract Transform Load) or ELT (Extract Load Transform) to our heart's desire. A reference architecture diagram may be worth a thousand words, but 3 bullet points may work out a little bit better: It exports the data from IRIS into DataFrames. It saves them into GCS as .avro to keep the schema along with the data: this avoids having to specify/create the BigQuery table schema beforehand. It starts BigQuery jobs to import those .avro files into the respective BigQuery tables you specify. Under the hood, iris2bq is using the Spark framework for the sake of simplicity, but no Hadoop cluster is needed. It is configured as a "local" cluster by default, meaning the application is running standalone. The tool is meant to be launched on an interval, either through cron or something like Airflow. All you have to do is point it at your IRIS instance and tell it what tables you want to sync to BigQuery; then they magically sync to an existing dataset, or it creates a new one that you specify. How To Setup And if a reference architecture and 3 bullet points didn't do a good job explaining it, maybe actually running it will: Google Cloud Setup You can do this any way you want; here are a few options for you, but all you have to do in GCP is: Create a Project Enable the APIs for BigQuery and Cloud Storage Create a service account with access to create resources and download the json file. Using the Google Cloud Console (Easiest) https://cloud.google.com Using gcloud (Impress Your Friends): gcloud projects create iris2bq-demo --enable-cloud-apis With Terraform (Coolest): Create a main.tf file after modifying the values: // Create the GCP Project resource "google_project" "gcp_project" { name = "IRIS 2 Big Query Demo" project_id = "iris2bq-demo" // You'll need this org_id = "1234567" } // Enable the APIS resource "google_project_services" "gcp_project_apis" { project = "iris2bq-demo" services = ["bigquery.googleapis.com", "storage.googleapis.com"] } Then do a: terraform init terraform plan terraform apply IRIS Setup Let's quickly jam some data into IRIS for a demonstration. Create a class like so: Class User.People Extends (%Persistent, %Populate) { Property ID As %String; Property FirstName As %String(POPSPEC = "NAME"); Property LastName As %String(POPSPEC = "NAME"); } Then run the populate to generate some data. USER>do ##class(User.People).Populate(100000) Alternatively, you can grab an irissession, ensure you are in the USER namespace and run the following commands. USER> SET result=$SYSTEM.SQL.Execute("CREATE TABLE People(ID int, FirstName varchar(255), LastName varchar(255))") USER> for i=1:1:100000 { SET result=$SYSTEM.SQL.Execute("INSERT INTO People VALUES ("_i_", 'First"_i_"', 'Last"_i_"')") } Both routes will create a table called "People" and insert 100,000 rows.
Either way you go, if everything worked out, you should be able to query for some dummy rows in IRIS. These are the rows we are sending to BigQuery. IRIS2BQ Setup Download the latest release of the utility iris2bq, and unzip it. Then cd to the `bin` directory, move your credentials over to the root of this directory, and create a configuration.conf file as below in the same root. Taking a look at the configuration file below, you can get an idea of how the utility works. Specify a jdbc url and the credentials for the system user. Give it a list of tables that you want to appear in BigQuery. Tell the utility which project to point to and the location of your credentials file. Then tell it a target BigQuery dataset and a target bucket to write the .avro files to. Quick note on the GCP block: the dataset and bucket do not have to exist yet, as the utility will create those resources for you. jdbc { url = "jdbc:IRIS://127.0.0.1:51773/USER" user = "_SYSTEM" password = "flounder" // the password is flounder tables = [ "people" ] //IRIS tables to send to big query } gcloud { project = "iris2bq-demo" service-account-key-path = "service.key.json" //gcp service account bq.dataset = "iris2bqdemods" // target bq dataset gcs.tmp-bucket = "iris2bqdemobucket" //target storage bucket } Run At this point we should be parked at our command prompt in the root of the utility, with the conf file we created and the json credentials file. Now that we have all that in place, let's run it and check the result. $ export GOOGLE_CLOUD_PROJECT=iris2bq-demo $ export GOOGLE_APPLICATION_CREDENTIALS=service.key.json $ ./iris2bq -Dconfig.file=configuration.conf The output is a tad chatty, but if the import was successful it will state `people import done!` Let's head over to BigQuery and inspect our work... The baseNUBE team hopes you found this helpful! Now set up a job to run it on an interval and JOIN all over your IRIS data in BigQuery! The article is considered as InterSystems Data Platform Best Practice. Hi Ron, thanks for this great article. There's a typo which creates a wondering question about the potentiality of Google Cloud : Using the Google Cloud Console (Easiest) https://could.google.com fixed, thank you!
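Not part of the original article, but as a quick sanity check after a run, the row count in BigQuery can be queried from Python with the google-cloud-bigquery client library. The project and dataset names below simply reuse the values from the example configuration above; adjust them to your own setup.

from google.cloud import bigquery

# Uses the same service account credentials exported via GOOGLE_APPLICATION_CREDENTIALS
client = bigquery.Client(project="iris2bq-demo")
query = "SELECT COUNT(*) AS n FROM `iris2bqdemods.people`"
for row in client.query(query).result():
    print("rows in BigQuery:", row.n)

If the sync worked, the count should match the number of rows loaded into the People table in IRIS.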
Announcement
Jeff Fried · Nov 4, 2019

InterSystems IRIS and IRIS for Health 2019.1.1 now available

The 2019.1.1 versions of InterSystems IRIS and IRIS for Health are now Generally Available! These are maintenance releases in the EM (Extended Maintenance) stream. The changes are reflected in the 2019.1 documentation, which is available online and features a new look including a card-style TOC layout. The build number for these releases is 2019.1.1.612.0. A full set of kits and containers for both products is available from the WRC Software Distribution site, including community editions of InterSystems IRIS and IRIS for Health. This release also adds support for Red Hat Enterprise Linux 8, in addition to the previously supported platforms detailed in the 2019.1 Supported Platforms document. InterSystems IRIS Data Platform 2019.1.1 includes maintenance updates in a number of areas, as described in the online documentation here. It also includes three new features, described in the online documentation here: support for the InterSystems API Manager, X12 element validation, and in-place conversion from Caché and Ensemble to InterSystems IRIS. IRIS for Health 2019.1.1 includes maintenance updates in a number of areas, as described in the online documentation here. It also includes two new features, described in the online documentation here: support for the InterSystems API Manager and X12 element validation.
Announcement
Michelle Spisak · Oct 22, 2019

New Videos! High Speed, Multi-Model Capabilities of InterSystems IRIS™

The Learning Services Online Learning team has posted new videos to help you learn the benefits of InterSystems IRIS. Take a peek to see what you stand to gain from making the switch to InterSystems IRIS!
Why Multi-Model?
Stefan Wittmann presents use cases for the multi-model data access of InterSystems IRIS data platform. He shows the multi-model architecture that allows you to use the data model that best fits each task in your application — relational, object, or even direct/native access — all accessible through the language of your choice.
The Speed and Power of InterSystems IRIS
InterSystems IRIS powers many of the world’s most powerful applications — applications that require both speed and power for ingesting massive amounts of data, in real time, at scale. Learn about these features and more in this video!
Announcement
Evgeny Shvarov · Dec 2, 2019

ObjectScript Template for Advent Of Code 2019 and InterSystems IRIS in Docker

Hi Developers! For those who want to participate in the Advent of Code 2019 and code with ObjectScript in IRIS, I created a very simple but handy GitHub template. Use the green button to copy the template into your own repo, clone the repo, and run the following in the repo folder: docker-compose up -d You will get InterSystems IRIS 2019.4 Community Edition running with the template classes to load input data from files and a Day 1 solution. This is also set up so you can start crafting solutions for Advent of Code 2019 and edit, compile, and debug ObjectScript with the VSCode add-on. Happy coding with Advent of Code 2019!
Announcement
Anastasia Dyubaylo · Dec 6, 2018

Review InterSystems IRIS or Caché and get two $25 Visa Cards!

Hey Developers! Good news! Just in time for the holidays, Gartner Peer Insights is offering customers a $25 digital Visa Gift Card for an approved review of InterSystems IRIS or Caché this month! We decided to support this and double the stakes. So! In December '18 you can get a second $25 digital Visa Gift Card for a Gartner review of Caché or InterSystems IRIS on the InterSystems Global Masters Advocacy Hub! See the rules below.
Step #1: To get the $25 Visa Card from Gartner Peer Insights, follow this unique link and submit a review. Make a screenshot for Step #2 so that we can see that you reviewed InterSystems IRIS or Caché. Note: The survey takes about 10-15 minutes. Gartner will authenticate the identity of the reviewer, but the published reviews are anonymous. You can check the status of your review and gift card in your Gartner Peer Insights reviewer profile at any time.
Step #2: To get the $25 Visa Card from InterSystems, complete a dedicated challenge on the InterSystems Global Masters Advocacy Hub — upload a screenshot from Step #1.
Don't forget: • This promotion is only for reviews entered in the month of December 2018. • InterSystems IRIS and Caché reviews only. • Use the unique link mentioned above in order to qualify for the gift cards.
Done? Awesome! Your card is on its way! To join Global Masters, leave a comment on the post and we'll send the invite! Hurry up to get your $100 from the December Caché and IRIS campaign from Gartner and InterSystems! ;) Only 12 days left! The recipe is the following: 1. You are a current customer of Caché and/or InterSystems IRIS. 2. Make the review using this link. 3. Get your $25 for a Caché or InterSystems IRIS review ($50 for both). 4. Save the screenshots of your reviews and submit them in Global Masters - get another $25 for each Caché and InterSystems IRIS review from Global Masters. 5. Merry Christmas and have a great new year 2019!
This is a good idea, and hopefully everyone will do this, but I did have a problem. Perhaps I have done this incorrectly, but I could not see a way to submit screenshots in the challenge, and when you click the "let's review" button, or whatever the actual text was, it closes it as completed and there seems to be no way to submit a screenshot. Also, the link to the challenge is for the same challenge number as the one it appears in, and it takes you to the Global Masters front page. Also, you don't seem able to review both as suggested: if you use the link again or search for the platform you haven't reviewed yet, it will simply state you have already submitted a review. I suspect this is because using the link you have to choose between IRIS or Caché, and so the offer is for one or the other but not both.
Hi David! Thanks for reporting this. Our support team will contact you via GM direct messaging.
Dear Community Members! Thank you so much for making reviews! You made InterSystems Data Platforms Caché and InterSystems IRIS a Gartner Customers' Choice 2019 in Operational Database Management Systems!
Announcement
Anastasia Dyubaylo · Dec 12, 2018

[December 20, 2018] Upcoming Webinar: Using Blockchain with InterSystems IRIS

Hi Community! We are pleased to invite you to the upcoming webinar "Using Blockchain with InterSystems IRIS" on the 20th of December at 10:00 (Moscow time)! Blockchain is a technology for distributed information storage with mechanisms to ensure its integrity. Blockchain is becoming more common in various areas, such as the financial sector, government agencies, healthcare, and others. InterSystems IRIS makes it easy to integrate with one of the most popular blockchain networks – Ethereum. At the webinar we will talk about what a blockchain is and how you can start using it in your business. We will also demonstrate the capabilities of the Ethereum adapter for creating applications that use the Ethereum blockchain. The following topics are planned: Introduction to Blockchain, Ethereum, Smart contracts in Ethereum, InterSystems IRIS adapter for Ethereum, Application example using the adapter. Presenter: @Nikolay.Soloviev Audience: The webinar is designed for developers. Note: The language of the webinar is Russian. We are waiting for you at our webinar! Register now! It is tomorrow! Don't miss it! Register here! And now this webinar recording is available in a dedicated Webinars in Russian playlist on InterSystems Developers YouTube: Enjoy it!
Question
Nikhil Pawaria · Jan 25, 2019

How to reduce the size of InterSystems Caché database file CACHE.DAT

How can we reduce the size of the CACHE.DAT file? Even after deleting the globals of a particular database from the Management Portal, the size of its CACHE.DAT file is not reduced. This is the way to do it, but make sure you are on a version where this won't cause problems. See: https://www.intersystems.com/support-learning/support/product-news-alerts/support-alert/alert-database-compaction/ https://www.intersystems.com/support-learning/support/product-news-alerts/support-alerts-2015/ https://www.intersystems.com/support-learning/support/product-news-alerts/support-alerts-2012/ You need to do these three steps in order: Compact Globals in a Database (optional), Compact a Database, Truncate a Database. It can be done via the ^DATABASE utility or in the Management Portal. CACHE.DAT or IRIS.DAT can only grow during normal work, but you can shrink it manually. It is not as easy as it may sound, and it depends on the version you use; the compact tool was only added in the past few versions. On very old versions you have to copy the data from the old database to a new one. You can read my articles about the internal structure of CACHE.DAT, just to know what is inside, and about the database with visualization, where you can see how to compact a database and how it actually works.
Article
Mark Bolinsky · Feb 12, 2019

InterSystems IRIS Example Reference Architectures for Amazon Web Services (AWS)

The Amazon Web Services (AWS) Cloud provides a broad set of infrastructure services, such as compute resources, storage options, and networking that are delivered as a utility: on-demand, available in seconds, with pay-as-you-go pricing. New services can be provisioned quickly, without upfront capital expense. This allows enterprises, start-ups, small and medium-sized businesses, and customers in the public sector to access the building blocks they need to respond quickly to changing business requirements. Updated: 10-Jan, 2023 The following overview and details are provided by Amazon and can be found here. Overview AWS Global Infrastructure The AWS Cloud infrastructure is built around Regions and Availability Zones (AZs). A Region is a physical location in the world where we have multiple AZs. AZs consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities. These AZs offer you the ability to operate production applications and databases that are more highly available, fault tolerant, and scalable than would be possible from a single data center. Details of AWS Global Infrastructure can be found here. AWS Security and Compliance Security in the cloud is much like security in your on-premises data centers—only without the costs of maintaining facilities and hardware. In the cloud, you don’t have to manage physical servers or storage devices. Instead, you use software-based security tools to monitor and protect the flow of information into and of out of your cloud resources. The AWS Cloud enables a shared responsibility model. While AWS manages security of the cloud, you are responsible for security in the cloud. This means that you retain control of the security you choose to implement to protect your own content, platform, applications, systems, and networks no differently than you would in an on-site data center. Details of AWS Cloud Security can be found here. The IT infrastructure that AWS provides to its customers is designed and managed in alignment with best security practices and a variety of IT security standards. A complete list of assurance programs with which AWS complies with can be found here. AWS Cloud Platform AWS consists of many cloud services that you can use in combinations tailored to your business or organizational needs. The following sub-section introduces the major AWS services by category that are commonly used with InterSystems IRIS deployments. There are many other services available and potentially useful for your specific application. Be sure to research those as needed. To access the services, you can use the AWS Management Console, the Command Line Interface, or Software Development Kits (SDKs). AWS Cloud Platform Component Details AWS Management Console Details of the AWS Management Console can be found here. AWS Command-line interface Details of the AWS Command Line Interface (CLI) can be found here. AWS Software Development Kits (SDK) Details of AWS Software Development Kits (SDK) can be found here. 
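As a small illustration of the SDK route mentioned above (not from the original document), the boto3 Python SDK can enumerate EC2 instances in a region. The region name is an assumption, and credentials are expected to come from the usual sources (environment variables, ~/.aws, or an instance role).

import boto3

# List EC2 instance IDs and their current state in one region
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_instances()
for reservation in response["Reservations"]:
    for instance in reservation["Instances"]:
        print(instance["InstanceId"], instance["State"]["Name"])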
AWS Compute There are numerous options available: Details of Amazon Elastic Cloud Computing (EC2) can be found here Details of Amazon EC2 Container Service (ECS) can be found here Details of Amazon EC2 Container Registry (ECR) can be found here Details of Amazon Auto Scaling can be found here AWS Storage There are numerous options available: Details of Amazon Elastic Block Store (EBS) can be found here Details of Amazon Simple Storage Service (S3) can be found here Details of Amazon Elastic File System (EFS) can be found here AWS Networking There are numerous options available. Details of Amazon Virtual Private Cloud (VPC) can be found here Details of Amazon Elastic IP Addresses can be found here Details of Amazon Elastic Network Interfaces can be found here Details of Amazon Enhanced Networking for Linux can be found here Details of Amazon Elastic Load Balancing (ELB) can be found here Details of Amazon Route 53 can be found here InterSystems IRIS Sample Architectures As part of this article, sample InterSystems IRIS deployments for AWS are provided as a starting point for your application specific deployment. These can be used as a guideline for numerous deployment possibilities. This reference architecture demonstrates highly robust deployment options starting with the smallest deployments to massively scalable workloads for both compute and data requirements. High availability and disaster recovery options are covered in this document along with other recommended system operations. It is expected these will be modified by the individual to support their organization’s standard practices and security policies. InterSystems is available for further discussions or questions of AWS-based InterSystems IRIS deployments for your specific application. Sample Reference Architectures The following sample architectures will provide several different configurations with increasing capacity and capabilities. Consider these examples of small development / production / large production / production with sharded cluster that show the progression from starting with a small modest configuration for development efforts and then growing to massively scalable solutions with proper high availability across zones and multi-region disaster recovery. In addition, an example architecture of using the new sharding capabilities of InterSystems IRIS Data Platform for hybrid workloads with massively parallel SQL query processing. Small Development Configuration In this example, a minimal configuration is used to illustrates a small development environment capable of supporting up to 10 developers and 100GB of data. More developers and stored data can easily be supported by simply changing the virtual machine instance type and increasing storage of the EBS volume(s) as appropriate. This is adequate to support development efforts and become familiar with InterSystems IRIS functionality along with Docker container building and orchestration if desired. High availability with database mirroring is typically not used with a small configuration, however it can be added at any time if high availability is needed. Small Configuration Sample Diagram The below sample diagram in Figure 2.1.1-a illustrates the table of resources in Figure 2.1.1-b. The gateways included are just examples, and can be adjusted accordingly to suit your organization’s standard network practices. Figure-2.1.1-a: Sample Small Development Architecture The following resources within the AWS VPC are provisioned as a minimum small configuration. 
AWS resources can be added or removed as required. Small Configuration AWS Resources Sample of Small Configuration AWS resources is provided below in the following table. Proper network security and firewall rules need to be considered to prevent unwanted access into the VPC. Amazon provides network security best practices for getting started which can be found here: https://docs.aws.amazon.com/vpc/index.html#lang/en_us https://docs.aws.amazon.com/quickstart/latest/vpc/architecture.html#best-practices Note: VM instances require a public IP address to reach AWS services. While this practice might raise some concerns, AWS recommends limiting the incoming traffic to these VM instances by using firewall rules. If your security policy requires truly internal VM instances, you will need to set up a NAT proxy manually on your network and a corresponding route so that the internal instances can reach the Internet. It is important to note that you cannot connect to a fully internal VM instance directly by using SSH. To connect to such internal machines, you must set up a bastion instance that has an external IP address and then tunnel through it. A bastion Host can be provisioned to provide the external facing point of entry into your VPC. Details of using a bastion hosts can he found here: https://aws.amazon.com/blogs/security/controlling-network-access-to-ec2-instances-using-a-bastion-server/ https://docs.aws.amazon.com/quickstart/latest/linux-bastion/architecture.html Production Configuration In this example, a more sizable configuration as an example production configuration that incorporates InterSystems IRIS database mirroring capability to support high availability and disaster recovery. Included in this configuration is a synchronous mirror pair of InterSystems IRIS database servers split between two availability zones within region-1 for automatic failover, and a third DR asynchronous mirror member in region-2 for disaster recovery in the unlikely event an entire AWS region is offline. Details of a multiple Region with Multi-VPC Connectivity can be found here. The InterSystems Arbiter and ICM server deployed in a separate third zone for added resiliency. The sample architecture also includes a set of optional load balanced web servers to support a web-enabled application. These web servers with the InterSystems Gateway can be scaled independently as needed. Production Configuration Sample Diagram The sample diagram in Figure 2.2.1-a illustrates the table of resources in Figure 2.2.1-b. The gateways included are just examples, and can be adjusted accordingly to suit your organization’s standard network practices. Figure 2.2.1-a: Sample Production Architecture with High Availability and Disaster Recovery The following resources within the AWS VPC are recommended as a minimum to support a production workload for a web application. AWS resources can be added or removed as required. Production Configuration AWS Resources Sample of Production Configuration AWS resources is provided below in the following table. Large Production Configuration In this example, a massively scaled configuration is provided by expanding on the InterSystems IRIS capability to also introduce application servers using InterSystems’ Enterprise Cache Protocol (ECP) to provide massive horizontal scaling of users. An even higher level of availability is included in this example because of ECP clients preserving session details even in the event of a database instance failover. 
Multiple AWS availability zones are used with both ECP-based application servers and database mirror members deployed in multiple regions. This configuration is capable of supporting tens of millions database accesses per second and multiple terabytes of data. Production Configuration Sample Diagram The sample diagram in Figure 2.3.1-a illustrates the table of resources in Figure 2.3.1-b. The gateways included are just examples, and can be adjusted accordingly to suit your organization’s standard network practices. Included in this configuration is a failover mirror pair, four or more ECP clients (application servers), and one or more web servers per application server. The failover database mirror pairs are split between two different AWS availability zones in the same region for fault domain protection with the InterSystems Arbiter and ICM server deployed in a separate third zone for added resiliency. Disaster recovery extends to a second AWS region and availability zone(s) similar to the earlier example. Multiple DR regions can be used with multiple DR Async mirror member targets if desired. Figure 2.3.1-a: Sample Large Production Architecture with ECP Application Servers The following resources within the AWS VPC Project are recommended as a minimum recommendation to support a sharded cluster. AWS resources can be added or removed as required. Large Production Configuration AWS Resources Sample of Large Production Configuration AWS resources is provided below in the following table. Production Configuration with InterSystems IRIS Sharded Cluster In this example, a horizontally scaled configuration for hybrid workloads with SQL is provided by including the new sharded cluster capabilities of InterSystems IRIS to provide massive horizontal scaling of SQL queries and tables across multiple systems. Details of InterSystems IRIS sharded cluster and its capabilities are discussed further in section 9 of this article. Production with Sharded Cluster Configuration Sample Diagram The sample diagram in Figure 2.4.1-a illustrates the table of resources in Figure 2.4.1-b. The gateways included are just examples, and can be adjusted accordingly to suit your organization’s standard network practices. Included in this configuration are four mirror pairs as the data nodes. Each of the failover database mirror pairs are split between two different AWS availability zones in the same region for fault domain protection with the InterSystems Arbiter and ICM server deployed in a separate third zone for added resiliency. This configuration allows for all the database access methods to be available from any data node in the cluster. The large SQL table(s) data is physically partitioned across all data nodes to allow for massive parallelization of both query processing and data volume. Combining all these capabilities provides the ability to support complex hybrid workloads such as large-scale analytical SQL querying with concurrent ingestion of new data, all within a single InterSystems IRIS Data Platform. Figure 2.4.1-a: Sample Production Configuration with Sharded Cluster with High Availability Note that in the above diagram and the “resource type” column in the table below, the term “EC2” is an AWS term representing an AWS virtual server instance as described further in section 3.1 of this document. It does not represent or imply the use of “compute nodes” in the cluster architecture described in chapter 9. 
The following resources within the AWS VPC are recommended as a minimum recommendation to support a sharded cluster. AWS resources can be added or removed as required. Production with Sharded Cluster Configuration AWS Resources Sample of Production with Sharded Cluster Configuration AWS resources is provided below in the following table. Introduction to Cloud Concepts Amazon Web Services (AWS) provides a feature rich cloud environment for Infrastructure-as-a-Service (IaaS) fully capable of supporting all of InterSystems products including support for container-based DevOps with the new InterSystems IRIS Data Platform. Care must be taken, as with any platform or deployment model, to ensure all aspects of an environment are considered such as performance, availability, system operations, high availability, disaster recovery, security controls, and other management procedures. This article will cover the three major components of all cloud deployments: Compute, Storage, and Networking. Compute Engines (Virtual Machines) Within AWS EC2 there are several options available for compute engine resources with numerous virtual CPU and memory specifications and associated storage options. One item to note within AWS EC2, references to the number of vCPUs in a given machine type equates to one vCPU is one hyper-thread on the physical host at the hypervisor layer. For the purposes of this document m5* and r5* EC2 instance types will be used and are most widely available in most AWS deployment regions. However, the use of other specialized instance types such as: x1* with very large memory are great options for very large working datasets keeping massive amounts of data cached in memory, or i3* with NVMe local instance storage. Details of the AWS Service Level Agreement (SLA) can be found here. Disk Storage The storage type most directly related to InterSystems products are the persistent disk types, however local storage may be used for high levels of performance if data availability restrictions are understood and accommodated. There are several other options such as S3 (buckets) and Elastic File Store (EFS), however those are more specific to an individual application’s requirements rather than supporting the operation of InterSystems IRIS Data Platform. Like most other cloud providers, AWS imposes limitations on the amount of persistent storage that can be associated to an individual compute engine. These limits include the maximum size of each disk, the number of persistent disks attached to each compute engine, and the amount of IOPS per persistent disk with an overall individual compute engine instance IOPS cap. In addition, there are imposed IOPS limits per GB of disk space, so at times provisioning more disk capacity is required to achieve desired IOPS rate. These limits may change over time and to be confirmed with AWS as appropriate. There are three types of persistent storage types for disk volumes: EBS gp2 (SSD), EBS st1 (HDD), and EBS io1 (SSD). The standard EBS gp2 disks are more suited for production workloads that require predictable low-latency IOPS and higher throughput. Standard Persistent disks are more an economical option for non-production development and test or archive type workloads. Details of the various disk types and limitations can be found here. 
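To make the IOPS-per-volume discussion concrete, here is a hedged boto3 sketch that provisions an EBS volume of the newer gp3 type (recommended in the updated storage overview later in this article) with explicit IOPS and throughput settings. The availability zone, size, and performance values are assumptions for illustration; check current AWS limits before relying on them.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
# gp3 decouples IOPS and throughput from volume size, unlike gp2
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=500,             # GiB
    VolumeType="gp3",
    Iops=6000,            # provisioned IOPS
    Throughput=500,       # MiB/s
    TagSpecifications=[{"ResourceType": "volume",
                        "Tags": [{"Key": "Name", "Value": "iris-db-disk-1"}]}],
)
print("created volume:", volume["VolumeId"])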
VPC Networking The virtual private cloud (VPC) network is highly recommended to support the various components of InterSystems IRIS Data Platform, along with providing proper network security controls, various gateways, routing, internal IP address assignments, network interface isolation, and access controls. An example VPC will be detailed in the examples provided within this document. Details of VPC networking and firewalls can be found here.
Virtual Private Cloud (VPC) Overview Details of AWS VPC are provided here. In most large cloud deployments, multiple VPCs are provisioned to isolate the various gateway types from application-centric VPCs and to leverage VPC peering for inbound and outbound communications. It is highly recommended to consult with your network administrator for details on allowable subnets and any organizational firewall rules of your company. VPC peering is not covered in this document. In the examples provided in this document, a single VPC with three subnets will be used to provide network isolation of the various components, with predictable latency and bandwidth and security isolation of the various InterSystems IRIS components.
Network Gateway and Subnet Definitions Two gateways are provided in the example in this document to support both Internet and secure VPN connectivity. Each ingress access is required to have appropriate firewall and routing rules to provide adequate security for the application. Details on how to use VPC Route Tables can be found here. Three subnets are used in the provided example architectures, dedicated for use with InterSystems IRIS Data Platform. The use of these separate network subnets and network interfaces allows for flexibility in security controls and bandwidth protection and monitoring for each of the three major components above. Details for creating virtual machine instances with multiple network interfaces can be found here. The subnets included in these examples: User Space Network for inbound connected users and queries; Shard Network for inter-shard communications between the shard nodes; Mirroring Network for high availability using synchronous replication and automatic failover of individual data nodes. Note: Failover synchronous database mirroring is only recommended between multiple zones which have low-latency interconnects within a single AWS region. Latency between regions is typically too high to provide a positive user experience, especially for deployments with a high rate of updates.
Internal Load Balancers Most IaaS cloud providers lack the ability to provide a Virtual IP (VIP) address that is typically used in automatic database failover designs. To address this, several of the most commonly used connectivity methods, specifically ECP clients and Web Gateways, have been enhanced within InterSystems IRIS to no longer rely on VIP capabilities, making them mirror-aware and automatic. Connectivity methods such as xDBC, direct TCP/IP sockets, or other direct connect protocols require the use of a VIP-like address. To support those inbound protocols, InterSystems database mirroring technology makes it possible to provide automatic failover for those connectivity methods within AWS using a health check status page called mirror_status.cxw: the load balancer polls this page and directs traffic only to the active primary mirror member, thus providing a complete and robust high availability design within AWS. Details of AWS Elastic Load Balancer (ELB) can be found here.
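Purely as an illustration of the health-check mechanism described above (not from the original document), the probe a load balancer performs against mirror_status.cxw can be sketched in a few lines of Python. The hostnames, port, and URL path below are assumptions; use the path configured for your Web Gateway.

import requests

def is_primary(host, port=80, timeout=2):
    # A member is treated as the active primary only if the status page answers HTTP 200
    url = f"http://{host}:{port}/csp/bin/mirror_status.cxw"  # path is an assumption
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False

for member in ["iris-mirror-a.internal", "iris-mirror-b.internal"]:
    print(member, "primary" if is_primary(member) else "not primary / unreachable")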
Figure 4.2-a: Automatic Failover without a Virtual IP Address

Details of using a load balancer to provide VIP-like functionality are provided here.

// Update 2023-01-10: There is a new recommended VIP model for AWS that is more robust and alleviates the need for a load balancer to provide VIP-like capabilities. Details can be found here.

Sample VPC Topology
Combining all the components together, the illustration in Figure 4.3-a demonstrates the layout of a VPC with the following characteristics:
Leverages multiple zones within a region for high availability
Provides two regions for disaster recovery
Utilizes multiple subnets for network segregation
Includes separate gateways for VPC peering, Internet, and VPN connectivity
Uses a cloud load balancer for IP failover for mirror members

Please note that in AWS each subnet must reside entirely within one availability zone and cannot span zones, so in the example below the network security and routing rules need to be properly defined. Details on AWS VPC subnets can be found here.

Figure 4.3-a: Example VPC Network Topology

Persistent Storage Overview
As discussed in the introduction, the use of AWS Elastic Block Store (EBS) volumes is recommended, specifically the EBS gp2 or the latest gp3 volume types. EBS gp3 volumes are recommended due to the higher read and write IOPS rates and the low latency required for transactional and analytical database workloads. Local SSDs may be used in certain circumstances; however, be aware that the performance gains of local SSDs come with certain trade-offs in availability, durability, and flexibility. Details of Local SSD data persistence, including when Local SSD data is preserved and when it is not, can be found here.

LVM PE Striping
Like other cloud providers, AWS imposes numerous limits on storage, including IOPS, capacity, and the number of devices per virtual machine instance. Consult the AWS documentation for current limits, which can be found here. With these limits, LVM striping becomes necessary to maximize IOPS beyond that of a single disk device for a database instance. In the example virtual machine instances provided, the following disk layouts are recommended. Performance limits associated with SSD persistent disks can be found here.

Note: There is currently a maximum of 40 EBS volumes per Linux EC2 instance, although AWS resource capabilities change often, so please consult the AWS documentation for current limitations.

Figure 5.1-a: Example LVM Volume Group Allocation

The benefit of LVM striping is that it spreads random IO workloads across more disk devices and their inherent disk queues. Below is an example of how to use LVM striping with Linux for the database volume group. This example uses four disks in an LVM PE stripe with a physical extent (PE) size of 4MB. Alternatively, larger PE sizes can be used if needed.
Step 1: Create standard or SSD persistent disks as needed.

Step 2: Verify that the IO scheduler is NOOP for each of the disk devices using "lsblk -do NAME,SCHED".

Step 3: Identify the disk devices using "lsblk -do KNAME,TYPE,SIZE,MODEL".

Step 4: Create the volume group with the new disk devices.
vgcreate -s 4M <vg name> <list of all disks just created>
example: vgcreate -s 4M vg_iris_db /dev/sd[h-k]

Step 5: Create the logical volume.
lvcreate -n <lv name> -L <size of LV> -i <number of disks in volume group> -I 4M <vg name>
example: lvcreate -n lv_irisdb01 -L 1000G -i 4 -I 4M vg_iris_db

Step 6: Create the file system.
mkfs.xfs -K <logical volume device>
example: mkfs.xfs -K /dev/vg_iris_db/lv_irisdb01

Step 7: Mount the file system.
Edit /etc/fstab with the following mount entry:
/dev/mapper/vg_iris_db-lv_irisdb01 /vol-iris/db xfs defaults 0 0
mount /vol-iris/db

Using the above table, each of the InterSystems IRIS servers will have the following configuration: two disks for SYS, four disks for DB, two disks for primary journals, and two disks for alternate journals.

Figure 5.1-b: InterSystems IRIS LVM Configuration

For growth, LVM allows devices and logical volumes to be expanded when needed without interruption. Consult the Linux documentation on best practices for ongoing management and expansion of LVM volumes.

Note: Enabling asynchronous IO for both the database and the write image journal files is highly recommended. See the community article for details on enabling it on Linux.

Provisioning
New with InterSystems IRIS is InterSystems Cloud Manager (ICM). ICM carries out many tasks and offers many options for provisioning InterSystems IRIS Data Platform. ICM is provided as a Docker image that includes everything needed for provisioning a robust AWS cloud-based solution. ICM currently supports provisioning on the following platforms:
Amazon Web Services including GovCloud (AWS / GovCloud)
Google Cloud Platform (GCP)
Microsoft Azure Resource Manager including Government (ARM / MAG)
VMware vSphere (ESXi)

ICM and Docker can run from either a desktop/laptop workstation or from a modest, centralized, dedicated "provisioning" server with a centralized repository. The role of ICM in the application lifecycle is Define -> Provision -> Deploy -> Manage. Details for installing and using ICM with Docker can be found here.

NOTE: The use of ICM is not required for any cloud deployment. The traditional method of installation and deployment with tar-ball distributions is fully supported and available. However, ICM is recommended for ease of provisioning and management in cloud deployments.

Container Monitoring
ICM includes two basic monitoring facilities for container-based deployments: Rancher and Weave Scope. Neither is deployed by default; they need to be specified in the defaults file using the Monitor field. Details for monitoring, orchestration, and scheduling with ICM can be found here. An overview of Rancher and its documentation can be found here. An overview of Weave Scope and its documentation can be found here.

High Availability
InterSystems database mirroring provides the highest level of availability in any cloud environment. AWS does not provide any availability guarantees for a single EC2 instance, so database mirroring is required for the database tier, which can also be coupled with load balancing and auto-scale groups. Earlier sections discussed how a cloud load balancer provides automatic IP address failover for a Virtual IP (VIP-like) capability with database mirroring.
The cloud load balancer uses the mirror_status.cxw health check status page mentioned earlier in the Internal Load Balancers section. There are two modes of database mirroring - synchronous with automatic failover, and asynchronous mirroring. In this example, synchronous failover mirroring will be covered. The details of mirroring can be found here.

The most basic mirroring configuration is a pair of failover mirror members in an arbiter-controlled configuration. The arbiter is placed in a third zone within the same region to protect against a potential availability zone outage impacting both the arbiter and one of the mirror members. There are many ways mirroring can be set up, specifically in the network configuration. In this example, we will use the network subnets defined previously in the Network Gateway and Subnet Definitions section of this document. Example IP address schemes will be provided in a following section; for the purposes of this section, only the network interfaces and designated subnets are depicted.

Figure 7-a: Sample mirror configuration with arbiter

Disaster Recovery
InterSystems database mirroring extends the high availability capability to also support disaster recovery in another AWS geographic region, providing operational resiliency in the unlikely event of an entire AWS region going offline. How an application is to endure such outages depends on the recovery time objective (RTO) and recovery point objective (RPO). These provide the initial framework for the analysis required to design a proper disaster recovery plan. The following link provides a guide to the items to be considered when developing a disaster recovery plan for your application: https://aws.amazon.com/disaster-recovery/

Asynchronous Database Mirroring
InterSystems IRIS Data Platform's database mirroring provides robust capabilities for asynchronously replicating data between AWS availability zones and regions to help support the RTO and RPO goals of your disaster recovery plan. Details of async mirror members can be found here. Similar to the earlier high availability section, a cloud load balancer provides automatic IP address failover for a Virtual IP (VIP-like) capability for DR asynchronous mirroring as well, using the same mirror_status.cxw health check status page mentioned earlier in the Internal Load Balancers section.

In this example, DR asynchronous mirroring will be covered, along with the introduction of the AWS Route53 DNS service, which provides upstream systems and client workstations with a single DNS address regardless of which availability zone or region your InterSystems IRIS deployment is operating in. Details of AWS Route53 can be found here.

Figure 8.1-a: Sample DR Asynchronous Mirroring with AWS Route53

In the above example, the IP addresses of both regions' Elastic Load Balancers (ELB) that front-end the InterSystems IRIS instances are provided to Route53, and it will direct traffic only to whichever mirror member is the active primary mirror, regardless of the availability zone or region in which it is located.

Sharded Cluster
InterSystems IRIS includes a comprehensive set of capabilities to scale your applications, which can be applied alone or in combination, depending on the nature of your workload and the specific performance challenges it faces.
One of these, sharding, partitions both data and its associated cache across a number of servers, providing flexible, inexpensive performance scaling for queries and data ingestion while maximizing infrastructure value through highly efficient resource utilization. An InterSystems IRIS sharded cluster can provide significant performance benefits for a wide variety of applications, but especially for those with workloads that include one or more of the following:
High-volume or high-speed data ingestion, or a combination of both.
Relatively large data sets, queries that return large amounts of data, or both.
Complex queries that do large amounts of data processing, such as those that scan a lot of data on disk or involve significant compute work.

Each of these factors on its own influences the potential gain from sharding, but the benefit may be enhanced where they combine. For example, a combination of all three factors - large amounts of data ingested quickly, large data sets, and complex queries that retrieve and process a lot of data - makes many of today's analytic workloads very good candidates for sharding. Note that these characteristics all have to do with data; the primary function of InterSystems IRIS sharding is to scale for data volume. However, a sharded cluster can also include features that scale for user volume, when workloads involving some or all of these data-related factors also experience a very high query volume from large numbers of users. Sharding can be combined with vertical scaling as well.

Operational Overview
The heart of the sharded architecture is the partitioning of data and its associated cache across a number of systems. A sharded cluster physically partitions large database tables horizontally - that is, by row - across multiple InterSystems IRIS instances, called data nodes, while allowing applications to transparently access these tables through any node and still see the whole dataset as one logical union. This architecture provides three advantages:

Parallel processing: Queries are run in parallel on the data nodes, with the results merged, combined, and returned to the application as full query results by the node the application connected to, significantly enhancing execution speed in many cases.

Partitioned caching: Each data node has its own cache, dedicated to the sharded table data partition it stores, rather than a single instance's cache serving the entire data set, which greatly reduces the risk of overflowing the cache and forcing performance-degrading disk reads.

Parallel loading: Data can be loaded onto the data nodes in parallel, reducing cache and disk contention between the ingestion workload and the query workload and improving the performance of both.

Details of InterSystems IRIS sharded clusters can be found here.

Elements of Sharding and Instance Types
A sharded cluster consists of at least one data node and, if needed for specific performance or workload requirements, an optional number of compute nodes. These two node types offer simple building blocks presenting a simple, transparent, and efficient scaling model.

Data Nodes
Data nodes store data. At the physical level, sharded table data[1] is spread across all data nodes in the cluster, while non-sharded table data is physically stored on the first data node only.
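For orientation, the following is a minimal sketch of how data nodes might be established from the command line, assuming the %SYSTEM.Cluster API described in the InterSystems documentation. The instance name "IRIS", hostnames, and port are hypothetical; in practice, cluster setup is often handled by ICM or other deployment tooling rather than by hand.

# On the first data node: initialize the sharded cluster (run in the %SYS namespace).
iris session IRIS -U %SYS << 'EOF'
 set sc = $SYSTEM.Cluster.Initialize()
 if 'sc { do $SYSTEM.Status.DisplayError(sc) }
 halt
EOF

# On each additional data node: attach to the cluster by referencing node 1
# (hypothetical hostname and superserver port).
iris session IRIS -U %SYS << 'EOF'
 set sc = $SYSTEM.Cluster.AttachAsDataNode("IRIS://datanode1:1972")
 if 'sc { do $SYSTEM.Status.DisplayError(sc) }
 halt
EOF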
The distinction between sharded and non-sharded data storage is transparent to the user, with the possible sole exception that the first node might have a slightly higher storage consumption than the others, but this difference is expected to become negligible as sharded table data would typically outweigh non-sharded table data by at least an order of magnitude.

Sharded table data can be rebalanced across the cluster when needed, typically after adding new data nodes. This moves "buckets" of data between nodes to approximate an even distribution of data.

At the logical level, non-sharded table data and the union of all sharded table data are visible from any node, so clients see the whole dataset regardless of which node they connect to. Metadata and code are also shared across all data nodes.

The basic architecture diagram for a sharded cluster simply consists of data nodes that appear uniform across the cluster. Client applications can connect to any node and will experience the data as if it were local.

Figure 9.2.1-a: Basic Sharded Cluster Diagram

[1] For convenience, the term "sharded table data" is used throughout the document to represent "extent" data for any data model supporting sharding that is marked as sharded. The terms "non-sharded table data" and "non-sharded data" are used to represent data that is in a shardable extent not marked as such, or for a data model that simply doesn't support sharding yet.

Compute Nodes
For advanced scenarios where low latencies are required, potentially at odds with a constant influx of data, compute nodes can be added to provide a transparent caching layer for servicing queries.

Compute nodes cache data. Each compute node is associated with a data node for which it caches the corresponding sharded table data; in addition, it also caches non-sharded table data as needed to satisfy queries.

Figure 9.2.2-a: Sharded Cluster with Compute Nodes

Because compute nodes don't physically store any data and are meant to support query execution, their hardware profile can be tailored to suit those needs, for example by emphasizing memory and CPU and keeping storage to the bare minimum. Ingestion is forwarded to the data nodes, either directly by the driver (xDBC, Spark) or implicitly by the sharding manager code when "bare" application code runs on a compute node.

Sharded Cluster Illustrations
There are various combinations of deploying a sharded cluster. The following high-level diagrams illustrate the most common deployment models. These diagrams do not include the networking gateways and details, in order to focus only on the sharded cluster components.

Basic Sharded Cluster
The following diagram shows the simplest sharded cluster, with four data nodes deployed in a single region and in a single zone. An AWS Elastic Load Balancer (ELB) is used to distribute client connections to any of the sharded cluster nodes.

Figure 9.3.1-a: Basic Sharded Cluster

In this basic model, there is no resiliency or high availability provided beyond what AWS provides for a single virtual machine and its attached SSD persistent storage. Two separate network interface adapters are recommended to provide both network security isolation for the inbound client connections and bandwidth isolation between the client traffic and the sharded cluster communications.

Basic Sharded Cluster with High Availability
The following diagram shows the simplest sharded cluster with four mirrored data nodes, deployed in a single region and splitting each node's mirror between zones.
An AWS Load Balancer is used to distribute client connections to any of the sharded cluster nodes. High availability is provided through the use of InterSystems database mirroring, which maintains a synchronously replicated mirror in a secondary zone within the region.

Three separate network interface adapters are recommended to provide network security isolation for the inbound client connections, as well as bandwidth isolation between the client traffic, the sharded cluster communications, and the synchronous mirror traffic between the node pairs.

Figure 9.3.2-a: Basic Sharded Cluster with High Availability

This deployment model also introduces the mirror arbiter, as described in an earlier section of this article.

Sharded Cluster with Separate Compute Nodes
The following diagram expands the sharded cluster for massive user/query concurrency with separate compute nodes and four data nodes. The cloud load balancer server pool contains only the addresses of the compute nodes. Updates and data ingestion continue to go directly to the data nodes, as before, to sustain ultra-low-latency performance and to avoid interference and congestion of resources between query/analytical workloads and real-time data ingestion.

With this model, the allocation of resources can be fine-tuned so that compute/query and ingestion scale independently, providing optimal resources where needed in a "just-in-time" fashion and maintaining an economical yet simple solution, rather than wasting resources unnecessarily just to scale compute or data.

Compute nodes lend themselves to a very straightforward use of AWS auto scale groups (aka Autoscaling), which allow for the automatic addition or removal of instances from a managed instance group based on increased or decreased load. Autoscaling works by adding more instances to your instance group when there is more load (upscaling) and removing instances when the need for them is lowered (downscaling). Details of AWS Autoscaling can be found here.

Figure 9.3.3-a: Sharded Cluster with Separate Compute and Data Nodes

Autoscaling helps cloud-based applications gracefully handle increases in traffic and reduces cost when the need for resources is lower. Simply define the policy and the auto-scaler performs automatic scaling based on the measured load.

Backup Operations
There are multiple options available for backup operations. The following three options are viable for your AWS deployment with InterSystems IRIS. The first two options, detailed below, incorporate a snapshot-type procedure, which involves suspending database writes to disk prior to creating the snapshot and then resuming updates once the snapshot is successful.

The following high-level steps are taken to create a clean backup using either of the snapshot methods (a minimal sketch of this sequence is shown below):
Pause writes to the database via the database External Freeze API call.
Create snapshots of the OS + data disks.
Resume database writes via the External Thaw API call.
The backup facility then archives to the backup location.

Details of the External Freeze/Thaw APIs can be found here.

Note: Sample scripts for backups are not included in this document; however, periodically check for examples posted to the InterSystems Developer Community: www.community.intersystems.com

The third option is InterSystems Online Backup. This is an entry-level approach for smaller deployments with a very simple use case and interface.
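As referenced above, a minimal sketch of the freeze/snapshot/thaw sequence is shown below. It is illustrative only: the instance name "IRIS", EBS volume ID, and description are hypothetical, and a production script should verify the return status of each step and handle failures before proceeding. Consult the InterSystems documentation for Backup.General.ExternalFreeze()/ExternalThaw() and the AWS documentation for EBS snapshots.

# 1. Freeze database writes (hypothetical instance name; run against the %SYS namespace).
iris session IRIS -U %SYS << 'EOF'
 set sc = ##class(Backup.General).ExternalFreeze()
 if 'sc { do $SYSTEM.Status.DisplayError(sc) }
 halt
EOF

# 2. Snapshot the data volume(s) while writes are frozen (hypothetical volume ID).
aws ec2 create-snapshot \
    --volume-id vol-0123456789abcdef0 \
    --description "IRIS database volume snapshot"

# 3. Resume database writes.
iris session IRIS -U %SYS << 'EOF'
 set sc = ##class(Backup.General).ExternalThaw()
 if 'sc { do $SYSTEM.Status.DisplayError(sc) }
 halt
EOF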
However, as databases increase in size, external backups with snapshot technology are recommended as a best practice, with advantages including the backup of external files, faster restore times, and an enterprise-wide view of data and management tools. Additional steps, such as integrity checks, can be added on a periodic interval to ensure a clean and consistent backup. The decision on which option to use depends on the operational requirements and policies of your organization. InterSystems is available to discuss the various options in more detail.

AWS Elastic Block Store (EBS) Snapshot Backup
Backup operations can be achieved using the AWS CLI command-line API along with the InterSystems ExternalFreeze/ExternalThaw API capabilities. This allows for true 24x7 operational resiliency and assurance of clean regular backups. Details for creating, managing, and automating AWS EBS snapshots can be found here.

Logical Volume Manager (LVM) Snapshots
Alternatively, many of the third-party backup tools available on the market can be used by deploying individual backup agents within the VM itself and leveraging file-level backups in conjunction with Logical Volume Manager (LVM) snapshots. One of the major benefits of this model is the ability to perform file-level restores of either Windows or Linux based VMs. A couple of points to note with this solution: since AWS and most other IaaS cloud providers do not provide tape media, all backup repositories are disk-based for short-term archiving, with the ability to leverage blob or bucket type low-cost storage for long-term retention (LTR). If using this method, it is highly recommended to use a backup product that supports de-duplication technologies to make the most efficient use of disk-based backup repositories. Some examples of these backup products with cloud support include, but are not limited to: Commvault, EMC Networker, HPE Data Protector, and Veritas NetBackup. InterSystems does not validate or endorse one product over another.

Online Backup
For small deployments, the built-in Online Backup facility is also a viable option. This InterSystems database online backup utility backs up data in database files by capturing all blocks in the databases and then writing the output to a sequential file. This proprietary backup mechanism is designed to cause no downtime for users of the production system. Details of Online Backup can be found here.

In AWS, after the online backup has finished, the backup output file and all other files in use by the system must be copied to some other storage location outside of that virtual machine instance. Bucket/object storage is a good destination for this. There are two options for using an AWS Simple Storage Service (S3) bucket:
Use the AWS CLI scripting APIs directly to copy and manipulate the newly created online backup (and other non-database) files. Details can be found here.
Mount an Elastic File Store (EFS) volume and use it similarly to a persistent disk at a low cost. Details of EFS can be found here.

@Mark.Bolinsky, you do insanely good work... thanks for this. Additionally, InterSystems IRIS and IRIS for Health are now available within the AWS Marketplace: https://aws.amazon.com/marketplace/seller-profile?id=6e5272fb-ecd1-4111-8691-e5e24229826f Thanks gentlemen for your documentation work. @Mark Bolinsky: It would be so perfect if you could share YAML templates to choose and deploy directly some of your examples, as done here 😉
Question
Evgeny Shvarov · Feb 12, 2019

How many namespaces and databases could be in one InterSystems IRIS installation?

Hi Community! What's the limit for Namespaces and Databases in one InterSystems IRIS installation? Yes, I checked the documentation but cannot find it at once. To my understanding, there is no technical limit. Though I believe I remember that it used to be ~16.000 some time in the past. Class SYS.Database maps to ^SYS("CONFIG","IRIS","Databases",<DBNAME>) and has NO limit there. Similarly, namespaces are stored in ^SYS("CONFIG","IRIS","Namespaces",<NSPCE>) and are covered by %SYS.Namespace. If there is any limit, it must be related to internal memory structures (gmheap?).