Article Niyaz Khafizov · Oct 8, 2018 16m read

Hi all. We are going to find duplicates in a dataset using Apache Spark Machine Learning algorithms.

Note: I have done the following on Ubuntu 18.04, Python 3.6.5, Zeppelin 0.8.0, Spark 2.1.1

Introduction

In previous articles we have done the following:

In this series of articles, we explore Machine Learning and record linkage.

0
1 778
Question Niyaz Khafizov · Sep 20, 2018

Hi all.

I want to insert my dataframe into InterSystems IRIS. So, I tried to do this:

df = spark.read.load("/home/imported-openssh-key/zeppelin-0.8.0-bin-all/bin/resultData3/DF.json", format="json")
df.write.format("com.intersystems.spark").\
option("url", "IRIS://localhost:51773/DEDUPL").\
option("user", "********").option("password", "********").\
option("dbtable", "try.test1").save()

And got this error:

Py4JJavaError: An error occurred while calling o80.save.
: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.
3
0 2156
Article Niyaz Khafizov · Aug 3, 2018 4m read

Hi all. Today we are going to install Jupyter Notebook and connect it to Apache Spark and InterSystems IRIS.

Note: I have done the following on Ubuntu 18.04,  Python 3.6.5.

Introduction

If you are looking for well-known, widely-spread and mainly popular among Python users notebook instead of Apache Zeppelin, you should choose Jupyter notebook. Jupyter notebook is a very powerful and great data science tool. it has a big community and a lot of additional software and integrations.

1
2 4702
Article Niyaz Khafizov · Jul 27, 2018 4m read

Hi all. Today we are going to upload a ML model into IRIS Manager and test it.

Note: I have done the following on Ubuntu 18.04, Apache Zeppelin 0.8.0, Python 3.6.5.

Introduction

These days many available different tools for Data Mining enable you to develop predictive models and analyze the data you have with unprecedented ease. InterSystems IRIS Data Platform provide a stable foundation for your big data and fast data applications, providing interoperability with modern DataMining tools. 

In this series of articles we explore Data mining capabilities available with InterSystems IRIS.

2
2 1545
Article Niyaz Khafizov · Jul 19, 2018 4m read

Hi all. Today we are going to use k-means algorithm on the Iris Dataset.

Note: I have done the following on Ubuntu 18.04, Apache Zeppelin 0.8.0, python 3.6.5.

Introduction

K-Means is one of the simplest unsupervised learning algorithms that solves the clustering problem. It groups all the objects in such a way that objects in the same group (group is a cluster) are more similar (in some sense) to each other than to those in other groups. For example, assume you have an image with a red ball on the green grass. K-Means will split all pixels into two clusters.

0
3 9821
Article Niyaz Khafizov · Jul 6, 2018 3m read

Hi all. Yesterday I tried to connect Apache Spark, Apache Zeppelin, and InterSystems IRIS. During the process, I experienced troubles connecting it all together and I did not find a useful guide. So, I decided to write my own.

Introduction

What is Apache Spark and Apache Zeppelin and find out how it works together. Apache Spark is an open-source cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. So, it is very useful when you need to work with Big Data.

1
1 1980