This is the second post of a series explaining how to create an end-to-end Machine Learning system.

Exploring Data

The InterSystems IRIS already has what we need to explore the data: an SQL Engine! For people who used to explore data in
csv or text files this could help to accelerate this step. Basically we explore all the data to understand the intersection
(joins) which should help to create a dataset prepared to be used by a machine learning algorithm.

1 0
1 361

Hi all. We are going to find duplicates in a dataset using Apache Spark Machine Learning algorithms.

Note: I have done the following on Ubuntu 18.04, Python 3.6.5, Zeppelin 0.8.0, Spark 2.1.1

Introduction

In previous articles we have done the following:

0 0
1 770

Apache Spark has rapidly become one of the most exciting technologies for big data analytics and machine learning. Spark is a general data processing engine created for use in clustered computing environments. Its heart is the Resilient Distributed Dataset (RDD) which represents a distributed, fault tolerant, collection of data that can be operated on in parallel across the nodes of a cluster. Spark is implemented using a combination of Java and Scala and so comes as a library that can run on any JVM.

11 5
1 2.8K

Hello!

My group and I are currently doing a research project on natural language processing and iKnow plays a big role in this project. I am aware that the algorithms iKnow use aren't public, and I respect that.

My question is, are there any public documents/research that explains, at least part of, the algorthims iKnow uses and the motivations for using them?

1 2
0 464