Hi all. In this article we are going to find duplicates in a dataset using Apache Spark's machine learning algorithms.
Note: I ran everything below on Ubuntu 18.04 with Python 3.6.5, Zeppelin 0.8.0, and Spark 2.1.1.
Introduction
In previous articles we have done the following: