Article
· Sep 5, 2023 11m read

QuinielaML - Data Preparation and Predictive Model Management.

We continue with the series of articles related to the QuinielaML application. In this article we are going to cover how to prepare the raw data that we have captured using the Embedded Python functionality.

Welcome everybody!

Happy Artist GIF by VIRTUTE - Find & Share on GIPHY

Introduction

If you remember the previous article, we have captured the First and Second Division football results data for the last 23 years using Embedded Python from an external website. Well, now that we have the raw data we will have to transform and prepare it to help both the maintenance of the application and our prediction model to work with them.

QUINIELA_Object.RawMatch

Let's see what form the data that we have captured in our IRIS database has:

As you can see in the following screenshot, they do not differ much from the information present on the BDFutbol website:

 

Master tables:

In order to facilitate subsequent maintenance of common data as well as improve the performance of the prediction model, we have defined 2 master tables in which we will store the teams and referees. These tables will only consist of the column with the record identifier and the name. Let's take a look at both:

QUINIELA_Object.Referee:

Corresponding to the master table of referees.

QUINIELA_Object.Team

Corresponding to the master table of referees.

 

Data preparation

With our master tables and our raw data we can now begin the process of preparing the data to train our model. Here we have the Business Operation that is going to be in charge of the preparation:

Class QUINIELA.BO.PrepareBO Extends Ens.BusinessOperation
{

Parameter INVOCATION = "Queue";
Method PrepareData(pRequest As QUINIELA.Message.PrepareRequest, pResponse As QUINIELA.Message.PrepareResponse) As %Status
{
    Set sc = $$$OK
    set pResponse = ##class(QUINIELA.Message.PrepareResponse).%New()
    set pResponse.Operation = pRequest.Operation
    
    set sqlTruncateTrain = "TRUNCATE TABLE QUINIELA_Object.MatchTrain"
    set statementTruncateTrain = ##class(%SQL.Statement).%New()
    set statusTruncateTrain = statementTruncateTrain.%Prepare(sqlTruncateTrain)
    if ($$$ISOK(statusTruncateTrain)) {
        set resultSetTruncateTrain = statementTruncateTrain.%Execute()
        if (resultSetTruncateTrain.%SQLCODE = 0) {
            set sqlMatchTrain = "INSERT INTO QUINIELA_Object.MatchTrain (Day, Division, Journey, LocalTeam, Referee, Result, VisitorTeam, IntDay) "_
                "SELECT "_
                "TO_DATE(RM.Day,'DD/MM/YYYY') AS DayTransformed, "_
                "RM.Division, "_
                "RM.Journey, "_
                "LT.ID as LocalTeam, "_
                "R.ID as Referee, "_
                "CASE WHEN CAST(RM.GoalsLocal As INTEGER) > CAST(RM.GoalsVisitor As INTEGER) THEN 1 WHEN CAST(RM.GoalsLocal As INTEGER) < CAST(RM.GoalsVisitor As INTEGER) THEN 2 ELSE 0 END as Result, "_
                "VT.ID as VisitorTeam, "_
                "CAST({fn CONCAT({fn CONCAT(SUBSTR(RM.Day,7,4),SUBSTR(RM.Day,4,2))},SUBSTR(RM.Day,1,2))} As INTEGER) as IntDay "_
                "FROM "_
                "QUINIELA_Object.RawMatch RM "_
                "LEFT JOIN QUINIELA_Object.Team LT ON UPPER(RM.LocalTeam) = UPPER(LT.Name) "_
                "LEFT JOIN QUINIELA_Object.Team VT ON UPPER(RM.VisitorTeam) = UPPER(VT.Name) "_
                "LEFT JOIN QUINIELA_Object.Referee R ON UPPER(RM.Referee) = UPPER(R.Name)"
            set statementMatchTrain = ##class(%SQL.Statement).%New()
            set statusMatchTrain = statementMatchTrain.%Prepare(sqlMatchTrain)
            if ($$$ISOK(statusMatchTrain)) {
                set resultSetMatchTrain = statementMatchTrain.%Execute()
                if (resultSetMatchTrain.%SQLCODE = 0) {
                    set sqlUpdateLocalStreak = "UPDATE QUINIELA_Object.MatchTrain SET QUINIELA_Object.MatchTrain.LocalStreak = "_
                        "(SELECT SUM(CASE WHEN IsVictory = 1 THEN 4-%VID ELSE 0 END) FROM "_
                        "(SELECT TOP 3 SubMatch.IntDay, "_
                        "CASE WHEN Result = 1 THEN 1 ELSE 0 END AS IsVictory "_
                        "FROM QUINIELA_Object.MatchTrain AS SubMatch "_
                        "WHERE "_
                        "UPPER(SubMatch.LocalTeam) = UPPER(QUINIELA_Object.MatchTrain.LocalTeam) "_
                        "AND SubMatch.IntDay < QUINIELA_Object.MatchTrain.IntDay "_
                        "ORDER BY SubMatch.IntDay DESC)) "
                    set statementUpdateLocalStreak = ##class(%SQL.Statement).%New()
                    set statusUpdateLocalStreak = statementUpdateLocalStreak.%Prepare(sqlUpdateLocalStreak)
                    if ($$$ISOK(statusUpdateLocalStreak)) {
                        set resultSetUpdateLocalStreak = statementUpdateLocalStreak.%Execute()
                        if (resultSetUpdateLocalStreak.%SQLCODE = 0) {
                            set sqlUpdateVisitorStreak = "UPDATE QUINIELA_Object.MatchTrain SET QUINIELA_Object.MatchTrain.VisitorStreak = "_
                                "(SELECT SUM(CASE WHEN IsVictory = 1 THEN 4-%VID ELSE 0 END) FROM "_
                                "(SELECT TOP 3 SubMatch.IntDay, "_
                                "CASE WHEN Result = 2 THEN 1 ELSE 0 END AS IsVictory "_
                                "FROM QUINIELA_Object.MatchTrain AS SubMatch "_
                                "WHERE "_
                                "UPPER(SubMatch.VisitorTeam) = UPPER(QUINIELA_Object.MatchTrain.VisitorTeam) "_
                                "AND SubMatch.IntDay < QUINIELA_Object.MatchTrain.IntDay "_
                                "ORDER BY SubMatch.IntDay DESC)) "
                            set statementUpdateVisitorStreak = ##class(%SQL.Statement).%New()
                            set statusUpdateVisitorStreak = statementUpdateVisitorStreak.%Prepare(sqlUpdateVisitorStreak)
                            if ($$$ISOK(statusUpdateVisitorStreak)) {
                                set resultSetUpdateVisitorStreak = statementUpdateVisitorStreak.%Execute()
                                set sc = statusUpdateVisitorStreak
                            }
                            else {
                                set sc = statusUpdateVisitorStreak
                            }
                        }
                    }
                    else {
                        set sc = statusUpdateLocalStreak
                    }
                }
            }
            else {
                set sc = statusMatchTrain
            }
        }
    }
    
    set pResponse.Status = "Finished"
    Return sc
}

XData MessageMap
{
<MapItems>
  <MapItem MessageType="QUINIELA.Message.PrepareRequest">
    <Method>PrepareData</Method>
  </MapItem>
</MapItems>
}

}

Let's now look in detail at each of the SQL statements we are running:

  1. We drop the training data table:
    TRUNCATE TABLE QUINIELA_Object.MatchTrain
  2. We launch a massive insert into our training table QUINIELA_Object.MatchTrain
    INSERT INTO QUINIELA_Object.MatchTrain (Day, Division, Journey, LocalTeam, Referee, Result, VisitorTeam, IntDay) 
        SELECT
            TO_DATE(RM.Day,'DD/MM/YYYY') AS DayTransformed, 
            RM.Division, 
            RM.Journey, 
            LT.ID as LocalTeam, 
            R.ID as Referee, 
            CASE WHEN CAST(RM.GoalsLocal As INTEGER) > CAST(RM.GoalsVisitor As INTEGER) THEN 1 WHEN CAST(RM.GoalsLocal As INTEGER) < CAST(RM.GoalsVisitor As INTEGER) THEN 2 ELSE 0 END as Result, 
            VT.ID as VisitorTeam, 
            CAST({fn CONCAT({fn CONCAT(SUBSTR(RM.Day,7,4),SUBSTR(RM.Day,4,2))},SUBSTR(RM.Day,1,2))} As INTEGER) as IntDay 
        FROM
            QUINIELA_Object.RawMatch RM 
            LEFT JOIN QUINIELA_Object.Team LT ON UPPER(RM.LocalTeam) = UPPER(LT.Name) 
            LEFT JOIN QUINIELA_Object.Team VT ON UPPER(RM.VisitorTeam) = UPPER(VT.Name) 
            LEFT JOIN QUINIELA_Object.Referee R ON UPPER(RM.Referee) = UPPER(R.Name)
    To do this, we replace the literal with the name of the referee and the teams with their reference in the master tables. We also obtain the result from the goals of the match, 0 for the draw, 1 for the home victory and 2 for the visitor. This Result column will be the one that defines our prediction model as a classification model, that is, each match will belong to one of our 3 classes (1, X or 2).  
  3. We calculate the streak for each team depending on whether they play at home or away. We have added these columns to improve, as much as possible, the performance of the predictive model. We have assumed that a team that has several consecutive victories at home or away is easier to continue with the victories by being "on a streak." The calculation is done in the following way, the last 3 games are obtained (at home for the team that plays at home or away for the one that plays as a visitor), if it has won the last game it is assigned 3 points, if it has won in second to last is assigned 2 and if the second to last has won it is assigned 1, finally the points obtained are added giving a numerical value to the streak. Home streak:
    UPDATE QUINIELA_Object.MatchTrain SET QUINIELA_Object.MatchTrain.LocalStreak = 
        (SELECT SUM(CASE WHEN IsVictory = 1 THEN 4-%VID ELSE 0 END) FROM
            (SELECT TOP 3 SubMatch.IntDay, 
                CASE WHEN Result = 1 THEN 1 ELSE 0 END AS IsVictory 
            FROM QUINIELA_Object.MatchTrain AS SubMatch 
            WHERE
                UPPER(SubMatch.LocalTeam) = UPPER(QUINIELA_Object.MatchTrain.LocalTeam) 
                AND SubMatch.IntDay < QUINIELA_Object.MatchTrain.IntDay 
            ORDER BY SubMatch.IntDay DESC
            )
        )
    Visitor streak:
    UPDATE QUINIELA_Object.MatchTrain SET QUINIELA_Object.MatchTrain.VisitorStreak = 
        (SELECT SUM(CASE WHEN IsVictory = 1 THEN 4-%VID ELSE 0 END) 
        FROM
            (SELECT TOP 3 SubMatch.IntDay, 
                CASE WHEN Result = 2 THEN 1 ELSE 0 END AS IsVictory 
            FROM QUINIELA_Object.MatchTrain AS SubMatch 
            WHERE
                UPPER(SubMatch.VisitorTeam) = UPPER(QUINIELA_Object.MatchTrain.VisitorTeam) 
                AND SubMatch.IntDay < QUINIELA_Object.MatchTrain.IntDay 
                ORDER BY SubMatch.IntDay DESC
            )
        )
     

Let's see what the result of this set of insertions and updates consulting the table QUINIELA_Object.MatchTrain:

As you can see we have transformed the text fields to numerical values...or we haven't? Let's look at the class definition:

Class QUINIELA.Object.MatchTrain Extends (%Persistent, %JSON.Adaptor) [ DdlAllowed ]
{

/// Day of the match
Property Day As %Date;
/// Local Team
Property LocalTeam As %String;
/// Visitor Team
Property VisitorTeam As %String;
/// Local Streak
Property LocalStreak As %Integer;
/// Visitor Streak
Property VisitorStreak As %Integer;
/// Referee
Property Referee As %String;
/// Result
Property Result As %String;
/// Division
Property Division As %String;
/// Journey
Property Journey As %String;
/// Integer day
Property IntDay As %Integer;
}

As you can see, the references to the master tables continue to be of type %String. What is the reason for this? Well, you can find the explanation on this page of the documentation, but in summary it is because, although they are really numerical, they do not correspond to quantifiable values, but to identifiers.

Perfect, we now have everything to create and train our predictive model.

 

Creation and training of the prediction model

Thanks to the functionality of IntegratedML this step is extremely simple for us, since we will only need to execute 2 simple commands in our database. Let's take a look at the Business Operation that we have created for this:

Class QUINIELA.BO.TrainBO Extends Ens.BusinessOperation
{

Parameter INVOCATION = "Queue";
/// Description
Method CreateAndTrainModel(pRequest As QUINIELA.Message.TrainRequest, pResponse As QUINIELA.Message.TrainResponse) As %Status
{
        Set tSC = $$$OK
        set pResponse = ##class(QUINIELA.Message.TrainResponse).%New()
        set pResponse.Operation = pRequest.Operation
        set pResponse.Status = "In Process"
        set sql = "SELECT MODEL_NAME FROM INFORMATION_SCHEMA.ML_MODELS WHERE MODEL_NAME = 'QuinielaModel'"
        set statement = ##class(%SQL.Statement).%New()
        set status = statement.%Prepare(sql)
        $$$TRACE(status)
        if ($$$ISOK(status)) {
            set resultSet = statement.%Execute()
            $$$TRACE(resultSet.%SQLCODE)
            if (resultSet.%SQLCODE = 0) {
                while (resultSet.%Next() '= 0) {
                    if (resultSet.%GetData(1) '= "") {
                        set sqlDrop = "DROP MODEL QuinielaModel"
                        set statementDrop = ##class(%SQL.Statement).%New()
                        set statusDrop = statementDrop.%Prepare(sqlDrop)
                        if ($$$ISOK(statusDrop)) {
                            set resultSetDrop = statementDrop.%Execute()
                            if (resultSetDrop.%SQLCODE = 0) {
                                set tSC = statusDrop                                                                
                            }
                        }
                    }
                }
            }            
        }
        $$$TRACE("Creating model")
        set sqlCreate = "CREATE MODEL QuinielaModel PREDICTING (Result) FROM QUINIELA_Object.MatchTrain"
        set statementCreate = ##class(%SQL.Statement).%New()
        set statusCreate = statementCreate.%Prepare(sqlCreate)
        if ($$$ISOK(statusCreate)) {
            $$$TRACE("Model created")
            set resultSetCreate = statementCreate.%Execute()
            if (resultSetCreate.%SQLCODE = 0) {
                set tSC = statusCreate                                
            }
        }
        else
        {
            set tSC = statusDrop
        }

        $$$TRACE("Training model")
        set sqlTrain = "TRAIN MODEL QuinielaModel"
        set statementTrain = ##class(%SQL.Statement).%New()
        set statusTrain = statementTrain.%Prepare(sqlTrain)
        if ($$$ISOK(statusTrain)) {
            set resultSetTrain = statementTrain.%Execute()
            if (resultSetTrain.%SQLCODE = 0) {
                // VALIDATION OF THE MODEL WITH THE PRE-LOADED MATCHES
                set sqlValidate = "VALIDATE MODEL QuinielaModel FROM QUINIELA_Object.MatchTrain"
                set statementValidate = ##class(%SQL.Statement).%New()
                set statusValidate = statementValidate.%Prepare(sqlValidate)
                set resultSetValidate = statementValidate.%Execute()
                set tSC = statusValidate                                    
            }
        }
        else {
            set tSC = statusTrain
        }
        
        set pResponse.Status = "Finished"
        Return tSC
}

XData MessageMap
{
<MapItems>
  <MapItem MessageType="QUINIELA.Message.TrainRequest">
    <Method>CreateAndTrainModel</Method>
  </MapItem>
</MapItems>
}

}

Let's analyze what our BO is doing:

  1. Creation of the predictive model and definition of the value to predict indicating the column of our corresponding training table.
    CREATE MODEL QuinielaModel PREDICTING (Result) FROM QUINIELA_Object.MatchTrain
  2. Training our newly created model.
    TRAIN MODEL QuinielaModel
  3. Validation of the model created on the training table used.
    VALIDATE MODEL QuinielaModel FROM QUINIELA_Object.MatchTrain

With these three very simple steps we would have our model ready to start generating predictions. Let's take a look at the quality of our model, to do this we will execute the following query:

SELECT * FROM INFORMATION_SCHEMA.ML_VALIDATION_METRICS

With this query we will obtain the following metrics:

Well, not so bad, for the local victory they are correct 52% of the time, for the visiting victory 41% and for the draws 37%, we are above the 33% success rate due to pure chance!

Boxing Memes on X: "Golovkin's fellow country man reacts to his victory...  #GolovkinGeale http://t.co/MDW6F5eJlz" / X

Discussion (1)1
Log in or sign up to continue