5  Linear Discriminant Analysis

(ns assignment.lda
  (:require
    [assignment.generate-data :refer [data]]
    [calc-metric.patch]
    [fastmath.stats :as stats]
    [scicloj.ml.core :as ml]
    [scicloj.ml.dataset :as ds]
    [scicloj.ml.metamorph :as mm]
    [utils.helpful-extracts :refer [eval-maps model->ds]]))
(def response :group)
(def regressors
  (remove #{response} (ds/column-names data)))

5.1 Build pipelines

5.1.1 Input transforming pipelines

In order for :smile.classification models to work, categorical data needs to be transformed to numeric data.

(def pipeline-fn
  (ml/pipeline
    (mm/categorical->number [response])
    (mm/set-inference-target response)))
(def pipeline-std-fn
  (ml/pipeline
    (mm/std-scale regressors {})
    (mm/categorical->number [response])
    (mm/set-inference-target response)))
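For intuition, the two input transforms can be sketched in plain Clojure. These are toy sketches, not the scicloj.ml internals; `std-scale-sketch` and `categorical->number-sketch` are hypothetical helper names:

```clojure
;; std-scale (sketch): center a column at 0 and scale to unit standard
;; deviation, using the sample standard deviation (n - 1 denominator).
(defn std-scale-sketch [xs]
  (let [n    (count xs)
        mean (/ (reduce + xs) n)
        var  (/ (reduce + (map #(let [d (- % mean)] (* d d)) xs)) (dec n))
        sd   (Math/sqrt var)]
    (mapv #(/ (- % mean) sd) xs)))

;; categorical->number (sketch): code each distinct label as an integer
;; and keep the lookup table so predictions can be decoded later.
(defn categorical->number-sketch [labels]
  (let [lookup (zipmap (distinct labels) (range))]
    {:coded        (mapv lookup labels)
     :lookup-table lookup}))

(std-scale-sketch [1.0 2.0 3.0])
;; => [-1.0 0.0 1.0]
(categorical->number-sketch ["log-normal" "normal" "gamma" "normal"])
;; => {:coded [0 1 2 1], :lookup-table {"log-normal" 0, "normal" 1, "gamma" 2}}
```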

5.1.2 Model building pipelines

(ml/hyperparameters
  :smile.classification/linear-discriminant-analysis)
nil

No hyperparameters.

(defn lda-piping-fn [pipeline]
  (ml/pipeline
    pipeline
    {:metamorph/id :model}
    (mm/model
      {:model-type :smile.classification/linear-discriminant-analysis})))

5.1.3 Input_Transform->Model_Building pipelines

(def lda-pipe-fn
  (lda-piping-fn pipeline-fn))
(def lda-std-pipe-fn
  (lda-piping-fn pipeline-std-fn))

5.1.3.1 View output of a fitted-pipeline

(-> data
    (ml/transform-pipe lda-std-pipe-fn
                       (ml/fit-pipe data lda-std-pipe-fn))
    :metamorph/data
    ds/shuffle
    ds/head)

:_unnamed [5 4]:

log-normal normal gamma :group
0.03088574 0.19141317 0.77770108 2.0
0.08256977 0.28728216 0.63014807 2.0
0.66053978 0.28541597 0.05404425 0.0
0.24198183 0.55287144 0.20514674 1.0
0.04956847 0.31746044 0.63297109 2.0

5.2 Partition data

(def train-test
  (ds/split->seq data :bootstrap {:repeats 30}))

The :bootstrap split takes a :repeats argument, equivalent to b, the number of bootstrap resamples. Its training-data proportion is determined by :ratio, whose default is 1. The test data is the out-of-bag data, which also includes the held-out 1 - ratio portion when :ratio is not 1.
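The out-of-bag mechanism can be sketched in plain Clojure: sample n row indices with replacement for training, and test on the rows that were never drawn. With :ratio 1, each repeat draws about 63.2% (1 - 1/e) of the distinct rows on average, leaving roughly 36.8% out-of-bag. (`bootstrap-split` is a hypothetical helper, not the library's implementation.)

```clojure
;; One bootstrap repeat, sketched: train on n indices sampled with
;; replacement; test on the out-of-bag (never-sampled) indices.
(defn bootstrap-split [n]
  (let [train (vec (repeatedly n #(rand-int n)))
        oob   (vec (remove (set train) (range n)))]
    {:train train :test oob}))

(let [{:keys [train test]} (bootstrap-split 1000)]
  {:train-size   (count train)                      ;; always n
   :oob-fraction (double (/ (count test) 1000))})   ;; ≈ 0.368 on average
```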

5.3 Evaluate pipes

(def evaluate-pipes
  (ml/evaluate-pipelines
    [lda-pipe-fn lda-std-pipe-fn]
    train-test
    stats/cohens-kappa
    :accuracy
    {:other-metrices            [{:name :accuracy
                                  :metric-fn ml/classification-accuracy}
                                 {:name :mathews-cor-coef
                                  :metric-fn stats/mcc}]
     :return-best-pipeline-only false}))

5.4 Extract models

(def models
  (->> evaluate-pipes
       flatten
       (map
         #(hash-map :summary (ml/thaw-model (get-in % [:fit-ctx :model]))
                    :fit-ctx (:fit-ctx %)
                    :timing-fit (:timing-fit %)
                    :metric ((comp :metric :test-transform) %)
                    :other-metrices ((comp :other-metrices :test-transform) %)
                    :params ((comp :options :model :fit-ctx) %)
                    :pipe-fn (:pipe-fn %)))
       (sort-by :metric)))

5.4.1 View model stats

(count models)
2
(-> models first :metric)
0.9018734442552077
(-> models first :other-metrices
    (->> (map #(select-keys % [:name :metric]))))
({:name :accuracy, :metric 0.9345794392523364}
 {:name :mathews-cor-coef, :metric 0.9027015644752848})
(-> models second :metric)
0.9018734442552077
(-> models second :other-metrices
    (->> (map #(select-keys % [:name :metric]))))
({:name :accuracy, :metric 0.9345794392523364}
 {:name :mathews-cor-coef, :metric 0.9027015644752848})

Two models with exactly the same statistics. In this particular case, then, standardizing our data was not required to improve the classification of groups. This is expected: LDA's discriminant scores are invariant to an invertible linear rescaling of the features, so scaling cannot change its predictions.

(-> models first :fit-ctx second)
[#uuid "3c689c6c-0e6f-46d8-8e97-faa192819922"
 {:fit-std-xform
  {:x1
   {:mean 3.0724771412482723, :standard-deviation 3.7718629995876034},
   :x2
   {:mean 0.05383571673872441, :standard-deviation 6.804812311672842}}}]

Notice that in our first model's :fit-ctx we have a :fit-std-xform (the stored standard-scale transform). That means this is our standardized pipeline. Might be interesting to keep this in mind for the next table.

(-> (model->ds (eval-maps models 2))
    (ds/rename-columns {:metric-1 :kappa                    ;TODO: extract from models
                        :metric-2 :accuracy
                        :metric-3 :mathews-cor-coef}))

_unnamed [2 5]:

:model-type :compute-time-ns :kappa :accuracy :mathews-cor-coef
:smile.classification/linear-discriminant-analysis 22704584 0.90187344 0.93457944 0.90270156
:smile.classification/linear-discriminant-analysis 10257939 0.90187344 0.93457944 0.90270156

These metrics are computed on the :test data, which is embedded in the partition data, train-test, and extracted into the variable models.

Everything’s the same except compute time.

5.5 Evaluations

Above we can see our models’ statistic on the test data. We might want to see how the best model fits on the full data.

(def predictions
  (-> data
      (ml/transform-pipe
        lda-pipe-fn
        (-> models first :fit-ctx))
      :metamorph/data
      :group
      vec))
(def actual
  (-> data
      (ml/fit-pipe lda-pipe-fn)
      :metamorph/data
      :group
      vec))

The actual variable looks like we are fitting a model; however, the code only runs our data through the input-transforming pipeline so as to get the appropriate mapping between the group category and its respective numerical coding.

(ml/confusion-map->ds (ml/confusion-map predictions actual :none))

_unnamed [4 4]:

:column-name 0 1 2
column-name 0 1 2
0 185 15 0.000
1 15 75 110
2 0.000 0.000 200
(-> models second :fit-ctx :model
    :target-categorical-maps :group :lookup-table)
{"log-normal" 0, "normal" 1, "gamma" 2}
(ml/classification-accuracy predictions actual)
0.7666666666666667
(stats/cohens-kappa predictions actual)
-0.49999999999999994
(stats/mcc predictions actual)
-0.5272904491779212

Woah! Something is wrong with the calculations. Let’s see the datatypes:

(type (first predictions))
java.lang.Double
(type (first actual))
java.lang.Long

One is type long, the other type double. These are not equal under =, which is why our kappa and mcc were so horribly low. Notice how different datatypes behave under equivalence and identity.
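Concretely, with doubles on one side and longs on the other, Clojure's = fails on every pair, so the agreement metrics see zero matches:

```clojure
;; = is type-sensitive for numbers, so no pair matches:
(map = [0.0 1.0 2.0] [0 1 2])
;; => (false false false)
```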

(= [1] `(1))
true
(identical? [1] `(1))
false

Vectors [] are equivalent to lists `() (both are sequential collections, so = compares them element by element), but they are not identical objects.

(= 1 1)
true
(= 1 1.0)
false
(identical? 1 1.0)
false

But long 1 and double 1 are neither equivalent nor identical.
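When only the numeric value matters and not the type, Clojure's == compares across numeric types:

```clojure
;; == is numeric-value equality across long/double:
(== 1 1.0)
;; => true
(== 2.0 2)
;; => true
```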

I will map the more precise type (double) onto the less granular type (long) to ensure we are calculating the stats properly.

(def predictions
  (mapv long predictions))
(ml/classification-accuracy predictions actual)
0.7666666666666667
(stats/cohens-kappa predictions actual)
0.65
(stats/mcc predictions actual)
0.6854775839312975
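For reference, Cohen's kappa is (po - pe) / (1 - pe): the observed agreement corrected by the agreement expected by chance from the marginal label frequencies alone. A minimal sketch on toy labels (`kappa-sketch` is a hypothetical helper, not fastmath's implementation):

```clojure
(defn kappa-sketch [preds actual]
  (let [n    (count preds)
        ;; observed agreement: fraction of matching pairs
        po   (/ (count (filter true? (map = preds actual))) n)
        ;; marginal frequency of each label
        freq (fn [xs] (into {} (map (fn [[k c]] [k (/ c n)])) (frequencies xs)))
        p    (freq preds)
        a    (freq actual)
        ;; chance agreement: sum over labels of the product of marginals
        pe   (reduce + (for [[k pa] a] (* (get p k 0) pa)))]
    (double (/ (- po pe) (- 1 pe)))))

(kappa-sketch [0 0 1 1] [0 1 1 1])
;; => 0.5  (po = 3/4, pe = 1/2)
```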

That is better. However, it is interesting to see that with an additional predictor, :x2, we are not getting a bump in performance based on kappa and mcc.

I want to see if it has anything to do with the fact that these categories are still coded numerically.

(def lookup-table
  (-> models second :fit-ctx :model
      :target-categorical-maps :group :lookup-table))
(def lookup-table-invert
  (clojure.set/map-invert lookup-table))
(def lda-predict
  (->> predictions
       (map #(get lookup-table-invert %))
       vec))
(peek lda-predict)
"log-normal"

Now we have predictions labeled with their respective string values. Let’s see how they compare to our original data :groups.

(ml/classification-accuracy lda-predict (:group data))
0.7666666666666667
(stats/cohens-kappa lda-predict (:group data))
0.65
(stats/mcc lda-predict (:group data))
0.6854775839312975

Still the same. It is interesting that the predictions are weaker with the additional variable. My guess is it comes down to the numbers: here we tested on the full data, whereas in the univariate LDA we tested on 20% of that data, which gives higher variability toward either good or poor metrics (apparently we got the good side in this assignment).

source: src/assignment/lda.clj