4  Univariate LDA from Scratch

(ns assignment.lda-univariate
  (:require
    [assignment.generate-data :refer [data]]
    [fastmath.stats :as stats]
    [scicloj.ml.core :as ml]
    [scicloj.ml.dataset :as ds]
    [tech.v3.datatype.functional :as dfn]))

4.1 Split data

Use :holdout for a simple single train/test split of the data.

(def train-test
  (ds/split->seq (ds/drop-columns data :x2)
                 :holdout {:ratio [0.8 0.2] :seed 0}))
(def training (:train (first train-test)))
(def testing (:test (first train-test)))
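
As a quick sanity check, the 0.8/0.2 ratio should leave four times as many training rows as test rows. (The counts shown assume the generated dataset has 600 rows, which the confusion-matrix totals below bear out.)

(map ds/row-count [training testing])
;; => (480 120)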

4.2 Create dataset with key data

Our key data are the mean, pooled variance, and prior probability (called :prior-prob) for each group.

(def pooled-variance
  (-> training
      (ds/group-by [:group]
                   {:result-type :as-map})
      vals
      (->> (map :x1))
      stats/pooled-variance))
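
stats/pooled-variance combines the per-group sample variances weighted by their degrees of freedom, i.e. s²_pooled = Σₖ(nₖ − 1)·s²ₖ / Σₖ(nₖ − 1), assuming fastmath's default unbiased method.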
(def grouped-data
  (-> training
      (ds/group-by [:group])
      (ds/aggregate {:count #(count (% :x1))
                     :mean  #(dfn/mean (% :x1))})
      ;; a per-group :variance #(stats/variance (% :x1)) could be added here,
      ;; but we use the pooled variance instead
      (ds/add-column :pooled-variance pooled-variance)
      (ds/map-columns :prior-prob [:count] #(dfn// % (ds/row-count training)))
      (ds/select-columns #{:group :mean :pooled-variance :prior-prob})))

4.2.1 View grouped data

grouped-data

_unnamed [3 4]:

|     :group |       :mean | :pooled-variance | :prior-prob |
|------------|-------------|------------------|-------------|
|     normal |  3.17443507 |       3.91103088 |      0.3271 |
| log-normal | -0.94902020 |       3.91103088 |      0.3479 |
|      gamma |  7.18921057 |       3.91103088 |      0.3250 |

4.3 Create our key function

To classify our data, we need to compare discriminant scores. In univariate linear discriminant analysis the calculation is much simpler than in the multivariate case, which involves matrices.
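
Concretely, for a class k with mean μₖ, pooled variance σ², and prior probability πₖ, the score of a point x is

δₖ(x) = x·μₖ/σ² − μₖ²/(2σ²) + ln πₖ

and we assign x to the class with the largest δₖ(x). The function below computes exactly this quantity.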

(defn discriminant-score
  "Linear discriminant score of x for a class with mean mu,
   pooled variance var, and prior probability pi."
  [x mu var pi]
  (+ (- (* x (/ mu var))
        (/ (Math/pow mu 2) (* 2 var)))
     (Math/log pi)))
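
For example, plugging in the normal group's fitted values from grouped-data above, we can spot-check the score at x = 3 at the REPL:

(discriminant-score 3.0 3.17443507 3.91103088 0.3271)
;; => ~0.029 (= 2.435 − 1.288 − 1.117)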

4.4 Implement labeling

This is our classifying function.

(defn map-predict
  "Label each value in dat with the group whose discriminant score is highest."
  [dat]
  (-> (map
        (fn [data-point]
          (-> grouped-data
              ;; score data-point against each group's parameters
              (ds/map-columns
                :predict
                (ds/column-names grouped-data #{:mean :pooled-variance :prior-prob})
                (fn [mu var pi]
                  (discriminant-score data-point mu var pi)))
              ;; take the :group of the highest-scoring row
              (ds/order-by :predict :desc)
              (ds/select :group 0)
              :group
              vec))
        dat)
      ds/dataset
      (ds/rename-columns {0 :predict})))

4.4.1 How does map-predict work?

Given a sequence of numbers, we find the highest discriminant score for each value and label that data point with the corresponding group. For example, let's sniff-test our expectations: data near 0 should be labeled log-normal, data near 3 normal, and data near 7 gamma.

grouped-data

_unnamed [3 4]:

|     :group |       :mean | :pooled-variance | :prior-prob |
|------------|-------------|------------------|-------------|
|     normal |  3.17443507 |       3.91103088 |      0.3271 |
| log-normal | -0.94902020 |       3.91103088 |      0.3479 |
|      gamma |  7.18921057 |       3.91103088 |      0.3250 |

(map-predict [0 1 2 3 4 5 6 7])

_unnamed [8 1]:

|   :predict |
|------------|
| log-normal |
| log-normal |
|     normal |
|     normal |
|     normal |
|     normal |
|      gamma |
|      gamma |

The results seem close to expectations.
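
We can also locate the decision boundaries directly: setting two groups' scores equal and solving for x gives their crossover point. The helper below is a hypothetical addition (not part of the assignment code), shown with the fitted values from grouped-data:

(defn boundary
  "Hypothetical helper: the x where two groups' discriminant scores coincide."
  [mu1 pi1 mu2 pi2 var]
  (/ (+ (/ (- (Math/pow mu1 2) (Math/pow mu2 2)) (* 2 var))
        (Math/log (/ pi2 pi1)))
     (/ (- mu1 mu2) var)))

(boundary -0.94902020 0.3479 3.17443507 0.3271 3.91103088)
;; => ~1.17, consistent with 1 labeled log-normal and 2 labeled normal above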

4.5 Predictions vs Actual

4.5.1 Training

(def pred-train
  (let [dat (vec (:x1 training))] ; `dat`, to avoid shadowing the imported `data`
    (vec (:predict (map-predict dat)))))
(def actual-train
  (vec (:group training)))
(ml/confusion-map->ds (ml/confusion-map pred-train actual-train :none))

_unnamed [4 4]:

| :column-name | gamma | log-normal | normal |
|--------------|-------|------------|--------|
|  column-name | gamma | log-normal | normal |
|        gamma |   135 |      0.000 |     21 |
|   log-normal | 0.000 |        141 |     26 |
|       normal |    25 |         22 |    110 |

Confusion matrices built with Clojure's machine-learning library (scicloj), with arguments given in the prescribed order (predicted-labels, then [true-]labels), list the TRUE classes along the rows (read horizontally); the columns represent the predicted values.
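
As a quick check on this reading, accuracy is the diagonal total over all 480 training rows:

(/ (+ 135 141 110) 480.0)
;; => ~0.8042, matching classification-accuracy below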

(ml/classification-accuracy pred-train actual-train)
0.8041666666666667
(stats/cohens-kappa pred-train actual-train)
0.7061926157452628
(stats/mcc pred-train actual-train)
0.7062662112546088
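
Cohen's kappa corrects this accuracy for chance agreement, κ = (p₀ − pₑ)/(1 − pₑ), where pₑ comes from the matrix's row totals (156, 167, 157) and column totals (160, 163, 157). A hand computation reproduces the library's value:

(let [po (/ 386 480.0)
      pe (/ (+ (* 156 160) (* 167 163) (* 157 157))
            (* 480.0 480))]
  (/ (- po pe) (- 1 pe)))
;; => ~0.7062, matching cohens-kappa above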

4.5.2 Test

(def pred-test
  (let [dat (vec (:x1 testing))]
    (vec (:predict (map-predict dat)))))
(def actual-test
  (vec (:group testing)))
(ml/confusion-map->ds (ml/confusion-map pred-test actual-test :none))

_unnamed [4 4]:

| :column-name | gamma | log-normal | normal |
|--------------|-------|------------|--------|
|  column-name | gamma | log-normal | normal |
|        gamma |    33 |      0.000 |     11 |
|   log-normal | 0.000 |         29 |      4 |
|       normal |     5 |          6 |     32 |

(ml/classification-accuracy pred-test actual-test)
0.7833333333333333
(stats/cohens-kappa pred-test actual-test)
0.6733668341708543
(stats/mcc pred-test actual-test)
0.6753465079609335

Performance on the test set is lower than on training. This is not unexpected; in fact, it's quite common. Test statistics may exceed training statistics in cases where 1) the data are well behaved (no outliers), 2) the model is well matched to the data (e.g., a linear relationship modeled linearly), and/or 3) the data set is small, allowing higher variation in these statistics.

As for this model, a Cohen's kappa and Matthews correlation coefficient above 0.6 are encouraging, if only because, as you may recall from our :x1 distributions by group (the final plot in our "Visualize Data" section), there is heavy overlap between the three groups.

source: src/assignment/lda_univariate.clj