Fall2024CS584 · clara-aparicio · Oct 11, 2024 · Oct 11, 2024 · Oct 11, 2024
diff --git a/README.md b/README.md
@@ -1,8 +1,129 @@
-# Project 1 
+#### Authors
+- Clara Aparicio Mendez (A20599326)
+- Juan Cantarero Angulo (A20598593)
+- Raquel Gimenez Pascual (A20599725)
+- Carlota Ruiz de Conejo de la Sen (A20600262)
+# Regularized Discriminant Analysis (RDA) Model
 
-Put your README here. Answer the following questions.
+## Introduction
 
-* What does the model you have implemented do and when should it be used?
-* How did you test your model to determine if it is working reasonably correctly?
-* What parameters have you exposed to users of your implementation in order to tune performance? (Also perhaps provide some basic usage examples.)
-* Are there specific inputs that your implementation has trouble with? Given more time, could you work around these or is it fundamental?
+ - **Description**: The model developed is a classifer based on the Regularized Discriminant Analysis (RDA). The RDA algorithm is a compromise between the Linear Discriminant Analysis (LDA) and the Quadratic Discriminant Analysis (QDA) that introduces a regularization term to improve stability and performance by shrinking the covariance matrices. This flexibility allows RDA to smoothly transition between LDA and QDA, depending on the level of regularization applied. 
+
+ - **Use**: RDA is particularly useful in multiclass classification, especifically when dealing with high-dimensional data, where many features are potentially correlated. RDA helps prevent overfitting and provides a balance between LDA's simple linear boundaries and QDA's quadratic ones, making it more adaptable to different data distributions.
+
+## Model Testing
+
+- **Testing methods**: To test the model, we have generated synthetic data testing with different possible dimentions. Each class is represented by a multivariate normal distribution with a defined number of samples. Additionally, a visualization function was implemented to observe the distribution of the generated data. This allows us to visually inspect the model's performance and the impact of the regularization parameters. Furthermore, we have corroborate the performance of our model implementing a confusion matrix, where we can visualize the results of the model. Thanks to this approach our model has an accuracy higher thatn 99%.
+
+## Parameter for Tuning Performance
+
+To control the balance between LDA and QDA, the parameter &alpha; must be adjusted:
+
+- When **&alpha; = 0**, the model behaves like LDA, using a single shared covariance matrix across all classes. 
+- When **&alpha; = 1**, the model behaves like QDA, with each class having its own distinct covariance matrix.
+
+To better understand the regularization process used in the model, the following formula is applied:
+
+$$
+\hat{\Sigma}_k(\alpha) = \alpha \hat{\Sigma}_k + (1 - \alpha) \hat{\Sigma}
+$$
+
+Where:
+<ul>
+  <li>&#x3A3;<sub>k</sub> is the covariance matrix for class <em>k</em>.</li>
+  <li>&#x3A3; is the pooled covariance matrix, shared across all classes.</li>
+  </ul>  
+
+## Structure
+
+The project is divided into three main parts. 
+
+- **Data Generation** : The function *generate_classifier_data()* is implemented for creating synthetic data with specific centroids and covariance matrices. This function allows users to test the RDA model by simulating various data distributions.
+
+- **RDA Model**:
+
+  - *Class RDAClassifier()*: 
+
+    - **Fitting**: the method *fit()* estimates the class-specific means and covariances to obtain both, the shared covariance matrix and the specifics, afterwards it applies regularization. 
+
+  - *Class RDAClassifierResult()*:
+
+    - **Log-Likelihood**: the method *_log_loss()* calculates the log-likelihood for each data point belonging to a given class based on the regularized covariance matrices to quantify how likely each point is to belong to a given class. The lowest is the log-loss the higher is the probability of belonging to a class 
+
+    - **Prediction**: the method *predict()* calculates the probability of each data point belonging to all classes based on the class-specific and regularized covariance matrices. It returns the class with the highest score for each point and is used for visualizing predictions in 2D. Additionally, the method *predict_LL()* estimates the log-loss for each point, providing a numerical measure of its likelihood for each class, which is then expressed as a probability.
+
+    - **Visualization**: the method *viz_everything()* first, projects the data points onto the two most significant eigenvectors of the class-specific covariance matrices. Afterwards, this function provides a visual representation of how the model separates the classes. The colour given to the classified points is chosen according to their predicted class.
+
+
+- **Test**: 
+
+  - *Synthetic data*: The model has been tested using synthetic data generated with varying centroids and covariance matrices. Additionally, it has been evaluated across different dimensional settings, accommodating both N classes and N features to evaluate its performance under various conditions 
+
+  - *Visualization*: The generated classifications are visualized to verify the correct behavior of the model reducing the dimensions of the features to 2D. 
+
+  - *Confusion Matrix*: To evaluate the accuracy of the classification, we implemented a confusion matrix that clearly shows which data points have been correctly classified and where misclassifications occurred. 
+
+## Basic Usage Examples
+
+Quick example of how to generate data, fit the model, and visualize the classification:
+
+1. **Generate Synthetic Data**: The following examples create synthetic data using specified centroids and covariance matrices to simulate different class distributions.
+
+
+   ```python
+    c_1=numpy.array([-1,1])
+    c_2=numpy.array([2,1])
+    c_3=numpy.array([0,-2.5])
+    c_4=numpy.array([-3,-3])
+    centroid_list = [c_1,c_2,c_3,c_4]
+    sigma_1 = numpy.array([[1,1],[0,1]])
+    sigma_2 = numpy.array([[0.7, 0.5], [0.5, 1.2]])  
+    sigma_3 = numpy.array([[1.0, 0.0], [0.0, 1.0]])  
+    sigma_4 = numpy.array([[1.1, 0.3], [0.3, 0.5]]) 
+    sigma_list = [sigma_1,sigma_2,sigma_3,sigma_4]
+    nsamples_1 = 500
+    nsamples_2 = 500
+    nsamples_3 = 500
+    nsamples_4 =350
+    nsamples_list = [nsamples_1,nsamples_2,nsamples_3,nsamples_4]
+    xs, ys, n_classes = generate_classifier_data(centroid_list,sigma_list,nsamples_list, 73784)
+    ```
+
+2. **Fit the RDA Model**: Fit the model to the generated data, adjusting the regularization parameter 
+
+    ```python 
+    alpha = 0.7
+    model = LDAClassifier()
+    result = model.fit(xs, ys, n_classes, alpha)
+    ```
+3. **Visualize the classification**: Use the visualization function to see how the model separates the data into different classes:
+
+    ```python
+    x_p, predictions = result.viz_everything(xs)
+    plt.scatter(x_p[:, 0], x_p[:, 1], c=predictions)
+    plt.show()
+    ```
+
+## Improvements
+
+**1. Incorporating &gamma; Regularization**:  
+
+One potential improvement to the current implementation is the inclusion of the &gamma; parameter, which allows further regularization of the covariance matrices. 
+
+The formula for &gamma;-based regularization is:
+
+$$
+\hat{\Sigma}(\gamma) = \gamma \hat{\Sigma} + (1 - \gamma) \hat{\sigma}^2 \mathbf{I}
+$$
+Where:
+  - &#x3A3; is the covariance matrix (either class-specific or pooled).
+  - &#x3C3;<sup>2</sup> **I** represents the scalar covariance matrix, where **I** is the identity matrix and &#x3C3;<sup>2</sup> is the variance.
+  - &gamma; is the regularization parameter:
+    - &gamma; = 1: full covariance matrix.
+    - &gamma; = 0: scalar covariance matrix.
+
+Adding this parameter gives more flexibility in managing the balance between bias and variance, allowing a smoother shift between complex covariance structures and simpler ones.
+
+**2. Automate the generation of data**:  
+
+An improvement to the model could be automating the data generation process by allowing users to simply specify the number of classes, features, and the range for the number of samples. The centroids and covariance matrices would then be randomly generated within those parameters instead of manually inserting them.
diff --git a/generate_regression_data.py b/generate_regression_data.py
@@ -1,6 +1,5 @@
 import argparse
 import csv
-
 import numpy
 
 def linear_data_generator(m, b, rnge, N, scale, seed):

diff --git a/regularized_discriminant_analysis/models/RegularizedDiscriminantAnalysis.py b/regularized_discriminant_analysis/models/RegularizedDiscriminantAnalysis.py
@@ -1,17 +1,123 @@
+import matplotlib
+import matplotlib.pyplot as plt
+import numpy
 
+matplotlib.rcParams['figure.figsize'] = (20, 10)
 
 class RDAModel():
-    def __init__(self):
+
+    def _init_(self):
         pass
+
+    def fit(self, xs, ys, n_classes, alpha):
+        self.n_k = [] # Number of samples for each class
+        self.x = [] # Data points for each class
+        self.mu = [] # Mean vector for each class
+        self.pi = [] # Prior Probabilities
+        cov = []
+        cov_final = None # Pooled covariance matrix 
+        self.sigma = []
 
+        # Handle multiple classes
+        for i in range(n_classes):
+            # Number of data in a class
+            self.n_k.append(numpy.where(ys == i)[0].shape[0])
+            # Data from each class
+            self.x.append(xs[numpy.where(ys == i)[0],:])
+            # Get the mean of all data of each class
+            self.mu.append(numpy.mean(self.x[i], axis=0)) 
+            # Get the prior probabilities (pis) of each class
+            self.pi.append(self.n_k[i] / numpy.sum(self.n_k))
+            # Covariance matrix of each class
+            cov.append(numpy.matmul((self.x[i] - self.mu[i]).T, self.x[i] - self.mu[i])/self.n_k[i])
+            # Previous calculation to sigma
+            if cov_final is None:
+                cov_final = cov[i]
+            else:
+                cov_final += cov[i]
+
+        # Compute pooled covariance matrix for regularization
+        # (Shared covariance matrix = The same for each class)
+        cov_final /= n_classes
+
+        # Use regularization parameter alpha
+        # (Regularized covariance matrixes = A different matrix for each class)
+        for i in range(n_classes):
+            self.sigma.append(alpha*cov[i] + (1 - alpha)*cov_final)
+            # When alpha = 1 QDA (each class has its covariance matrix)
+            # When alpha = 0 LDA (pooled covariace matrox)
+
+        return RDAModelResults(self.pi, self.mu, self.sigma)
 
-    def fit(self, X, y):
-        return RDAModelResults()
+class RDAModelResults():
+    def __init__(self, pi, mu, sigma):
+        self.pi = pi
+        self.mu = mu
+        self.sigma = sigma
 
+    def predict(self, xs):
+        num_classes=len(self.mu)
+        scores = [] # contains discriminant scores
+        for i in range(num_classes):
+            sigma_inv = numpy.linalg.inv(self.sigma[i]) 
+            # distance = numpy.matmul(numpy.matmul((p - self.mu[i]).T, sigma_inv), (p - self.mu[i]))
+            distance = numpy.einsum('ij,ij->i', (xs - self.mu[i]) @ sigma_inv, (xs - self.mu[i]))
+            d_score = -0.5*distance + numpy.log(self.pi[i])-0.5*numpy.log(numpy.linalg.det(self.sigma[i]))
+            scores.append(d_score)
+
+        scores = numpy.array(scores)  
+        predictions = numpy.argmax(scores, axis=0) 
+        #predictions = predictions.reshape(xs.shape)  
+
+        return predictions
+
+    def predict_LL(self, x):
+        LL = self._log_loss(x)
+        exp_LL = numpy.exp(LL)
+        probabilities = exp_LL / numpy.sum(exp_LL, axis=0)
+        return probabilities
+
+    def viz_everything(self, xs, ys):
 
-class RDAModelResults():
-    def __init__(self):
-        pass
+        num_classes = len(self.mu)
+        sigma_inv_x = []
+        p =[]
+        x_p =[]
+
+        # Projection of xs to reduce data dimension into 2D (just for visualizing purposes)
+        if xs.shape[0] > 2:
+            for i in range(num_classes):
+                sigma_inv_x.append(numpy.linalg.inv(self.sigma[i])) 
+                p.append(numpy.linalg.eig(sigma_inv_x[i])[1][:,-2:])
+                x_p.append(numpy.matmul(xs[numpy.where(ys == i)[0],:], p[i]))
+
+            x_p = numpy.vstack(x_p)
+
+        else:
+            x_p=xs
+
+        predictions = self.predict(xs)
+        predictions = numpy.vstack(predictions)
+        return x_p, predictions
+
+    def _log_loss(self, x):
+
+        k = len(self.mu)  #n of classes
+        log_loss = numpy.zeros(k)
+
+        for i in range(k):
+            sigma_inv_i = numpy.linalg.inv(self.sigma[i])
+            for j in range(i+1, k):
+                sigma_inv_j = numpy.linalg.inv(self.sigma[j])
+                sigma_inv_mui = numpy.matmul(sigma_inv_i, x - self.mu[i])
+                sigma_inv_muj = numpy.matmul(sigma_inv_j, x - self.mu[j])
+
+                first_term = 1/2*numpy.log(numpy.linalg.det(self.sigma[j])/numpy.linalg.det(self.sigma[i]))
+                second_term = -1/2*numpy.matmul((x - self.mu[i]).T, sigma_inv_mui)
+                third_term = 1/2*numpy.matmul((x - self.mu[j]).T, sigma_inv_muj)
+                forth_term = numpy.log(self.pi[i]/self.pi[j])
 
-    def predict(self, x):
-        return 0.5
+                log_loss[i] = first_term + second_term + third_term + forth_term
+
+        return log_loss
+