Principal component analysis (PCA) is a procedure that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of uncorrelated variables called principal components. It is used to get rid of redundant information: for example, transforming the data from an Excel file into a smaller table that still keeps the important information.
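To see what "uncorrelated" means in practice, here is a tiny illustration with plain NumPy (a toy example of my own, not the class we build in this article): two strongly correlated variables become uncorrelated once projected onto the eigenvectors of their covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)  # y is strongly correlated with x
X = np.column_stack([x, y])

# eigenvectors of the covariance matrix define the orthogonal transformation
cov = np.cov(X, rowvar=False)
_, vects = np.linalg.eigh(cov)
C = (X - X.mean(axis=0)) @ vects             # the principal components

print(np.corrcoef(X, rowvar=False)[0, 1])    # close to 1: correlated
print(np.corrcoef(C, rowvar=False)[0, 1])    # close to 0: uncorrelated
```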
Create a simple project in Python and add the PCA.py file. After creating a folder in the Python environment, in the Project tool window click File | New | Python File and name it PCA.py. In the constructor of the class, add the matrix as a parameter. Please don't forget to import the NumPy and pandas packages into PCA.py, and later the PCA class itself into the worksheet, as shown in the last code part.
```python
def __init__(self, X):
    self.X = X
```
Before we go further, I think it is my duty to explain the theoretical and mathematical part. We want to calculate the covariance matrix; I chose the observation-driven approach. The Lagrangian derivation is shown in the picture below.
Therefore a1 is an eigenvector of the matrix (1/n)(XᵀX), where n is the number of rows. An eigenvector is a nonzero vector that the matrix only scales, without changing its direction.
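A quick numerical check of that eigenvector property, on a small symmetric matrix of my own choosing:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigenVal, eigenVect = np.linalg.eig(A)
v = eigenVect[:, 0]            # eigenvectors are the columns of eigenVect
print(A @ v)                   # A only scales v...
print(eigenVal[0] * v)         # ...by its eigenvalue (same vector, up to floating point)
```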
Next, we have to provide the dimension, or axis, along which NumPy computes: axis=0 gives the averages of the columns.
```python
avgVar = np.mean(self.X, axis=0)
stdDevVar = np.std(self.X, axis=0)
self.Xstd = (self.X - avgVar) / stdDevVar
```
If we denote the standardized matrix X by Xstd, then the covariance matrix of Xstd is the matrix for which we compute the eigenvalues and the eigenvectors.
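The eigen-decomposition step itself is not shown as code above, so here is a minimal sketch of it; the toy data and the choice of np.linalg.eigh (which is valid because the covariance matrix is symmetric; np.linalg.eig works as well) are my assumptions:

```python
import numpy as np

# toy data: 6 observations, 3 variables, standardized as in the article
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
Xstd = (X - X.mean(axis=0)) / X.std(axis=0)

n = Xstd.shape[0]
cov = (1 / n) * Xstd.T @ Xstd              # covariance matrix of the standardized data
eigenVal, eigenVect = np.linalg.eigh(cov)  # eigh: suited to symmetric matrices
print(eigenVal)
```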
After this, we will sort the eigenvalues and the eigenvectors in descending order.
```python
kReverse = [k for k in reversed(np.argsort(eigenVal))]
self.alpha = eigenVal[kReverse]
self.a = eigenVect[:, kReverse]   # the eigenvectors are the columns of eigenVect
self.C = self.Xstd @ self.a
```
We want to multiply the standardized matrix by the matrix of multipliers, i.e. the eigenvectors. This self.C holds the principal components. The @ operator multiplies two matrices; it is equivalent to the matmul() function, which @ overloads. If we used matmul() we would write self.C = np.matmul(self.Xstd, self.a).
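A one-line check that the two spellings agree, on small matrices of my own:

```python
import numpy as np

Xstd = np.arange(6.0).reshape(3, 2)
a = np.array([[1.0, 0.0],
              [0.0, -1.0]])
# @ on NumPy arrays delegates to np.matmul, so the results are identical
print(np.array_equal(Xstd @ a, np.matmul(Xstd, a)))
```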
I wrote the functions that return these values at the end of the class.
```python
def getEigenValues(self):
    return self.alpha

def getEigenVectors(self):
    return self.a

def getPrincipalComponents(self):
    return self.C
```
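For reference, here is the whole class assembled in one place. The covariance and eigen-decomposition lines are my reconstruction of the step described in the text (it sits between the standardization and the sorting snippets); everything else follows the snippets above.

```python
import numpy as np

class PCA:
    def __init__(self, X):
        self.X = X
        # standardize: subtract the column means, divide by the column std devs
        avgVar = np.mean(self.X, axis=0)
        stdDevVar = np.std(self.X, axis=0)
        self.Xstd = (self.X - avgVar) / stdDevVar
        # covariance matrix of the standardized data and its eigen-decomposition
        n = self.X.shape[0]
        cov = (1 / n) * self.Xstd.T @ self.Xstd
        eigenVal, eigenVect = np.linalg.eigh(cov)
        # sort the eigenvalues, and the eigenvectors with them, in descending order
        kReverse = [k for k in reversed(np.argsort(eigenVal))]
        self.alpha = eigenVal[kReverse]
        self.a = eigenVect[:, kReverse]
        # the principal components
        self.C = self.Xstd @ self.a

    def getEigenValues(self):
        return self.alpha

    def getEigenVectors(self):
        return self.a

    def getPrincipalComponents(self):
        return self.C
```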
We will also need another file, worksheet.py, where we use the class and functions from PCA.py and read our data file 'Teritorial.csv'.
```python
import numpy as np
import PCA as pca
import pandas as pd
import graphics as grp

# what happens if we need random values in the interval [a, b], for any given a and b
def random(a=None, b=None, size=None):
    # [a, b]
    return a + np.random.rand(size) * (b - a)

vector = random(1, 2, 35)
print(type(vector), vector)

X = np.ndarray(shape=(7, 5), dtype=float, buffer=vector, order='C')
print(X)

# we have the labels of the rows in the first column
table = pd.read_csv('Teritorial.csv', sep=',', index_col=0)
print(table)

obsName = table.index[:]
varName = table.columns[1:]
X = table.iloc[:, 1:].values
print(X)
print(obsName)
print(varName)

n = X.shape[0]   # number of observations (rows)
m = X.shape[1]   # number of variables (columns)
print('No of observations: ', n)
print('No of variables: ', m)

pcaModel = pca.PCA(X)
# print(pcaModel.getEigenValues())
```
You might not need some of the lines of code printed in worksheet.py yet; we will use them in the next chapter, when I will talk more about the covariance matrix and about graphics. I tried to cover both the theoretical and mathematical part and the code. If you have any suggestions for my future explanations, I will be more than happy to receive them.