Practical data science and optimization: classification algorithms

By I_g_o_r | Practical data science & optimization in examples | 26 Oct 2022

Our quality of life depends on the quality of the decisions we make. If we make bad decisions, we lose a lot and feel insecure and miserable. If we make good decisions, we win a lot and feel confident, healthy and prosperous.

The quality of our decisions depends on two factors:

- the quality of the data/information on which our decisions are based;

- the efficiency of the methods we use to extract insights and knowledge from that data/information.

The most frequent problem we face when making decisions is the classification problem. We constantly classify things, people, actions, etc., consciously or subconsciously. Our brains classify people as familiar or unfamiliar, and our immune systems classify chemical and biological structures in our blood as dangerous or harmless.

Today, we train computers to help us solve complex classification problems. A recent example of how “mining” and classifying old data led to a major gold discovery is the case of Robert Kiyosaki's company, as he describes it in an interview. The company's researchers digitized old maps of gold mining operations and then classified all locations into two classes: high probability of gold deposits and low probability of gold deposits. To verify that the classification was correct, they checked the high-probability locations and, according to the interview, discovered very large gold deposits. The full interview with Robert Kiyosaki is available here: https://www.youtube.com/watch?v=ffmHhxoEMYc

Many countries and private companies today are in a race to mine asteroids. The big problem in this business is the high cost of an error. If a company sends a spacecraft to an asteroid and discovers that there is not enough gold (or another resource or mineral) there, it may go bankrupt. Companies that can correctly identify the asteroids with a high probability of success will do well in this business.

In this post we consider how to build classification models that assign data to a fixed set of classes. We will use a dataset of digital images of the ten digits from 0 to 9 (the digits dataset that scikit-learn loads with load_digits).

First of all, we need to understand why a problem such as image recognition can be solved by algorithms at all. In our case we have images of ten digits. For simplicity, let us assume that there are only two types of pixels, black and white. Images of the same digit tend to have dark pixels in similar positions, so each digit has its own characteristic statistical distribution of dark pixels. Therefore, by analyzing these distributions we can design an image classification algorithm.
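To make this idea concrete, here is a minimal sketch that computes the average image of each digit and assigns every image to the digit whose average is closest. This nearest-mean rule is only an illustration of the reasoning above and is not one of the three models used in this post; the names class_means and nearest_mean_predict are hypothetical helpers introduced for the example. The accuracy is measured on the same images that were used to compute the averages, so it is optimistic.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target  # each row of X is one flattened 8x8 image

# Average image ("pixel profile") of each digit class, one row per digit.
class_means = np.array([X[y == d].mean(axis=0) for d in range(10)])


def nearest_mean_predict(samples):
    """Assign each sample to the digit whose average image is closest."""
    # Squared Euclidean distance from every sample to every class mean.
    dists = ((samples[:, None, :] - class_means[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


predictions = nearest_mean_predict(X)
nearest_mean_accuracy = (predictions == y).mean()
print("nearest-mean accuracy on the full dataset =", nearest_mean_accuracy)
```

If the pixel distributions of different digits were not distinguishable, this rule would do no better than guessing one class out of ten.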

The first model uses the Gaussian Naive Bayes algorithm to classify images. Copy the code below into a file model1.py and run it with the command python3 model1.py (the scikit-learn package must be installed). Python uses indentation to define code blocks, so make sure the indentation is preserved when you copy the code. Either tabs or spaces work, as long as they are not mixed within a block; four spaces per level is the usual convention.

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB


def model1():
    # Gaussian Naive Bayes Classification
    print("Gaussian Naive Bayes Classification data=digits")
    # load data
    digits = load_digits()
    # split the data into training and test sets (a new random split on every call)
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
    # train the model
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    # make prediction on X_train
    y_predict = clf.predict(X_train)
    # calculate accuracy on the training data
    accuracy = metrics.accuracy_score(y_train, y_predict)
    print("accuracy train=", accuracy)
    # use the model to predict the labels of the test data
    y_predict = clf.predict(X_test)
    # calculate accuracy on the test data
    accuracy = metrics.accuracy_score(y_test, y_predict)
    print("accuracy test=", accuracy)
    # rows are true digits, columns are predicted digits
    cm = metrics.confusion_matrix(y_test, y_predict)
    print("confusion matrix=\n", cm)


# main: repeat the experiment three times with different random splits
for i in range(3):
    print("Run", i + 1)
    model1()

The results are shown below.

Accuracy is the ratio of the number of correct classifications to the total number of classifications. As we can see, the accuracy on the training sets is higher than on the testing sets.

A confusion matrix shows the numbers of correct classifications on the main diagonal and the numbers of incorrect classifications outside the main diagonal.
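The relation between the two measures can be checked by hand. The sketch below builds a confusion matrix for a small three-class problem and derives the accuracy from its main diagonal. The labels in y_true and y_pred are made up for illustration and are not output of the models in this post. Rows correspond to true classes and columns to predicted classes, which is the same convention scikit-learn's confusion_matrix uses.

```python
# Made-up true and predicted labels for a 3-class problem (illustration only).
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2, 0]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0, 2, 0]

n_classes = 3
# cm[i][j] counts samples whose true class is i and predicted class is j.
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(y_true, y_pred):
    cm[t][p] += 1

correct = sum(cm[i][i] for i in range(n_classes))  # main diagonal
total = sum(sum(row) for row in cm)                # all classifications
accuracy = correct / total

for row in cm:
    print(row)
print("accuracy =", accuracy)
```

Here 7 of the 10 labels lie on the main diagonal, so the accuracy is 0.7. The off-diagonal entries show which classes get confused with each other, which a single accuracy number hides.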

The second model uses the Linear Discriminant Analysis algorithm to classify images.

Copy the code below into a file model2.py and run it with the command python3 model2.py. As with the first model, make sure the indentation is preserved when you copy the code, and do not mix tabs and spaces.

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda


def model2():
    # Linear Discriminant Analysis
    print("Linear Discriminant Analysis Classification data=digits")
    # load data
    digits = load_digits()
    # split the data into training and test sets (a new random split on every call)
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
    # train the model
    model = lda()
    model.fit(X_train, y_train)
    # make prediction on X_train
    y_predict = model.predict(X_train)
    # calculate accuracy on the training data
    accuracy = metrics.accuracy_score(y_train, y_predict)
    print("accuracy train=", accuracy)
    # use the model to predict the labels of the test data
    y_predict = model.predict(X_test)
    # calculate accuracy on the test data
    accuracy = metrics.accuracy_score(y_test, y_predict)
    print("accuracy test=", accuracy)
    # rows are true digits, columns are predicted digits
    cm = metrics.confusion_matrix(y_test, y_predict)
    print("confusion matrix=\n", cm)


# main: repeat the experiment three times with different random splits
for i in range(3):
    print("Run", i + 1)
    model2()

The results are shown below.

The third model uses the Random Forest algorithm to classify images.

Copy the code below into a file model3.py and run it with the command python3 model3.py. As before, make sure the indentation is preserved when you copy the code, and do not mix tabs and spaces.

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier as rfc


def model3():
    # Random Forest Classifier
    print("Random Forest Classifier data=digits")
    # load data
    digits = load_digits()
    # split the data into training and test sets (a new random split on every call)
    X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
    # train the model: 100 trees, fixed seed for the forest itself
    model = rfc(random_state=0, n_estimators=100)
    model.fit(X_train, y_train)
    # make prediction on X_train
    y_predict = model.predict(X_train)
    # calculate accuracy on the training data
    accuracy = metrics.accuracy_score(y_train, y_predict)
    print("accuracy train=", accuracy)
    # use the model to predict the labels of the test data
    y_predict = model.predict(X_test)
    # calculate accuracy on the test data
    accuracy = metrics.accuracy_score(y_test, y_predict)
    print("accuracy test=", accuracy)
    # rows are true digits, columns are predicted digits
    cm = metrics.confusion_matrix(y_test, y_predict)
    print("confusion matrix=\n", cm)


# main: repeat the experiment three times with different random splits
for i in range(3):
    print("Run", i + 1)
    model3()

The results are shown below.

Now we compare the three models above, using the average accuracy on the testing datasets as the criterion.

As we can see, the Random Forest Classifier has the best average accuracy on the testing sets among the three classification algorithms.
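One way to reproduce such a comparison in a single script is sketched below. It differs from the three scripts above in one respect, which is an assumption of this sketch: every model is trained and tested on the same train/test splits (fixed with random_state), so the averages are directly comparable. The helper names make_models and average_test_accuracy are hypothetical.

```python
from statistics import mean

from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB


def make_models():
    """Return fresh, untrained instances of the three classifiers."""
    return {
        "Gaussian Naive Bayes": GaussianNB(),
        "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }


def average_test_accuracy(n_runs=3):
    """Average the test accuracy of each model over n_runs random splits."""
    digits = load_digits()
    scores = {name: [] for name in make_models()}
    for seed in range(n_runs):
        # The same split is used for all three models in this run.
        X_train, X_test, y_train, y_test = train_test_split(
            digits.data, digits.target, random_state=seed
        )
        for name, model in make_models().items():
            model.fit(X_train, y_train)
            # score() returns the accuracy on the given data.
            scores[name].append(model.score(X_test, y_test))
    return {name: mean(values) for name, values in scores.items()}


averages = average_test_accuracy()
for name, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: average test accuracy = {avg:.4f}")
```

Three runs is a small sample; increasing n_runs gives a more stable average at the cost of a longer run time.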

If you want to experiment with other classifiers, the table below will be useful.
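As a starting point for such experiments, the sketch below evaluates two further scikit-learn classifiers with 5-fold cross-validation instead of a single random split. The choice of k-Nearest Neighbors and a Decision Tree, and their parameters, are assumptions made for this example and are not taken from the table.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()

# Classifiers to try; add or replace entries to experiment with others.
candidates = {
    "k-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

cv_results = {}
for name, clf in candidates.items():
    # cross_val_score trains and tests the classifier on 5 different folds
    # and returns one accuracy value per fold.
    scores = cross_val_score(clf, digits.data, digits.target, cv=5)
    cv_results[name] = scores.mean()
    print(f"{name}: mean 5-fold accuracy = {scores.mean():.4f}")
```

Any scikit-learn classifier that provides fit and predict can be added to the candidates dictionary in the same way.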

To see how classification algorithms are used to predict results in sports see this link: https://www.researchgate.net/publication/348787411_Sport_Result_Prediction_Using_Classification_Methods

To see how machine learning and classification algorithms are used in cryptocurrency trading see this link: https://jfin-swufe.springeropen.com/articles/10.1186/s40854-020-00217-x

To see how classification algorithms are used in medicine see this link: https://aip.scitation.org/doi/10.1063/5.0076768

To see how machine learning and classification algorithms are used to predict lottery numbers see this link: https://medium.com/@polanitzer/how-to-guess-accurately-3-lottery-numbers-out-of-6-using-lstm-model-e148d1c632d6

To see how classification algorithms are used in weather prediction see this link: https://www.researchgate.net/publication/272482887_Weather_Prediction_using_Classification

To see how machine learning and classification algorithms are used to predict a stock market index see this link: https://pdf.sciencedirectassets.com/280203/1-s2.0-S1877050915X00068/1-s2.0-S1877050915004688/main.pdf

