Our quality of life depends on quality of decisions we make. If we make bad decisions we loose a lot and feel insecure and miserable. If we make good decisions we win a lot and feel confident, healthy and prosperous.
Quality of our decisions depends on two factors:
-quality of data/information on which our decisions are based;
-efficiency of methods we use to extract insights and knowledge from the data/information on which our decisions are based.
The most frequent problem we deal with in making our decisions is the classification problem. We constantly consciously or sub-consciously classify things, people, actions, etc. Our brains classify people on familiar/unfamiliar, our immune systems classify chemical and/or biological structures in our blood as dangerous/non-dangerous etc.
Today, we train computers to help us in solving different complex classification problems. A recent example of how “mining” and classification of old data resulted in a discovery of the biggest gold deposits is the case of Robert Kiosaki company. The researchers of the company digitized old maps with gold mining operations and then classified all places on two classes: with high probability of gold deposits and low probability of gold deposits. To verify that the classification was correct they checked places with high probability of gold deposits and discovered the biggest gold deposits. The full interview with Robert Kiyosaki is available here: https://www.youtube.com/watch?v=ffmHhxoEMYc
Many countries and private companies today are in a race to mine asteroids. The big problem in this business is the high cost of an error. If a company sends a spacecraft to an asteroid and discovers that there is not enough of gold (or other resource/mineral) on the asteroid it may be bankrupt. Companies with abilities to correctly classify asteroids with high probability of success will be doing well in this business.
In this post we consider how to build classification models from a dataset to classify data into a fixed set of classes. We will use datasets with digital images of ten digits from 0 to 9.
First of all, we need to understand why such problem as recognition of images can be solved by algorithms. In our case we have images of ten digits. For simplicity of explanation, let us assume that there are two types of pixels black and white. Every image will have a statistical distribution of dark pixels close to some digit. Therefore, by analyzing these statistical distributions of pixels we can design an algorithm of image classification.
The first model uses the Gaussian Naive Bayes algorithm to classify images. Copy the text below into a file model1.py and run it with the command python3 model1.py. Python is very sensitive to TABs, therefore make sure that all TABs in the code are preserved in the file model1.py. If during a coping process TABs were replaced on spaces you should delete the spaces and insert TABs.
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
def model1():
# Gaussian Naive Bayes Classification
print("Gaussian Naive Bayes Classification data=digits")
# load data
digits = load_digits()
# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
# train the model
clf = GaussianNB()
clf.fit(X_train, y_train)
# make prediction on X_train
y_predict=clf.predict(X_train)
# calculate accuracy
accuracy=metrics.accuracy_score(y_train,y_predict)
print("accuracy train=",accuracy)
# use the model to predict the labels of the test data
y_predict=clf.predict(X_test)
# calculate accuracy
accuracy=metrics.accuracy_score(y_test,y_predict)
print("accuracy test=",accuracy)
cm=metrics.confusion_matrix(y_test,y_predict)
print("confusion matrix=\n",cm)
return
# main
for i in range(3):
print("Run ",i+1)
model1()
The results are shown below.

Accuracy measure is a ratio of a number of correct results of classifications to the total number of classifications. As we can see the accuracy on training sets is higher than on testing sets.
A confusion matrix shows numbers of correct results of classifications on the main diagonal and incorrect results outside of the main diagonal.
The second model uses the Linear Discriminant Analysis Classification algorithm to classify images.
Copy the text below into a file model2.py and run it with the command python3 model2.py. Python is very sensitive to TABs, therefore make sure that all TABs in the code are preserved in the file model2.py. If during a coping process TABs were replaced on spaces you should delete the spaces and insert TABs.
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as lda
def model2():
# Linear Discriminant Analysis
print("Linear Discriminant Analysis Classification data=digits")
# load data
digits = load_digits()
# split the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
# train the model
model=lda()
model.fit(X_train, y_train)
# make prediction on X_train
y_predict=model.predict(X_train)
# calculate accuracy
accuracy=metrics.accuracy_score(y_train,y_predict)
print("accuracy train=",accuracy)
# use the model to predict the labels of the test data
y_predict=model.predict(X_test)
# calculate accuracy
accuracy=metrics.accuracy_score(y_test,y_predict)
print("accuracy test=",accuracy)
cm=metrics.confusion_matrix(y_test,y_predict)
print("confusion matrix=\n",cm)
return
# main
for i in range(3):
print("Run ",i+1)
model2()
The results are shown below.

The third model uses the Random Forest Classification algorithm to classify images.
Copy the text below into a file model3.py and run it with the command python3 model3.py. Python is very sensitive to TABs, therefore make sure that all TABs in the code are preserved in the file model3.py. If during a coping process TABs were replaced on spaces you should delete the spaces and insert TABs.
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier as rfc
def model3():
# Random Forest Classifier
print("Random Forest Classifier data=digits")
# load data
digits=load_digits()
# split the data into training and validation sets
X_train,X_test,y_train,y_test=train_test_split(digits.data, digits.target)
# train the model
model=rfc(random_state=0,n_estimators=100)
model.fit(X_train, y_train)
# make prediction on X_train
y_predict=model.predict(X_train)
# calculate accuracy
accuracy=metrics.accuracy_score(y_train,y_predict)
print("accuracy train=",accuracy)
# use the model to predict the labels of the test data
y_predict=model.predict(X_test)
# calculate accuracy
accuracy=metrics.accuracy_score(y_test,y_predict)
print("accuracy test=",accuracy)
cm=metrics.confusion_matrix(y_test,y_predict)
print("confusion matrix=\n",cm)
return
# main
for i in range(3):
print("Run ",i+1)
model3()
The results are shown below.

Now, we compare the three models above using as a criterion for comparison an average accuracy on testing datasets.

As we can see the Random Forest Classifier has the best average accuracy on the testing sets among the three classification algorithms.
If you want to experiment with other classifiers, the table below will be useful.

To see how classification algorithms are used to predict results in sports see this link: https://www.researchgate.net/publication/348787411_Sport_Result_Prediction_Using_Classification_Methods
To see how machine learning and classification algorithms are used in cryptocurrency trading see this link: https://jfin-swufe.springeropen.com/articles/10.1186/s40854-020-00217-x
To see how classification algorithms are used in medicine see this link: https://aip.scitation.org/doi/10.1063/5.0076768
To see how machine learning and classification algorithms are used to predict lottery numbers see this link: https://medium.com/@polanitzer/how-to-guess-accurately-3-lottery-numbers-out-of-6-using-lstm-model-e148d1c632d6
To see how classification algorithms are used in prediction of weather see this link:
https://www.researchgate.net/publication/272482887_Weather_Prediction_using_Classification
To see how machine learning and classification algorithms are used to predict stock market index see this link: