MM

Group project - Machine learning (P.11)

Ken Edward Andrea Natalia

Report

Project strategy

STEP 1 Exploratory Data Analysis (EDA)

1. Data Loading: The script loads the "Indian Pines" dataset and its ground truth labels. The dataset includes information about different bands and their corresponding labels.
2. Data Preprocessing: The script performs data cleaning and preprocessing steps, including standardization using StandardScaler. It checks if the data has been previously computed; if not, it processes the data and saves it for future use.
3. Visualizations: The script plots the ground truth labels and the interband correlation matrix.
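
The preprocessing step above can be sketched as follows. This is a minimal illustration rather than the project's actual script: the function name `load_or_compute_scaled`, the cache file name, and the synthetic cube standing in for the real 145 x 145 x 200 data are all assumptions.

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler


def load_or_compute_scaled(cube, cache_path):
    """Return the standardized (n_pixels, n_bands) matrix, reusing a cached copy if present."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    n_rows, n_cols, n_bands = cube.shape
    # Flatten the image so that every pixel becomes one sample with n_bands features.
    pixels = cube.reshape(n_rows * n_cols, n_bands).astype(float)
    scaled = StandardScaler().fit_transform(pixels)
    np.save(cache_path, scaled)
    return scaled


# Synthetic stand-in for the hyperspectral cube (hypothetical values).
rng = np.random.default_rng(0)
cube = rng.integers(1000, 9000, size=(145, 145, 200))

with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "scaled.npy")
    scaled_first = load_or_compute_scaled(cube, cache_path)  # computes and saves
    scaled_again = load_or_compute_scaled(cube, cache_path)  # loads the cached file

print(scaled_first.shape)
```

After standardization every band has zero mean and unit variance, so bands with large raw values do not dominate the later decompositions.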

STEP 2 Dimensionality Reduction (PCA and LDA)

4. Principal Component Analysis (PCA): If PCA is enabled (--pca option), the script applies PCA to reduce the dimensionality of the data. Visualizations include the explained variance ratio plot and scatter plots of principal components.
5. Linear Discriminant Analysis (LDA): If LDA is enabled (--lda option), the script applies LDA for further dimensionality reduction. Visualizations include the explained variance ratio plot and scatter plots of linear components.

STEP 3 Model Training and Testing

6. Data Splitting: The dataset is split into training and testing sets based on the specified ratio (--test option).
7. Model Training: The script supports various classifiers such as Random Forest (--RF), Support Vector Classifier (--SVC), Logistic Regression (--LogR), and Gaussian Naive Bayes (--GNB). Model training is performed using the training set, and hyperparameter tuning is conducted using GridSearchCV.
8. Model Testing: If a separate test set is specified, the trained models are applied to make predictions on the test set.
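
A minimal sketch of the split, tuning, and prediction steps, assuming scikit-learn's `train_test_split` and `GridSearchCV`. The synthetic data, the 0.25 test ratio, and the parameter grid are illustrative assumptions, not the values used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic multi-class data standing in for the reduced pixel features.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=3, random_state=0
)

test_ratio = 0.25  # plays the role of the value passed with --test (assumed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_ratio, stratify=y, random_state=0
)

# Hypothetical grid; cross-validation uses the training set only.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
search.fit(X_train, y_train)

# The refitted best estimator is used for predictions on the held-out set.
y_pred = search.predict(X_test)
print(search.best_params_)
```

Stratifying the split keeps the class proportions similar in both sets, which matters when some land types have far fewer pixels than others.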

STEP 4 Model Evaluation and Reporting

9. Model Evaluation: For each trained model, the script evaluates its performance using metrics like accuracy, classification report, confusion matrix, precision, recall, and F1-score. These reports are displayed in the console.
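
The evaluation step can be illustrated with scikit-learn's metric functions. The label vectors below are made up for the example; only the function calls reflect what the step describes.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted land-type labels for ten pixels.
y_true = [2, 2, 3, 3, 3, 5, 5, 5, 5, 8]
y_pred = [2, 3, 3, 3, 3, 5, 5, 5, 2, 8]

accuracy = accuracy_score(y_true, y_pred)
# Rows are actual classes, columns are predicted classes, in the order given by `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[2, 3, 5, 8])

print(accuracy)
print(cm)
# Per-class precision, recall, F1-score and support in one table.
print(classification_report(y_true, y_pred, zero_division=0))
```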

Generally, our script provides insights into the classification performance of different machine learning models on the "Indian Pines" dataset.

Background

Hyperspectral data provide a lot of information for the remote discrimination of ground cover. However, because the number of spectral bands is usually large, much of that information can be redundant, and analysing and interpreting hyperspectral images can be a challenge.

The goal of the group assignment was to explore machine learning tools for analyzing hyperspectral images of the Indian Pines fields and to classify land surfaces according to the ground truth provided.

The dataset consists of 200 satellite images of the same area, each corresponding to one spectral band of the remote sensor. We expect different types of land surface to have different reflectivity across those 200 bands, so we attempt to classify land types according to how they appear in the images of the different bands.

We also have a "reference": an image that contains the "target", i.e. the classified patterns of the surface, e.g. 'Corn-notill', 'Corn', etc.
Once a model has been trained on this dataset, i.e. it can predict the type of land surface from the satellite imagery, it can in principle be applied to classify imagery of other areas recorded in the same 200 bands.

Exploratory Data Analysis

Important note: All pixels with target 0, and all targets that are only sparsely covered by the data, were removed or set to NaN. The removed targets are 0, 1, 7 and 9. In the end we analyse targets 2, 3, 4, 5, 6, 8 and 10-16: 13 in total, one for each type of land.
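
The class filtering described in this note can be sketched with a boolean mask. The helper name `drop_sparse_classes` and the random data are assumptions made for illustration; the list of dropped targets comes from the note above.

```python
import numpy as np

DROPPED_CLASSES = [0, 1, 7, 9]  # background plus the sparsely covered targets


def drop_sparse_classes(X, y, dropped=DROPPED_CLASSES):
    """Keep only the samples whose label is not in `dropped`."""
    keep = ~np.isin(y, dropped)
    return X[keep], y[keep]


# Synthetic stand-in: 1000 pixels, 200 bands, labels 0..16.
rng = np.random.default_rng(0)
y = rng.integers(0, 17, size=1000)
X = rng.normal(size=(1000, 200))

X_kept, y_kept = drop_sparse_classes(X, y)
print(np.unique(y_kept))
```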

Figure 1. Binned distribution of the image cells with different features (e.g. land types).

Principal Components Analysis

We first explore the data by plotting images for randomly chosen bands. Several patterns are visible even from this simple procedure, which suggests that some land types are clearly distinguishable in different satellite bands.

Figure 2. Example of the satellite images in different spectral bands.

As a first step we apply a Principal Components decomposition to the 200 matrices of size 145x145 to see (i) whether the PCs are distinguishable from each other and (ii) how many PCs we need to describe most of the variability in the dataset. This analysis lets us see the clusters in the data and quantify their "separation", which informs the choice of methods for further analysis.

The PC analysis shows that the first 5 PCs explain more than 92% of the total variability in the dataset. Of these:

PC 1 explains 68% (a fraction of 0.68)
PC 2 explains 19% (a fraction of 0.19)
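
How many components are needed to reach a given share of the variance can be read from the cumulative explained variance ratio. The sketch below uses synthetic correlated data, so its numbers are not the report's; the 0.92 threshold mirrors the figure quoted above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic correlated "bands": 5 latent factors mixed into 200 columns plus small noise.
latent = rng.normal(size=(2000, 5)) * np.array([8.0, 4.0, 2.0, 1.5, 1.0])
X = latent @ rng.normal(size=(5, 200)) + 0.1 * rng.normal(size=(2000, 200))

pca = PCA(n_components=10).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.92
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print(cumulative.round(3), n_needed)
```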

There is also a clear clustering of the data points in PC space (Figure 3), suggesting that the data clusters are separated and can be analyzed further with machine learning methods.

Figure 3. First 3 PCs plotted in a 3D space.

The next step was to check whether images reconstructed from only the first 10 PCs still reflect the main features to be captured by machine learning techniques. Figure 4 shows those reconstructed images, and we conclude that they reflect well the land features we want to classify.

Figure 4. Reconstructed images (applying inverse transform with first 10 PCs) of the different bands.
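
A reconstruction of this kind can be sketched with `PCA.inverse_transform`. The small synthetic image below (20 x 20 pixels, 50 bands) is an assumption that keeps the example fast; the real data are 145 x 145 pixels with 200 bands.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_rows, n_cols, n_bands = 20, 20, 50  # small stand-in for 145 x 145 x 200

# Pixels driven by 3 latent factors plus a little noise.
latent = rng.normal(size=(n_rows * n_cols, 3))
pixels = latent @ rng.normal(size=(3, n_bands))
pixels += 0.05 * rng.normal(size=(n_rows * n_cols, n_bands))

pca = PCA(n_components=10)
scores = pca.fit_transform(pixels)             # project onto the first 10 PCs
reconstructed = pca.inverse_transform(scores)  # map back to band space

# Any band can be reshaped into an image again for visual comparison.
band_image = reconstructed[:, 0].reshape(n_rows, n_cols)
rel_error = np.linalg.norm(pixels - reconstructed) / np.linalg.norm(pixels)
print(band_image.shape, rel_error)
```

A small relative error indicates that the retained components carry most of the structure of the original bands.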

Kernel Principal Components Analysis

We did all the final runs, and provide the analytic report, with the KPCA method for data decomposition, which demonstrated better performance than PCA. Principal Component Analysis (PCA) and Kernel Principal Component Analysis (Kernel PCA) are both dimensionality reduction techniques. The main difference between the two is that Kernel PCA allows for nonlinear dimensionality reduction. It is particularly useful for data with nonlinear relationships, as it can capture more complex patterns. In this case we chose the radial basis function (RBF) kernel.
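
A minimal Kernel PCA sketch with an RBF kernel, using scikit-learn's `KernelPCA`. The choice of 20 components is an assumption suggested by the model names in this report (e.g. "KPCA_020"), and the data are random placeholders.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))  # 300 pixels, 200 bands (synthetic)
X_scaled = StandardScaler().fit_transform(X)

# With gamma left unset, scikit-learn uses 1 / n_features for the RBF kernel.
kpca = KernelPCA(n_components=20, kernel="rbf")
X_kpca = kpca.fit_transform(X_scaled)
print(X_kpca.shape)
```

Kernel PCA builds an n_samples x n_samples kernel matrix, so memory and run time grow quickly with the number of pixels.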

Remarks

Our exploratory data analysis focused first on understanding the dataset and obtaining an overall description of it. The pixel values contained in the 200 image bands were analyzed for redundancy.

This was achieved by assessing the interband correlation. Of the first 15 bands, band1 had the weakest correlation with the remaining bands (band2-band15), while band2 to band15 were very strongly correlated with each other, with coefficients between 0.7 and 0.9 in most combinations.
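
An interband correlation check of this kind can be sketched with `numpy.corrcoef`. The synthetic bands below are built so that the first one is independent and the other fourteen share a common signal; they only mimic the pattern described above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels = 5000
shared = rng.normal(size=(n_pixels, 1))

bands = np.hstack([
    rng.normal(size=(n_pixels, 1)),                  # band1: independent of the rest
    shared + 0.4 * rng.normal(size=(n_pixels, 14)),  # band2..band15: share one signal
])

# rowvar=False treats each column (band) as a variable.
corr = np.corrcoef(bands, rowvar=False)

# Mean absolute correlation of each band with all the others (diagonal excluded).
mean_abs_corr = (np.abs(corr).sum(axis=1) - 1.0) / (corr.shape[0] - 1)
weakest = int(np.argmin(mean_abs_corr)) + 1  # 1-based band number
print(weakest, corr[1:, 1:].min().round(2))
```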

The correlation coefficients of the bands with the class (species) column were also analyzed. The highest correlation coefficient was approximately 0.245. The bands with a correlation coefficient >= 0.238 with the class (species) column were as follows:

Band ID	Correlation Coefficient with the Class Column
band147	0.245247
band148	0.245009
band149	0.242812
band150	0.242855
band151	0.238947
band153	0.238003
band155	0.239565
band184	0.238006
band185	0.241086
band188	0.238426
band190	0.239321
band191	0.238504
band192	0.239755
band193	0.241024
band194	0.242920
band195	0.238310
band196	0.240277
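
A selection like the one in the table can be sketched by correlating every band with the class column and keeping those above the threshold. The band names and synthetic values are placeholders; only the 0.238 threshold is taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
labels = rng.integers(2, 17, size=n)  # class codes 2..16

band_names = [f"band{i}" for i in range(1, 6)]
bands = rng.normal(size=(n, len(band_names)))
bands[:, 2] += 0.3 * labels           # make band3 track the class code

threshold = 0.238
# Pearson correlation of each band with the class column.
corr = np.array([np.corrcoef(bands[:, j], labels)[0, 1] for j in range(bands.shape[1])])
selected = {
    name: round(float(c), 3)
    for name, c in zip(band_names, corr)
    if abs(c) >= threshold
}
print(selected)
```

Because the class code is a nominal label, a Pearson correlation with it is best read as a rough screening heuristic rather than a measure of how separable the classes are.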

These bands are also strongly correlated with each other, so any two of them could most probably be used to train an algorithm to make predictions.

A plot of the pixel distribution of the 'Class' column for band196 is presented below:

Figure 5. Band 196 vs Class

Linear Discriminant Analysis

Figures 6a and 6b demonstrate the results of a simple Linear Discriminant Analysis (LDA) and a t-Distributed Stochastic Neighbor Embedding (t-SNE).

Figure 6a.

Figure 6b.

LDA is a technique that reduces dimensionality and aids classification by finding the linear combinations of features that best separate the different classes in the dataset. It maximizes the distance between the means of different classes while minimizing the spread within each class, which enhances the discriminatory power of the features. It is therefore best applied before a classification algorithm, where it can improve the accuracy of the classification.
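
A minimal LDA sketch with scikit-learn's `LinearDiscriminantAnalysis`. LDA can produce at most (number of classes - 1) components, so 13 classes allow up to 12 discriminants; the synthetic data and sample counts are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic 13-class problem standing in for the 13 land types.
X, y = make_classification(
    n_samples=1300,
    n_features=20,
    n_informative=10,
    n_redundant=0,
    n_classes=13,
    n_clusters_per_class=1,
    random_state=0,
)

# n_components may not exceed min(n_features, n_classes - 1) = 12 here.
lda = LinearDiscriminantAnalysis(n_components=12)
X_lda = lda.fit_transform(X, y)  # supervised: the labels are required

print(X_lda.shape)
print(lda.explained_variance_ratio_.round(3))
```

Unlike PCA, the fit uses the class labels, so it should be fitted on the training set only to avoid leaking test labels into the projection.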

Figure 7. Variance explained after application of the LDA.

LDA significantly increases the variance explained by the first 5 components relative to PCA.

Classification report

Note that we dropped class '0' (which likely covers the areas that are not meant to be classified). Based on a preliminary analysis of the raw dataset, we also dropped the other classes that are sparsely covered with data (classes 1, 7 and 9, where there were too few samples).
The modified dataset is then standardized, transformed, and used to train and test the different classification methods.

Figure 8. Accuracy performance of the different methods for classification of the land surface in the "Indian Pines" dataset.

Overall, the Support Vector Classifier (SVC) with PCA appears to be the best-performing model among all tested, achieving the highest accuracy (83.3%) and balanced class-wise metrics. Random Forest (RF) models also perform well across various configurations. Logistic regression models show moderate performance, and the choice between them might depend on specific considerations, such as interpretability and computational efficiency.

It is also worth noting that additionally projecting the KPCA-transformed data into LD space improves the accuracy of the Random Forest classification by almost 5%, while for the SVC method it decreases accuracy, though only by about 2%.

The worst-performing method is GNB, which gives around 60% accuracy for all tested configurations of KPCA and LDA.

Metrics of performance

Precision, recall, F1-score, and support are metrics for evaluating the performance of classification models. They are derived from the confusion matrix, which summarizes the performance of a classification algorithm.

Precision: Precision is the positive predictive value, calculated as the ratio of true positive predictions to the sum of true positives and false positives. High precision indicates that when the model predicts the positive class, the prediction is likely to be correct.

Recall: Recall, also called sensitivity or true positive rate, measures the ability of the model to capture all the positive instances. It is calculated as the ratio of true positives to the sum of true positives and false negatives. High recall indicates that the model is effective at identifying most of the positive instances.

F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall, and it is useful when there is an uneven class distribution. The F1-score ranges from 0 to 1, where a higher value indicates better overall performance.

Support: Support is the number of actual occurrences of the class in the specified dataset, i.e. the number of samples whose true label is that class. Support is not a measure of the model's performance but provides context for the other metrics.

All these metrics are calculated from the confusion matrix, which is used to evaluate the performance of a classification model. The confusion matrix compares the predictions made by a model on a dataset with the actual labels. It is built from the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).

True Positives (TP): Correct prediction of the positive class.
True Negatives (TN): Correct prediction of the negative class.
False Positives (FP): Incorrect prediction of the positive class when the true class is negative (Type I error).
False Negatives (FN): Incorrect prediction of the negative class when the true class is positive (Type II error).

	Predicted Negative	Predicted Positive
Actual Negative	TN	FP
Actual Positive	FN	TP

Precision: Precision measures the accuracy of positive predictions. It is the ratio of true positives to the sum of true positives and false positives.
Precision = TP / (TP + FP)
Recall (Sensitivity or True Positive Rate): Recall measures the ability of the model to capture all positive instances. It is the ratio of true positives to the sum of true positives and false negatives.
Recall = TP / (TP + FN)
F1-Score: The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall.
F1-Score = 2 * (Precision * Recall) / (Precision + Recall)
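
The three formulas can be applied directly to a confusion matrix laid out as in the table above (rows are actual classes, columns are predicted classes). The helper name `per_class_metrics` and the counts are made up for illustration; the same function also works for a multi-class matrix, treating each class in turn as the positive one.

```python
import numpy as np


def per_class_metrics(cm):
    """Return per-class precision, recall and F1 from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as the class, but actually another class
    fn = cm.sum(axis=1) - tp  # actually the class, but predicted as another class
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return precision, recall, f1


# Hypothetical counts: TN=50, FP=10, FN=5, TP=35.
cm = [[50, 10],
      [5, 35]]
precision, recall, f1 = per_class_metrics(cm)

# Index 1 is the positive class.
print(precision[1], recall[1], f1[1])
```

For the positive class this gives Precision = 35 / 45, Recall = 35 / 40 and F1-Score = 70 / 85, matching the formulas above.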

Figure 9. Calculated precision of classification models for all classes of the land (13 "targets").

Overall, model "Target_KPCA_020_LDA_012_GNB" shows high precision across most classes, and model "Target_KPCA_020_LDA_012_RF" has generally high recall for most classes. In terms of F1-score, model "Target_PCA_042_LDA_012_GNB" shows strong performance. The models vary in their performance across the different classes, but Class 5 stands out: it has consistently high precision, recall and F1-score across all models, so it appears to be consistently well predicted.

Examples of Confusion matrices for different Models

Random Forest

Logistic Regression (LogR)

Support Vector Classification (SVC) - Best performing

Gaussian Naive Bayes (Gaussian NB) - Worst performing
