Model Evaluation and Hyperparameter Tuning in Machine Learning

To understand model evaluation and hyperparameter tuning when building and testing a machine learning model, we will pick a dataset, split it into training and test sets, and fit an ML algorithm to it.

We will be working with the Breast Cancer Wisconsin dataset, which contains 569 samples of malignant and benign tumor cells. The first column stores the unique ID number of each sample, and the second column holds the corresponding diagnosis (M = malignant, B = benign). Columns 3 to 32 contain 30 real-valued features computed from digitized images of the cell nuclei, which can be used to build a model that predicts whether a tumor is benign or malignant.

The Breast Cancer Wisconsin dataset is hosted in the UCI Machine Learning Repository, and more detailed information about it can be found on the UCI website.
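For a quick offline look at the data described above, scikit-learn bundles a copy of this dataset. The sketch below loads it and checks the shape and class balance. Note that the bundled copy has no ID column and encodes the labels as 0 = malignant, 1 = benign, which is the reverse of the LabelEncoder mapping used later in this article; the variable names here are only illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the copy of the dataset that ships with scikit-learn (no download needed).
data = load_breast_cancer()
X_bundled, y_bundled = data.data, data.target

print(X_bundled.shape)  # (569, 30): 569 samples, 30 features

# Count the samples per class; index 0 is 'malignant', index 1 is 'benign'.
class_counts = np.bincount(y_bundled)
for name, count in zip(data.target_names, class_counts):
    print(str(name), int(count))
```

The class counts show that the dataset is moderately imbalanced, which is worth keeping in mind when reading accuracy figures.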

Reading the Data and Splitting it:

In this section we will read the data and split it into training and test datasets in three simple steps: load the CSV file, encode the class labels, and split the samples.

import pandas as pd
import urllib.error

# Try the UCI repository first; fall back to a GitHub mirror if it is unreachable.
try:
  df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
except urllib.error.URLError:
  df = pd.read_csv('https://raw.githubusercontent.com/rasbt/python-machine-learning-book/master/code/datasets/wdbc/wdbc.data', header=None)

import sklearn
from sklearn.preprocessing import LabelEncoder

# Assign the 30 features to X and the diagnosis column to y.
# LabelEncoder converts the class labels from their original string
# representation ('B', 'M') to integers.
X = df.loc[:, 2:].values
y = df.loc[:, 1].values
le = LabelEncoder()
y = le.fit_transform(y)
# Check the mapping on two sample labels: 'M' is encoded as 1 and 'B' as 0.
print(le.transform(['M', 'B']))

# train_test_split lives in sklearn.model_selection (scikit-learn 0.18 and later);
# the old sklearn.cross_validation module has been removed.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)

# to see datasets
# print(X_train)
# print(X_test)

The above code divides the dataset into two sets: 80% of the samples (455) are used for training and the remaining 20% (114) are held out for testing.
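To make the split concrete, here is a minimal pure-Python sketch of a shuffled holdout split. It is not scikit-learn's implementation and will not reproduce the same shuffle as train_test_split; the helper name holdout_split is hypothetical and only illustrates the idea.

```python
import math
import random

def holdout_split(n_samples, test_size=0.20, seed=1):
    """Return (train_indices, test_indices) for a shuffled holdout split."""
    indices = list(range(n_samples))
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Round the test set size up, then give the rest to training.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(569)
print(len(train_idx), len(test_idx))  # 455 114
```

If the class proportions should be preserved in both sets, train_test_split also accepts a stratify argument (for example stratify=y).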

Combining Transformers and Estimators in a pipeline:

Many learning algorithms require input features on the same scale for optimal performance. Thus, we need to standardize the columns in the Breast Cancer Wisconsin dataset before we can feed them to a linear classifier, such as logistic regression. Furthermore, let's assume that we want to compress our data from the initial 30 dimensions onto a lower two-dimensional subspace via principal component analysis (PCA), a feature extraction technique for dimensionality reduction.
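As a rough illustration of standardization, the sketch below rescales a single feature to zero mean and unit variance. The values are made up, and the helper names fit_scaler and apply_scaler are hypothetical; the point is that the mean and standard deviation are learned from the training data only and then reused on the test data.

```python
from statistics import fmean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return fmean(column), pstdev(column)

def apply_scaler(column, mean, std):
    """Standardize values using previously learned statistics."""
    return [(value - mean) / std for value in column]

train_feature = [12.0, 15.0, 11.0, 20.0, 17.0]  # made-up training values
test_feature = [14.0, 25.0]                     # made-up test values

# Statistics come from the training data only...
mean, std = fit_scaler(train_feature)
train_scaled = apply_scaler(train_feature, mean, std)
# ...and are reused unchanged on the test data.
test_scaled = apply_scaler(test_feature, mean, std)

print(mean)
print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test set from leaking into preprocessing.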

Instead of going through the fitting and transformation steps for the training and test dataset separately, we can chain the StandardScaler, PCA, and LogisticRegression objects in a pipeline:

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipe_lr = Pipeline([('scl', StandardScaler()),
                    ('pca', PCA(n_components=2)),
                    ('clf', LogisticRegression(random_state=1))])

pipe_lr.fit(X_train, y_train)
print('Test Accuracy: %.3f' % pipe_lr.score(X_test, y_test))
y_pred = pipe_lr.predict(X_test)

The Pipeline object takes a list of tuples as input, where the first value in each tuple is an arbitrary identifier string that we can use to access the individual elements in the pipeline, and the second element in every tuple is a scikit-learn transformer or estimator.
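The identifier strings are what make individual steps addressable. The sketch below uses synthetic data from make_classification (an assumption made so the example runs without a download) to show two uses: looking up a fitted step through named_steps, and setting a nested parameter with the step__parameter naming scheme.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 30 features (not the breast cancer dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=30,
                                     n_informative=5, random_state=1)

pipe_demo = Pipeline([('scl', StandardScaler()),
                      ('pca', PCA(n_components=2)),
                      ('clf', LogisticRegression(random_state=1))])
pipe_demo.fit(X_demo, y_demo)

# Access a fitted step by its identifier string.
pca_step = pipe_demo.named_steps['pca']
print(pca_step.explained_variance_ratio_)

# Nested parameters are addressed as <step name>__<parameter name>.
pipe_demo.set_params(clf__C=0.1)
print(pipe_demo.get_params()['clf__C'])
```

The same double-underscore names are what a parameter grid uses to refer to hyperparameters inside a pipeline.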

Fine-tuning Machine Learning models via Grid Search:

In machine learning, we have two types of parameters:

those that are learned from the training data, for example, the weights in logistic regression,
and the parameters of a learning algorithm that are optimized separately.

The latter are the tuning parameters, also called hyperparameters, of a model: for example, the regularization parameter in logistic regression or the depth parameter of a decision tree. We will now look at a powerful hyperparameter optimization technique called grid search, which can further improve the performance of a model by finding the best combination of hyperparameter values among those we specify.
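The distinction can be seen directly on a scikit-learn estimator. In this small sketch on synthetic data (an assumption for illustration), C is a hyperparameter that we choose before fitting, while the weights in coef_ exist only after the model has been fitted to the training data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary classification problem (illustrative only).
X_small, y_small = make_classification(n_samples=200, n_features=5,
                                       n_informative=3, n_redundant=0,
                                       random_state=1)

# C is a hyperparameter: we set it ourselves before training.
clf = LogisticRegression(C=0.1, random_state=1)
print(clf.get_params()['C'])
has_weights_before_fit = hasattr(clf, 'coef_')
print(has_weights_before_fit)

# The weights are learned parameters: they appear only after fitting.
clf.fit(X_small, y_small)
print(clf.coef_.shape)
```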

Tuning hyperparameters via Grid search:

The approach of grid search is quite simple: it is a brute-force exhaustive search in which we specify a list of values for different hyperparameters, and the computer evaluates the model performance for each combination of those values to find the best-performing set:

from sklearn.svm import SVC

# GridSearchCV lives in sklearn.model_selection (scikit-learn 0.18 and later);
# the old sklearn.grid_search module has been removed.
from sklearn.model_selection import GridSearchCV

pipe_svc = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

param_grid = [{'clf__C': param_range, 'clf__kernel': ['linear']}, {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']}]

gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, scoring='accuracy', cv=10, n_jobs=-1)

gs = gs.fit(X_train, y_train)

print(gs.best_score_)

print(gs.best_params_)

Using the preceding code, we initialized a GridSearchCV object from the sklearn.model_selection module (sklearn.grid_search in versions before 0.18) to train and tune a support vector machine (SVM) pipeline. We set the param_grid parameter of GridSearchCV to a list of dictionaries to specify the parameters that we want to tune. For the linear SVM, we only evaluated the inverse regularization parameter C; for the RBF kernel SVM, we tuned both the C and gamma parameters.
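To get a feel for how much work this grid implies, the list of dictionaries can be expanded by hand. The following is a sketch using only the standard library, not scikit-learn's internal code; the helper name expand_grid is hypothetical.

```python
from itertools import product

param_range = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
param_grid = [
    {'clf__C': param_range, 'clf__kernel': ['linear']},
    {'clf__C': param_range, 'clf__gamma': param_range, 'clf__kernel': ['rbf']},
]

def expand_grid(grid):
    """List every hyperparameter combination described by a list of dicts."""
    candidates = []
    for sub_grid in grid:
        names = sorted(sub_grid)
        # Cartesian product of the value lists within one dictionary.
        for values in product(*(sub_grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates

candidates = expand_grid(param_grid)
n_folds = 10
print(len(candidates))            # 72 (8 linear + 64 RBF combinations)
print(len(candidates) * n_folds)  # 720 model fits during cross-validation
```

Each dictionary is expanded separately, so gamma is only varied for the RBF kernel. The cost grows multiplicatively with every extra hyperparameter, which is why grid search can become expensive.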

Note that the gamma parameter is specific to kernel SVMs. After we used the training data to perform the grid search, we obtained the score of the best-performing model via the best_score_ attribute and looked at its parameters, which can be accessed via the best_params_ attribute. In this particular case, the linear SVM model with clf__C = 0.1 yielded the best k-fold cross-validation accuracy: 97.8 percent.
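The cross-validation score is computed on the training data only, so a natural follow-up is to check the selected model on the held-out test set. The sketch below does this on synthetic data with a deliberately small grid (both assumptions, chosen to keep the example fast and offline); with the default refit behavior, best_estimator_ is the pipeline refitted on the whole training set using the best parameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the breast cancer dataset).
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn,
                                          test_size=0.20, random_state=1)

pipe = Pipeline([('scl', StandardScaler()), ('clf', SVC(random_state=1))])
small_grid = [{'clf__C': [0.1, 1.0, 10.0], 'clf__kernel': ['linear']}]

search = GridSearchCV(estimator=pipe, param_grid=small_grid,
                      scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

# The best pipeline, refitted on all of the training data.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_)
print('Test accuracy: %.3f' % test_accuracy)
```

In the article's own code, the equivalent step would be to call gs.best_estimator_.score(X_test, y_test).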

Conclusion

You should now understand how to split a dataset, chain preprocessing steps and a classifier in a pipeline, and tune hyperparameters with grid search. If you have any questions, ask in the comment section below.
