LAST UPDATED: OCTOBER 1, 2019

Creating your first Classifier which can classify Handwritten Digits using Python

    In this section, we will write a program that classifies handwritten digits. We are going to use the MNIST handwritten digits dataset [link: https://www.kaggle.com/c/digit-recognizer/data#], whose train.csv file contains 42,000 labelled samples. I will be using this one file for both training and testing the model. I am assuming that you are familiar with classification and have some idea about the scikit-learn library; if not, check out the following articles before you move on with this one:

    Introduction to Machine Learning

    Linear and Logistic Regression

    Introduction to Matplotlib Plotting

    making a ml classifier

    In this blog, I am going to write a simple program that can classify handwritten digits. It can be improved considerably with techniques such as hyperparameter tuning (for example, grid search) and many more. The IDE that I have used is Google Colaboratory, but you can use any IDE of your choice. The first step is to import the libraries that we are going to use throughout the program.

    # handling imports
    import numpy as np
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import svm
    from sklearn.ensemble import RandomForestClassifier

    If you haven't installed these libraries yet, you can install them with pip. Next we load the dataset from the local system into pandas so that we can manipulate it easily. pandas holds the data in a DataFrame, a table with labelled rows and columns, which makes these operations convenient.

    # Option 1: Google Colab users only
    
    from google.colab import files
    
    # first upload the file from your local system
    # to the Colab runtime
    uploaded = files.upload()
    
    # then read the uploaded bytes with pandas
    import io
    
    df = pd.read_csv(io.StringIO(uploaded['train.csv'].decode('utf-8')))
    print(df.head())
    
    # Option 2: any other IDE
    # load the dataset from the working directory
    
    df = pd.read_csv("train.csv")
    # print the first 5 rows
    print(df.head())

    The next step is preparing the data. As I mentioned earlier, I am going to use the train.csv file for both training and testing. So first we put the features (the pixel columns) into one array and the respective labels into another.

    Then we will use those two arrays to split the data for training (fitting the model) and testing (prediction). For splitting the data I have used the train_test_split function from scikit-learn. Its test_size parameter sets the fraction of the data held out for testing; the rest is used for training.
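    To see what the feature/label separation and test_size do before running them on the real file, here is a minimal sketch on a made-up table. The toy DataFrame, its four pixel columns and random_state=0 are my assumptions for illustration; only the "label" column name mirrors the Kaggle file.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv: 10 rows, a "label" column and 4 pixel columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.integers(0, 256, size=(10, 4)),
    columns=[f"pixel{i}" for i in range(4)],
)
toy.insert(0, "label", rng.integers(0, 10, size=10))

# Features are every column except "label"; labels are the "label" column.
X = toy.drop(columns="label").to_numpy()
Y = toy["label"].to_numpy()

# test_size=0.2 holds out 20% of the rows; random_state makes the split repeatable.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)  # (8, 4) (2, 4)
```

    With 10 rows and test_size=0.2, two rows go to the test set and eight stay for training. On the full 42,000-row file the same setting would hold out 8,400 rows.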

    from sklearn.model_selection import train_test_split as tts
    
    # features: every pixel column; labels: the "label" column of train.csv
    X = df.drop(columns="label").values
    Y = df["label"].values
    
    X_train, X_test, Y_train, Y_test = tts(X, Y, test_size = 0.2)  # 0.2 means 20% of the data is used for testing
    
    # create the model [DecisionTreeClassifier]
    clf = DecisionTreeClassifier()
    
    # Y_train = column_or_1d(Y_train, warn = True)
    # train the model with training data
    clf.fit(X_train, Y_train)
    
    # predict labels for the test data
    predicted = clf.predict(X_test)
    print(predicted)
    # print(len(predicted))
    # print(len(Y_test))
    # len(X_test)
    
    # uncomment the below 3 lines if you want to use RandomForest as a Classifier
    # rf = RandomForestClassifier()
    # rf.fit(X_train, Y_train)
    # rf.predict(X_test)
    
    # uncomment these 3 lines if you want to use Support Vector Machine as a Classifier
    # md = svm.SVC(kernel = 'linear')
    # md.fit(X_train, Y_train)
    # md.predict(X_test)

    Output:

    [8 0 5 ... 7 7 5]

    Here I have printed the predicted labels; you can print anything else you want to know about. The more you explore, the more you get to know about your dataset. The method clf.fit(X_train, Y_train) trains the model. It takes two arguments: the first is the features and the second is the respective labels, so that the model can learn the pattern.

    Then it's time to predict, which means it's time to see how our trained model performs on data that it has not seen before. Here we use the clf.predict(X_test) method. It takes only one argument (the features) and predicts a label for each row; we then compare the predicted labels with the actual labels Y_test to calculate the accuracy of our model. One thing to note here is that I have used DecisionTreeClassifier() for classification and have commented out the other two models, RandomForestClassifier() and the support vector machine svm.SVC().

    I would recommend trying those models in the program as well, so that you can compare which one performs better on this data. There are many more models available; if you want to know more about them, visit the official scikit-learn documentation.
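    One way to run that comparison is to loop over the three classifiers and score each on the same split. This is only a sketch: to keep it self-contained I am assuming scikit-learn's small built-in 8x8 digits set (load_digits) as a stand-in for train.csv, and the random_state values are my own choice so that the run is repeatable.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in 8x8 digits set stands in for the Kaggle train.csv.
digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# The three classifiers used in this post, keyed by a display name.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "linear SVM": SVC(kernel="linear"),
}

# Fit each model on the same training data and score it on the same test data.
scores = {}
for name, model in models.items():
    model.fit(X_train, Y_train)
    scores[name] = accuracy_score(Y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```

    Because every model sees exactly the same train/test split, the printed accuracies are directly comparable. On the real train.csv you would pass your own X_train, X_test, Y_train and Y_test instead of the stand-in data.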

    So far our model has been trained and has made its predictions. Now we will calculate the model's accuracy, that is, the fraction of test samples that were classified correctly. I have used two methods: the first is a built-in function and the second is custom made so that you can understand it better.

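    # built-in function for accuracy calculation (returns a fraction between 0 and 1)
    from sklearn.metrics import accuracy_score
    print(accuracy_score(Y_test, predicted))
    
    # custom accuracy calculation (printed as a percentage)
    actual = np.asarray(Y_test)  # positional indexing, even if Y_test is a pandas Series
    count = 0
    for i in range(len(predicted)):
        if predicted[i] == actual[i]:
            count += 1
    
    print("Accuracy :", (count/len(predicted)*100))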

    Here we have compared the predicted values with the actual labels and then calculated the accuracy. There is a lot more to discuss, but this is a simple implementation of a machine learning model using scikit-learn.
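    One of the improvements mentioned at the start of this post is grid search. A rough sketch of it with scikit-learn's GridSearchCV could look like the following. The parameter grid, cv=3 and the use of the built-in load_digits set as a stand-in for train.csv are all my assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: the built-in 8x8 digits set stands in for the Kaggle train.csv.
digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0
)

# Hypothetical grid: every combination of these values is tried.
param_grid = {
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5],
}

# 3-fold cross-validation on the training data picks the best combination,
# then the best model is refit on the whole training set.
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, Y_train)

print(search.best_params_)
print(search.score(X_test, Y_test))  # accuracy of the refit best model on the test set
```

    The test set is only touched once, at the end, so the reported accuracy still reflects data the tuned model has not seen.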

    Additionally, if you want to visualize one row and then predict its label, write this program:

    import matplotlib.pyplot as plt
    
    # sample data
    sample = np.asarray(X_test)[8]    # can use any index below len(X_test)
    d = sample.reshape(28, 28)        # each row holds a 28x28 image as 784 pixel values
    plt.imshow(255-d, cmap = "gray")    # 255-d inverts the image: black digit on a white background
    plt.show()
    print(clf.predict([sample]))

    Output:

    [Image: handwriting recognition output]

    If you have any doubts, you can ask me in the comment section below, and if you want to read more articles that I have written, you can visit my profile.

    Incoming Software Engineer @Vedantu, Codeforces (1765, expert). Former Summer Intern @Wikimedia Foundation(GSoC), @Egnify, @Vedantu.