
Scikit-learn CountVectorizer in NLP

Posted in Machine Learning   LAST UPDATED: JULY 1, 2023

    Whenever we work on any NLP-related problem, we process a lot of textual data. After processing, this data needs to be fed into a model; since models don't accept textual data and only understand numbers, the text needs to be vectorized.

    What do I mean by vectorized?

    Before we use text for modeling, we need to process it. The steps include removing stop words, lemmatizing, stemming, tokenization, and vectorization. Vectorization is the process of converting text data into a numerical, machine-readable form, in which words are represented as vectors.


    Our main focus in this article is CountVectorizer. Let's get started by understanding the Bag of Words model first.

    Bag of Words(BoW) Model

    As mentioned above, we cannot pass raw text directly to train our models in Natural Language Processing; we need to convert it into numbers that the machine can understand and model. The Bag of Words(BoW) model is a fundamental (and old) way of doing this.

    The BoW model is very simple: it discards the order of the text (and with it much of the meaning) and only considers the occurrences of each word. In short, it converts a sentence or a paragraph into a bag of words, representing each document as a fixed-length vector of numbers.

    A unique number (generally an array index) is assigned to each word, along with a count representing the number of occurrences of that word. This encoding captures which words appear, and how often, rather than the order in which they appear.

    There are multiple ways to define what this 'encoding' should be. Our focus in this post is on CountVectorizer.

    CountVectorizer

    CountVectorizer tokenizes the text (tokenization means breaking a sentence, a paragraph, or any piece of text into words) and performs very basic preprocessing, such as removing punctuation marks and converting all words to lowercase.

    A vocabulary of known words is formed, which is also used to encode unseen text later.

    An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document. Let's take an example to see how it works.

    Consider the following sentence:

    Out of all the countries of the world, some countries are poor, some countries are rich, but no country is perfect.

    So this text will be represented as follows:

    Table A.

    | word | out | of | all | the | countries | world | some | are | poor | rich | but | no | country | is | perfect |
    | doc  |  1  |  2 |  1  |  2  |     3     |   1   |   2  |  2  |   1  |   1  |  1  |  1 |    1    |  1 |    1    |

    Table B.

    | index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
    | doc   | 1 | 2 | 1 | 2 | 3 | 1 | 2 | 2 | 1 | 1 |  1 |  1 |  1 |  1 |  1 |

    From the tables above, we can see the CountVectorizer representation of the sentence. Table A is how you would think about it visually, while Table B is how it is represented in practice.

    Each row of the matrix represents a document, and the columns correspond to the unique words, with their frequencies as values. If a word does not occur in a document, it is assigned a zero in that document's row.

    You can think of each row as a count-based generalization of a one-hot encoded vector, so it is no surprise that the matrix is sparse, containing a lot of zeros.
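    If you want to see this in code right away, here is a minimal sketch that rebuilds the counts above with CountVectorizer (the sentence and vec names are just illustrative). Note that the columns come out in alphabetical vocabulary order rather than the word order shown in Table A.

    from sklearn.feature_extraction.text import CountVectorizer

    sentence = ["Out of all the countries of the world, some countries are poor, "
                "some countries are rich, but no country is perfect."]

    vec = CountVectorizer()
    matrix = vec.fit_transform(sentence)

    # columns follow the alphabetically sorted vocabulary, so match each
    # word to its column to recover the counts from Table A
    print(vec.get_feature_names_out())
    print(matrix.toarray())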

    The scikit-learn library provides CountVectorizer out of the box; let's now walk through a fuller example to understand it better.

    Using Scikit-learn CountVectorizer:

    In the code block below, we have a list of texts, where each entry is a document. We are keeping it short to see how CountVectorizer works.

    First things first, let's do the import and define document, the list of documents we are going to process:

    from sklearn.feature_extraction.text import CountVectorizer
    
    document=["devastating social and economic consequences of COVID-19",
    "investment and initiatives already ongoing around the world to expedite deployment of innovative COVID-19",
    "We commit to the shared aim of equitable global access to innovative tools for COVID-19 for all",
    "We ask the global community and political leaders to support this landmark collaboration, and for donors",
    "In the fight against COVID-19, no one should be left behind"]

    The second step is to initialize a CountVectorizer object, cv_doc, and fit it on our documents:

    cv_doc = CountVectorizer()
    cv_vector = cv_doc.fit_transform(document)
    

    The text has been preprocessed, tokenized (word-level tokenization, meaning each word is a separate token), and represented as a sparse matrix. A nice detail is that the default tokenization ignores single-character tokens such as 'I' and 'a'.

    To check the complete vocabulary, we can write:

    #checking the vocabulary
    
    print(cv_doc.vocabulary_)

    This is what our vocabulary looks like:


    {'devastating': 17, 'social': 39, 'and': 6, 'economic': 19, 'consequences': 14, 'of': 33, 'covid': 15, '19': 0, 'investment': 28, 'initiatives': 26, 'already': 5, 'ongoing': 35, 'around': 7, 'the': 41, 'world': 46, 'to': 43, 'expedite': 21, 'deployment': 16, 'innovative': 27, 'we': 45, 'commit': 12, 'shared': 37, 'aim': 3, 'equitable': 20, 'global': 24, 'access': 1, 'tools': 44, 'for': 23, 'all': 4, 'ask': 8, 'community': 13, 'political': 36, 'leaders': 30, 'support': 40, 'this': 42, 'landmark': 29, 'collaboration': 11, 'donors': 18, 'in': 25, 'fight': 22, 'against': 2, 'no': 32, 'one': 34, 'should': 38, 'be': 9, 'left': 31, 'behind': 10}

    Note: The numbers here are not counts; they are the column indices of the words in the sparse matrix.

    Let's check the length of our vocabulary and the shape of the matrix:

    print('The length of vocabulary', len(cv_doc.get_feature_names_out()))

    # Shape returned (5, 47) means 5 rows (documents) and 47 columns (unique words)
    print('The shape is', cv_vector.shape)

    # In case you are wondering, get_feature_names_out() returns the vocabulary terms
    cv_doc.get_feature_names_out()


    The length of vocabulary 47
    The shape is (5, 47)
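    If you prefer a more readable view of the full document-term matrix, one option (assuming pandas is installed) is to wrap the dense array in a DataFrame:

    import pandas as pd

    # rows = documents, columns = vocabulary terms, values = counts
    df = pd.DataFrame(cv_vector.toarray(), columns=cv_doc.get_feature_names_out())
    print(df)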

    Further, there are some additional parameters you can play with. Let's go through some of them one by one.

    1. Stop Words:

    You can pass a stop_words list as an argument. Stop words are words that occur frequently but carry little significance, for example 'the', 'and', 'is', 'in', etc. Stop words can be passed as a custom list, or a predefined list can be used by specifying the language. In the examples below, we try a custom list first and then the built-in English stop words.

    Passing Custom List of Stop Words:

    First of all, we are passing a custom list of stop words that we don't want to consider in the vocabulary.

    cv1 = CountVectorizer(stop_words=['the','we','should','this','to'])
    
    #Lets test cv1 on our doc
    
    cv1_doc = cv1.fit_transform(document)
    
    # after removing stop_words now number of unique words 
    # reduced from 47 to 42 and shape returned is (5,42)
    
    print(cv1_doc.shape)


    (5, 42)

    To check the stop words used, we can use the following code:

    #check out the stop_words you specified
    print(cv1.stop_words)

    Using a Predefined set of Stop words:

    CountVectorizer also provides a predefined set of stop words; for that, we just need to pass stop_words='english' during initialization:

    cv2 = CountVectorizer(stop_words='english')
    
    #Lets test cv2 on our doc
    
    cv2_doc = cv2.fit_transform(document)
    
    #after removing stop_words now number of unique words 
    #reduced from 47 to 30 and shape returned is (5,30)
    
    print(cv2_doc.shape)
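    If you are curious which words scikit-learn's built-in English list actually contains, CountVectorizer exposes a get_stop_words() method; a quick sketch:

    # the built-in English stop word list applied when stop_words='english'
    # (it contains a few hundred entries)
    print(len(cv2.get_stop_words()))

    # and the vocabulary that survives the filtering
    print(cv2.get_feature_names_out())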

    2. Using min_df:

    The min_df argument sets a threshold on how rare a word can be and still make it into the vocabulary. There might be some words that appear only once or twice and may qualify as noise.

    What does min_df do?

    With min_df=2, only words that are present in at least 2 documents are kept. We can also pass a proportion instead of an absolute number.

    For example, min_df=0.25 ignores words that are present in less than 25% of the documents.

    #new initialization with min_df=2
    
    cv3 = CountVectorizer(min_df=2)
    
    cv3_doc = cv3.fit_transform(document)
    
    # you will see a lot of words here
    
    print("Stop words: ",cv3.stop_words_)
    
    #Woah! only 10 out of 47 left
    
    print("Shape: ",cv3_doc.shape)
    
    # We can see a lot of the words are removed as they are 
    # present in less than 2 documents, this is the vocabulary 
    # we have now
    
    print("Vocabulary:", cv3.vocabulary_)


    Stop words: {'collaboration', 'all', 'left', 'community', 'ask', 'devastating', 'access', 'equitable', 'consequences', 'support', 'no', 'be', 'ongoing', 'leaders', 'tools', 'aim', 'in', 'commit', 'deployment', 'this', 'world', 'around', 'political', 'economic', 'social', 'already', 'fight', 'shared', 'investment', 'expedite', 'should', 'initiatives', 'donors', 'against', 'behind', 'one', 'landmark'}
    Shape: (5, 10)
    Vocabulary: {'and': 1, 'of': 6, 'covid': 2, '19': 0, 'the': 7, 'to': 8, 'innovative': 5, 'we': 9, 'global': 4, 'for': 3}
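    The same idea works with a proportion. As a small sketch (cv3b is just an illustrative name), min_df=0.4 on our 5 documents keeps only words present in at least 40% of them, which here amounts to at least 2 documents:

    # min_df as a fraction of documents instead of an absolute count
    cv3b = CountVectorizer(min_df=0.4)
    cv3b_doc = cv3b.fit_transform(document)

    # with 5 documents, 0.4 * 5 = 2, so this should mirror min_df=2
    print(cv3b_doc.shape)
    print(cv3b.vocabulary_)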

    3. Using max_df:

    Similar to min_df, there is max_df, which sets an upper threshold on how common a word can be. Some words are so frequent that you may not want them in your vocabulary; in that case, max_df is used.

    It is the opposite of min_df: words that appear in more than the specified number (or proportion) of documents are dropped.

    Let's use a proportion instead of an absolute number here: words that are present in more than 50% of the documents are ignored.

    cv4 = CountVectorizer(max_df=0.50)
    
    cv4_doc = cv4.fit_transform(document)
    
    print("Vocabulary: ", cv4.vocabulary_)
    
    print("Shape: ",cv4_doc.shape)
    
    # these are the words that exceeded the 50% limit
    # and were therefore dropped from the vocabulary
    
    print("Stop words: ", cv4.stop_words_)


    Vocabulary: {'devastating': 14, 'social': 35, 'economic': 16, 'consequences': 12, 'investment': 25, 'initiatives': 23, 'already': 4, 'ongoing': 31, 'around': 5, 'world': 40, 'expedite': 18, 'deployment': 13, 'innovative': 24, 'we': 39, 'commit': 10, 'shared': 33, 'aim': 2, 'equitable': 17, 'global': 21, 'access': 0, 'tools': 38, 'for': 20, 'all': 3, 'ask': 6, 'community': 11, 'political': 32, 'leaders': 27, 'support': 36, 'this': 37, 'landmark': 26, 'collaboration': 9, 'donors': 15, 'in': 22, 'fight': 19, 'against': 1, 'no': 29, 'one': 30, 'should': 34, 'be': 7, 'left': 28, 'behind': 8}
    Shape: (5, 41)
    Stop words: {'the', '19', 'of', 'covid', 'to', 'and'}
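    max_df also accepts an absolute document count. As a small sketch (cv4b is just an illustrative name), max_df=3 drops any term that appears in more than 3 of our 5 documents:

    # max_df as an absolute count: ignore terms appearing in more than 3 documents
    cv4b = CountVectorizer(max_df=3)
    cv4b_doc = cv4b.fit_transform(document)

    print(cv4b_doc.shape)
    print(cv4b.stop_words_)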

    4. Tokenizer:

    If you want to specify your own tokenizer, you can create a function and pass it to CountVectorizer during initialization. We have used the NLTK library to tokenize the text in the example below:

    #defining custom tokenizer, we can tokenize the document easily with libraries like nltk
    
    import nltk
    
    nltk.download('punkt')
    
    from nltk.tokenize import word_tokenize
    
    #custom function
    
    def tok(text):
    
        tokens = word_tokenize(text)
        return tokens
    
    #tok(str(document))
    
    cv5 = CountVectorizer(tokenizer=tok)
    
    cv5.fit_transform(document)
    
    print(cv5.vocabulary_)
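    One small detail: recent scikit-learn versions emit a warning that token_pattern is not used when a custom tokenizer is supplied. Passing token_pattern=None makes that intent explicit and silences the warning, as in this sketch (cv5b is just an illustrative name):

    # token_pattern=None tells scikit-learn we really do want our own tokenizer
    cv5b = CountVectorizer(tokenizer=tok, token_pattern=None)
    cv5b.fit_transform(document)

    print(len(cv5b.vocabulary_))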

    5. Custom Preprocessing:

    The same goes for preprocessing: if you want to include a stemmer or lemmatizer, you can define a custom function just as we did for the tokenizer. Although our data in this post is clean, real-world data is very messy, and if you want to clean it as part of vectorization you can pass a custom preprocessor as an argument to CountVectorizer. To keep the example simple, we just lowercase the text and remove special characters.

    import re

    def custom_preprocessor(text):
        # lowercase the text
        text = text.lower()
        # remove special (non-word) characters
        text = re.sub(r"\W", " ", text)
        return text

    cv6 = CountVectorizer(preprocessor=custom_preprocessor)
    
    cv6.fit_transform(document)
    
    print(cv6.vocabulary_)
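    As mentioned above, a stemmer can be folded into the same hook. Here is a rough sketch using NLTK's PorterStemmer (the stemming_preprocessor name and the exact cleaning steps are just one way to do it), which lowercases, strips special characters, and then stems every word:

    import re
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    def stemming_preprocessor(text):
        # lowercase, drop non-word characters, then stem each word
        text = re.sub(r"\W", " ", text.lower())
        return " ".join(stemmer.stem(word) for word in text.split())

    cv6b = CountVectorizer(preprocessor=stemming_preprocessor)
    cv6b.fit_transform(document)

    print(cv6b.vocabulary_)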

    6. n-grams:

    A combination of words is sometimes more meaningful than the individual words. Say we have the words 'sunny' and 'day'; 'sunny day' combined makes more sense. This is a bigram. We can use both word-level and character-level n-grams. ngram_range=(1,2) specifies that we want to consider both unigrams (single words) and bigrams (combinations of 2 words); a character-level sketch follows the code below.

    cv7 = CountVectorizer(ngram_range=(1,2))
    
    cv7.fit_transform(document)
    
    print(cv7.vocabulary_)
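    For the character-level n-grams mentioned above, you switch the analyzer. For instance, analyzer='char_wb' with ngram_range=(3,3) builds 3-character n-grams from the text inside word boundaries; a small sketch (cv_char is just an illustrative name):

    # character 3-grams built within word boundaries
    cv_char = CountVectorizer(analyzer='char_wb', ngram_range=(3, 3))
    cv_char_doc = cv_char.fit_transform(document)

    print("Number of character 3-grams:", len(cv_char.vocabulary_))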

    7. Limiting Vocabulary size:

    We can specify the maximum vocabulary size we intend to keep using max_features. In this example, we limit the vocabulary size to 20 (naming the object cv8 to avoid clashing with cv7 from the n-grams example above).

    cv8 = CountVectorizer(max_features=20)

    cv8_doc = cv8.fit_transform(document)

    print(cv8.vocabulary_)
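    max_features keeps the terms with the highest total counts across the corpus, so you can check which 20 terms actually survived:

    # the 20 terms that made the cut
    print(cv8.get_feature_names_out())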

    Conclusion:

    Before you go, let's combine most of the things we learned in this post:
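    The snippet below is a minimal sketch of how these options fit together (the parameter values are just illustrative): English stop-word removal, unigrams plus bigrams, a document-frequency floor, and a capped vocabulary, all in one vectorizer.

    from sklearn.feature_extraction.text import CountVectorizer

    cv_final = CountVectorizer(stop_words='english',
                               ngram_range=(1, 2),
                               min_df=2,
                               max_features=20)

    cv_final_doc = cv_final.fit_transform(document)

    print("Shape:", cv_final_doc.shape)
    print("Vocabulary:", cv_final.vocabulary_)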

    Phew! That's all for now. CountVectorizer is just one of many methods to deal with textual data. TF-IDF and word embeddings are more sophisticated ways to vectorize text. More on that later.

    Drop any questions in the comments and don't forget to share this with your friends. Stay Curious!

    Frequently Asked Questions(FAQs)

    1. What is CountVectorizer in Scikit-learn?

    CountVectorizer is a class in Scikit-learn that converts a collection of text documents into a matrix of token counts. It enables the transformation of textual data into a numerical representation suitable for machine learning algorithms.

    2. How does CountVectorizer work in NLP?

    CountVectorizer tokenizes text documents, converting them into a matrix where each row represents a document, and each column represents a unique token. The values in the matrix represent the frequency of each token in each document.

    3. What are n-grams in CountVectorizer?

    N-grams in CountVectorizer refer to contiguous sequences of n tokens. By including n-grams, you can capture not only individual words but also phrases and contextual information. For example, setting n=2 would generate bigrams, and n=3 would generate trigrams.

    4. How can I remove stop words using CountVectorizer?

    Scikit-learn's CountVectorizer provides a parameter called "stop_words" that allows you to specify a list of words to be considered as stop words. Stop words are common words like "and," "the," or "is" that are typically removed because they do not carry significant meaning for the analysis.

    5. Can I customize the preprocessing steps in CountVectorizer?

    Yes, CountVectorizer provides various parameters to customize the preprocessing steps. You can define your own tokenizer, specify regular expressions for token pattern matching, and even apply custom pre-processing functions to manipulate the text data before vectorization.
