
Introduction to NLP using NLTK Library in Python

Posted in Machine Learning   LAST UPDATED: SEPTEMBER 14, 2019

    NLP (Natural Language Processing) is a subfield of Computer Science and Artificial Intelligence concerned with enabling computers to process natural languages (such as English, French, and Hindi) so that they can interact with humans easily.

    Natural Language Understanding (NLU) is a subset of NLP and remains one of the hardest challenges in the Artificial Intelligence domain, as it involves making a machine/system understand what a user is actually communicating. Still, with a good NLP system in place, we can perform major tasks such as breaking a sentence into parts, comprehending its meaning, responding to the user (an automated voice/text reply to the user's command), determining the action in the sentence, and so on.




    Importance of NLP

    Natural language is the language humans use to communicate, either as text or as speech. With the growing number of ways to socialize through social media platforms and websites, we have access to a huge amount of natural language in the form of blogs, books, reports, emails, reviews, tweets, etc. When this data is properly annotated, a machine can be trained to understand human language, which improves data analysis, categorization, and collection.


    What is Annotation?

    Annotation is the process of adding metadata (information that describes the data) to mark up the elements of a dataset. A machine trained on annotated data can work with more precision and effectiveness.

    To understand the basic tasks of NLP, let us take Sentiment Analysis as an example application.

    If we need to understand a user's sentiment from their review of a hotel or a movie, we first take the text of the published review. This whole text (a sentence or paragraph) is divided into separate words. Then, in the program, we maintain a bag of words that assigns a sentiment score to each word.

    For Example:

    Mr. Peter reviewed Endgame as follows:

    'Avengers: Endgame' is of course entirely preposterous and, yes, the central plot device here does not, in itself, deliver the shock of the new. But the sheer enjoyment and fun that it delivers, the pure exotic spectacle, are irresistible, as is its insouciant way of combining the serious and the comic.
    - Source Peter Bradshaw review

    Words in the above example text such as enjoyment, fun, and irresistible are positive words, which makes the review positive. Of course, in reality, sentiment analysis is much more complex: many metrics and other measures are involved, which are used to train a classifier such as Naïve Bayes that then predicts the sentiment score of a new review/text.
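    To make the bag-of-words idea concrete, here is a minimal sketch of such a scorer; the word lists and the scoring rule are invented purely for illustration and are not taken from any real lexicon:

    # A toy bag-of-words sentiment scorer. The word lists and scoring rule
    # are made up for illustration; real systems use trained classifiers.
    import string
    
    positive_words = {"enjoyment", "fun", "pure", "irresistible", "insouciant"}
    negative_words = {"preposterous"}
    
    review = ("But the sheer enjoyment and fun that it delivers, the pure exotic "
              "spectacle, are irresistible.")
    
    # Split the review into words and strip surrounding punctuation
    words = [w.strip(string.punctuation) for w in review.lower().split()]
    
    # Score = number of positive words minus number of negative words
    score = sum(w in positive_words for w in words) - sum(w in negative_words for w in words)
    
    print("Sentiment score:", score)  # a positive score suggests a positive review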

    In this article, however, I want to focus mainly on the basic tasks that are carried out in almost any NLP-based algorithm or problem.




    Some Basic Tasks in NLP:

    1. Tokenization

    2. Parts-of-Speech Tagging

    3. Stemming

    4. Lemmatization

    5. Stop Word Removal

    We are going to see how to perform these tasks using Python's NLTK (Natural Language Toolkit) library.

    Requirements:

    • Python 3.6 or Higher

    • NLTK

    After installing Python and pip, use the following commands to install the NLTK library and then download the required data packages such as lexicons, corpora, and other NLTK tools.


    Installing NLTK

    Use the following pip command to install the NLTK library.

    For Windows:

    pip install nltk

    For MacOS/Linux:

    pip install -U nltk

    The above commands install only the NLTK core library, not the lexicons and corpora. These can be downloaded by following the steps below.

    Open an interactive Python shell and execute the following commands:

    import nltk
    
    nltk.download()

    Or create a Python file (downloadNltk.py) with the same two lines and execute it using the command: python downloadNltk.py

    [Image: NLTK download console]

    Executing the command will open a GUI window prompting the user to select and download the corpora, lexicons, and other tools.

    Select the all option and click Download; this will download all the packages. You don't need to change the download directory, it's better to leave it as is. The download will take some time, and after a few minutes the packages will be in place and the system will be ready to use NLTK with all its modules in our programs.

    [Image: NLTK package downloader interface]

    A simpler way to download all these packages using a single line command is as follows:

    python -m nltk.downloader all
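    If you prefer not to download everything, you can also fetch only the resources this article actually uses (these are standard NLTK resource names):

    import nltk
    
    # Download only the resources used in this article instead of the full collection
    nltk.download('punkt')                       # sentence and word tokenizers
    nltk.download('averaged_perceptron_tagger')  # default POS tagger model
    nltk.download('tagsets')                     # tag documentation for nltk.help.upenn_tagset()
    nltk.download('wordnet')                     # lexical database used by WordNetLemmatizer
    nltk.download('stopwords')                   # stop word lists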



    NLP: Tokenization

    The process of breaking a sentence or body of text into individual units called tokens (words in languages like English) is called Tokenization. There are two types of tokenization:

    • Word Tokenization

    • Sentence Tokenization

    You might think that tokenization could be performed manually, simply by writing a small piece of code that splits the text at every full stop or space. But we do not have to worry about all the edge cases, because this is handled by the NLTK library's tokenize module.

    Code Snippet for Sentence and Word Tokenization:

    from nltk.tokenize import sent_tokenize, word_tokenize
    
    text = input("Please Enter a Paragraph: ")
    sent_tokens = sent_tokenize(text, language='english')
    
    print("Tokenized Sentences")
    print(f'Number of Sentences in the given Paragraph are {len(sent_tokens)}')
    print(sent_tokens)
    
    word_tokens = word_tokenize(text)
    
    print("Tokenized Words")
    print(f'Number of Tokens/Words in the given Paragraph are {len(word_tokens)}')
    print(word_tokens)

    The first line of the program imports the sent_tokenize (sentence tokenizer) and word_tokenize (word tokenizer) functions from the nltk.tokenize module.

    Syntax for sent_tokenize:

    sent_tokenize(text, language='english')

    When the above code is executed, the user is prompted to enter a paragraph. This raw text is passed to the sent_tokenize() method, which returns a list of sentences (tokens) in the given paragraph. The sent_tokenize() method takes a required parameter text and an optional parameter language, which defaults to English.

    Syntax for word_tokenize:

    word_tokenize(text, language='english', preserve_line=False)

    Similarly, the word_tokenize() method accepts text as a required parameter and language as an optional parameter, just like sent_tokenize(). It returns a list of words/tokens.

    In the image below, we show an input text (the Endgame review) and the output of the sent_tokenize and word_tokenize methods:

    [Image: word tokenizer example output in NLTK]




    NLP: Parts-Of-Speech Tagging (POS tagging)

    Parts-of-speech tagging is the process of attaching a specific parts-of-speech tag to each word/token. POS tagging helps in a lot of applications, for example taking a sentence and identifying the action in it based on the VERB tags in the POS-tagged text; many other applications build on POS tagging in a similar way. The pos_tag method is available in the NLTK module and can be imported using the following statement:

    from nltk import pos_tag

    Syntax for pos_tag:

    pos_tag(tokens, tagset=None, lang='eng')

    As shown in the syntax, the pos_tag() method takes tokens as input (which can be generated using the word_tokenize() method). It also has two optional parameters: lang, which defines the language for POS tagging, and tagset, which selects the tagset (brown, universal, or wsj) to use for tagging the parts of speech. The lang parameter accepts other languages such as Russian, which can be used by passing the value rus.

    The following program takes a sentence, tokenizes it first, and sends the generated tokens to pos_tag(). The output is a list of tuples, each containing a token and the tag for that token.

    Code Snippet for Parts-Of-Speech tagging:

    from nltk import pos_tag
    from nltk.tokenize import word_tokenize
    
    text = input("Please Enter a Sentence:")
    word_tokens = word_tokenize(text)
    pos_tokens = pos_tag(word_tokens)
    print(pos_tokens)

    Output:

    Please Enter a Sentence:
    Studytonight is a great website for learning
    [('Studytonight', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('website', 'NN'), ('for', 'IN'), ('learning', 'VBG')]

    To understand these tags, you might need a reference. To get it, you can use the built-in help command in the Python interpreter shell:

    nltk.help.upenn_tagset()

    Or you can use this in your program as follows:

    from nltk.help import upenn_tagset
    
    upenn_tagset()

    This will display all the POS tags along with their meaning and some example words. For example:

    VB: verb, base form
    VBP: verb, present tense, not 3rd person singular
    VBG: verb, present participle or gerund

    If you are going to perform POS tagging on multiple sentences, pos_tag() still works fine, but NLTK also provides a more efficient method named pos_tag_sents(), which tags a list of tokenized sentences in one call, as shown in the sketch below.
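    Here is a minimal sketch of pos_tag_sents(); it expects a list of already tokenized sentences and returns one list of (token, tag) tuples per sentence. The sample text is just an assumption for illustration:

    from nltk import pos_tag_sents
    from nltk.tokenize import sent_tokenize, word_tokenize
    
    text = "Studytonight is a great website. I am learning NLP here."
    
    # Split the paragraph into sentences, then each sentence into word tokens
    sentences = [word_tokenize(sentence) for sentence in sent_tokenize(text)]
    
    # Tag all sentences in a single call instead of calling pos_tag() in a loop
    tagged_sentences = pos_tag_sents(sentences)
    for tagged in tagged_sentences:
        print(tagged)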




    NLP: Stemming

    Consider a large document containing a lot of information in the form of text. Suppose you want to write a program that counts the frequency of the different words in the document. You might first use tokenization to get a list of words and then write a few lines of code to count their frequencies. At a later stage, however, you might find that there are words which have the same meaning but different morphological forms.

    For example, the word standard can be written in the following ways based on the requirement of the author:

    standard, standards, standardization etc

    If you really want to calculate frequencies per unique root word, you can simply apply Stemming to these words first and then run your frequency-counting code (a short sketch combining stemming with a frequency count appears at the end of this section).

    The PorterStemmer class is available in nltk.stem and is imported using the following statement:

    from nltk.stem import PorterStemmer

    In order to use this PorterStemmer class, we have to use the following code statements:

    portStemmer = PorterStemmer()
    portStemmer.stem(word)

    Using this PorterStemmer object, we access the stem() method, which takes a word/token and returns the root word. The method first converts the word to lower case and then applies its stemming rules to return the root word.

    Code Snippet for Stemming:

    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer
    
    text = input("Please Enter a Sentence: ")
    word_tokens = word_tokenize(text)
    
    # Create an object of the PorterStemmer class
    portStemmer = PorterStemmer()
    for word in word_tokens:
        print(word, " ==> ", portStemmer.stem(word))

    Output:

    Please Enter a Sentence: standard Standardization
    standard ==> standard
    Standardization ==> standard

    In the program, we just did Suffix Stripping, which is nothing but removing suffixes from a word. Some of the common suffixes are -es, -ing, etc.

    This causes some problems with Stemming. To understand them, let us run the above stemming code snippet on the word Multiplicativity. If you think the root word after stemming will be Multiple or Multiplicative, you are wrong.

    After stemming the word multiplicativity, the returned root word is multipl; the stemmer strips the suffix, and what remains is not a real word.

    So the problem with Stemming is that it removes the inflection and returns a string which may or may not be a meaningful word.
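    Coming back to the word-frequency idea from the start of this section, here is a minimal sketch that counts frequencies over stemmed tokens instead of raw words; the sample sentence is made up for illustration:

    from collections import Counter
    
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer
    
    text = "The standard was updated, and the new standards follow the standardization process."
    
    portStemmer = PorterStemmer()
    
    # Stem every token so that different forms of a word are counted together
    stems = [portStemmer.stem(word) for word in word_tokenize(text)]
    
    # Count how often each stem occurs
    frequency = Counter(stems)
    print(frequency.most_common(5))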




    NLP: Lemmatization

    Unlike stemming, lemmatization returns a root word (lemma) that has a meaning and belongs to the language. But this doesn't mean that Lemmatization is always better than Stemming; which one to use depends on the application and domain in which you are using them.

    The WordNetLemmatizer class is available in nltk.stem and is imported using the following statement:

    from nltk.stem import WordNetLemmatizer

    In order to use this WordNetLemmatizer class, we create an object of it using the following statement and then call the lemmatize() method on it. This returns the proper root word belonging to the language.

    lemmatizer = WordNetLemmatizer()
    
    lemmatizer.lemmatize(word)

    Code Snippet for Lemmatization:

    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer
    
    text = input("Please Enter a Sentence : ")
    word_tokens = word_tokenize(text)
    
    # Create an object of the class WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()
    for word in word_tokens:
        print(word, " ==> ", lemmatizer.lemmatize(word))

    Output:

    Please Enter a Sentence: multiplicativity beautiful Multiplicative
    
    multiplicativity ==> multiplicativity
    beautiful ==> beautiful
    Multiplicative ==> Multiplicative
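    Note that lemmatize() treats every word as a noun by default; passing the word's part of speech via the optional pos parameter can change the result. A small sketch:

    from nltk.stem import WordNetLemmatizer
    
    lemmatizer = WordNetLemmatizer()
    
    # By default the word is treated as a noun, so 'running' comes back unchanged
    print(lemmatizer.lemmatize("running"))           # running
    
    # Passing pos='v' tells the lemmatizer to treat the word as a verb
    print(lemmatizer.lemmatize("running", pos="v"))  # run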
    



    NLP: Stop Word Removal

    Stop words are words that are very commonly used in a language; some examples are: the, be, at, by, can, cannot, me, my, etc. In most cases we do not need these words for our task, so we remove them before or after processing the natural language data (text).

    In most sentiment analysis applications, stop words are not needed because they have no effect on the sentiment of a sentence. Removing them increases efficiency by avoiding unnecessary processing of those words.

    A lot of corpora were already downloaded while setting up NLTK, so we already have lists of stop words available as a corpus in the NLTK library.

    We can import it using the following statement:

    from nltk.corpus import stopwords

    Now we can ask the stopwords corpus for the stop words of a particular language. To get the stop words list, use the following statement:

    stopwordsList = stopwords.words("english")

    This returns a list of stop words in that language. Print the stopwords list to see all the stopwords.

    print(stopwordsList)

    Here we passed the value "english", which means it will return the stop words of the English language. Some of the other languages you can pass are "spanish", "french", "russian", etc.
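    For example, passing a different (lowercase) language name returns that language's list, assuming it is one of the lists shipped with the stopwords corpus:

    from nltk.corpus import stopwords
    
    # The stopwords corpus ships word lists for several languages
    spanishStopwords = stopwords.words("spanish")
    print(spanishStopwords[:10])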

    Script for removing Stop words:

    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    
    stopwordsList = stopwords.words("english")
    text = input("Please enter a sentence: ")
    tokenized_words = word_tokenize(text)
    filteredWords = [word for word in tokenized_words if word not in stopwordsList]
    print(filteredWords)

    Output:

    Please enter a sentence: There are lots of people in the Museum
    ['There', 'lots', 'people', 'Museum']

    You can observe that all the stop words were removed from the list (note that 'There' survived because the NLTK stop word list is lowercase). If you have your own list of stop words, you can load it from a text/CSV file into a list and use it the same way, as shown in the sketch below.
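    Here is a minimal sketch of using a custom stop word list loaded from a plain text file; the file name my_stopwords.txt and its one-word-per-line format are assumptions for illustration:

    from nltk.tokenize import word_tokenize
    
    # Hypothetical file with one stop word per line
    with open("my_stopwords.txt") as f:
        customStopwords = {line.strip().lower() for line in f if line.strip()}
    
    text = input("Please enter a sentence: ")
    tokenized_words = word_tokenize(text)
    
    # Compare in lower case so capitalized stop words are also removed
    filteredWords = [word for word in tokenized_words if word.lower() not in customStopwords]
    print(filteredWords)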

    Now that you have seen the basic tasks that can be performed with NLP, let us look at some real-world applications where NLP plays a major role.




    Applications of NLP

    Following are some of the real-world applications in which NLP is used:

    1. ChatBots

    2. Neural Machine Translation

    3. Summarising large Documents

    4. Sentiment Analysis

    5. Gaining market intelligence from unstructured text

    6. Advertisement

    7. Healthcare

    About the author:
    Software Engineer | AI | ML | Geek
    Tags: Python, NLTK, NLP