How to Create N-Grams From Text in Python

Olorunfemi Akinlua Feb 02, 2024
  1. Use the for Loop to Create N-Grams From Text in Python
  2. Use nltk to Create N-Grams From Text in Python

In computational linguistics, n-grams are a basic building block for language processing and for contextual and semantic analysis. An n-gram is a contiguous sequence of n items (here, words) taken from a sequence of tokens.

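The most common are unigrams (n = 1), bigrams (n = 2), and trigrams (n = 3). These work well in practice; for n > 3, data sparsity becomes a problem because most longer sequences occur only rarely in a corpus, if at all.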
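
As a rough illustration of the sparsity point, the sketch below counts n-grams in a toy sentence and reports how many of them occur more than once. The helper name count_ngrams and the sample sentence are made up for this example, and a real corpus would behave less neatly.

```python
from collections import Counter


def count_ngrams(tokens, n):
    # Count every window of n consecutive tokens (hypothetical helper).
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


# Toy sentence chosen only for illustration.
tokens = "the cat sat on the mat and the cat slept".split()

for n in (1, 2, 3):
    counts = count_ngrams(tokens, n)
    # An n-gram is "repeated" if it occurs more than once.
    repeated = sum(1 for c in counts.values() if c > 1)
    print(n, len(counts), repeated)
```

On this sentence, the number of repeated n-grams falls from 2 for unigrams to 1 for bigrams and 0 for trigrams, which is the same effect that makes large values of n hard to estimate from limited data.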

This article shows how to create n-grams in Python, first with a plain for loop and then with the NLTK library.

Use the for Loop to Create N-Grams From Text in Python

We can write an ngrams function that takes the text and the value of n and returns a list of the n-grams.

Inside the function, we split the text into words and create an empty list (output) to hold the n-grams. A for loop then walks through the splitInput list, stopping at the last position where a full window of num words still fits.

At each position, the slice of num consecutive words (tokens) is appended to the output list.

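def ngrams(text, num):
    # Split on whitespace; punctuation stays attached to the words.
    splitInput = text.split()
    output = []
    # Slide a window of num tokens across the list.
    for i in range(len(splitInput) - num + 1):
        output.append(splitInput[i : i + num])
    return output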


text = "Welcome to the abode, and more importantly, our in-house exceptional cooking service which is close to the Burj Khalifa"
print(ngrams(text, 3))

The output of the code:

[['Welcome', 'to', 'the'], ['to', 'the', 'abode,'], ['the', 'abode,', 'and'], ['abode,', 'and', 'more'], ['and', 'more', 'importantly,'], ['more', 'importantly,', 'our'], ['importantly,', 'our', 'in-house'], ['our', 'in-house', 'exceptional'], ['in-house', 'exceptional', 'cooking'], ['exceptional', 'cooking', 'service'], ['cooking', 'service', 'which'], ['service', 'which', 'is'], ['which', 'is', 'close'], ['is', 'close', 'to'], ['close', 'to', 'the'], ['to', 'the', 'Burj'], ['the', 'Burj', 'Khalifa']]
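
Note that the output above keeps punctuation attached to words, as in 'abode,'. The sketch below is one possible variation, not part of the original function: it strips punctuation with a simple regular expression and builds the same sliding windows with zip. The names tokenize and ngrams_zip are hypothetical, and the pattern is only one assumption about what counts as a word.

```python
import re


def tokenize(text):
    # Keep runs of letters, digits, apostrophes and hyphens; drop other punctuation.
    return re.findall(r"[A-Za-z0-9'-]+", text)


def ngrams_zip(tokens, n):
    # Zip n shifted copies of the token list together. zip stops at the
    # shortest copy, so only full windows of n tokens are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = tokenize("Welcome to the abode, and more importantly, our in-house service")
print(ngrams_zip(tokens, 3)[:2])
# [('Welcome', 'to', 'the'), ('to', 'the', 'abode')]
```

This version returns tuples rather than lists, which makes the n-grams hashable and therefore usable as dictionary keys when counting them.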

Use nltk to Create N-Grams From Text in Python

The NLTK library is a natural language toolkit that provides an easy-to-use interface to resources important for text processing and tokenization, among others. To install nltk, we can use the pip command below.

pip install nltk

Before writing a fuller example, let’s call the word_tokenize() function, which returns a tokenized copy of the text passed to it using NLTK’s recommended word tokenizer. On a fresh installation, this call exposes a common issue.

import nltk

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

The output of the code:

Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\SFTP\n-gram-two.py", line 4, in <module>
    tokens = nltk.word_tokenize(text)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "C:\Python310\lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "C:\Python310\lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "C:\Python310\lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/english.pickle

  Searched in:
    - 'C:\\Users\\akinl/nltk_data'
    - 'C:\\Python310\\nltk_data'
    - 'C:\\Python310\\share\\nltk_data'
    - 'C:\\Python310\\lib\\nltk_data'
    - 'C:\\Users\\akinl\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
    - ''
**********************************************************************

The error occurs because some NLTK functions depend on data files that are not installed with the library itself, so they are missing the first time you use it. The traceback names the missing resource, punkt, which word_tokenize() requires. Here we use the NLTK downloader to fetch two data packages: punkt and averaged_perceptron_tagger, the latter being used by NLTK’s part-of-speech tagger.

Other data-dependent features, for example corpus methods such as words(), need their data downloaded in the same way. To download the data from within a Python script, we use the download() function.

You can create a Python file and run the code below to solve the issue.

import nltk

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

Or run the following commands through your command line interface:

python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger

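With the data in place, word_tokenize() now works, and nltk.bigrams() and nltk.trigrams() build the n-grams from the tokens.

Example Code: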

import nltk

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)

print(list(textBigGrams), list(textTriGrams))

The output of the code:

[('well', 'the'), ('the', 'money'), ('money', 'has'), ('has', 'finally'), ('finally', 'come')] [('well', 'the', 'money'), ('the', 'money', 'has'), ('money', 'has', 'finally'), ('has', 'finally', 'come')]

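To print the n-grams as readable strings instead of tuples, we can join the words of each one with a space.

Example Code: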

import nltk

text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)

print("The Bigrams of the Text are")
print(*map(" ".join, textBigGrams), sep=", ")

print("The Trigrams of the Text are")
print(*map(" ".join, textTriGrams), sep=", ")

The output of the code:

The Bigrams of the Text are
well the, the money, money has, has finally, finally come
The Trigrams of the Text are
well the money, the money has, money has finally, has finally come
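
The examples above cover only bigrams and trigrams. For other values of n, NLTK also provides a general ngrams function in nltk.util. The sketch below uses it to build 4-grams; it splits the text with str.split() instead of word_tokenize(), an assumption made here so that the example runs without any downloaded NLTK data.

```python
from nltk.util import ngrams

# Plain whitespace split, so no NLTK data download is needed for this sketch.
tokens = "well the money has finally come".split()

# ngrams() returns an iterator of tuples, so we wrap it in list() to print it.
fourgrams = list(ngrams(tokens, 4))
print(fourgrams)
# [('well', 'the', 'money', 'has'), ('the', 'money', 'has', 'finally'), ('money', 'has', 'finally', 'come')]
```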