Tokenization for BERT Models

May 30, 2021

Tokenization plays an essential role in NLP because it converts text into numbers that deep learning models can process.

No deep learning model can work directly with raw text. You need to convert it into numbers, or into whatever format the model understands.

BERT is based on the Transformer architecture and is currently one of the best models in the field of NLP. It tokenizes text with a subword tokenization method called WordPiece.

In this blog post, we will learn about the subword tokenization method and the words that the BERT model knows.

PreTrained Model

The original BERT model was pre-trained by Google on Wikipedia and the BooksCorpus dataset. This essentially means the model already knows a set of words or, more formally, a vocabulary.

Let's start by installing the transformers library.

! pip install transformers

Let us download the vocabulary using the following code.
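Here is a minimal sketch of how the vocabulary could be loaded with the Hugging Face `transformers` library. The checkpoint name `bert-base-uncased` is an assumption, since the post does not say which checkpoint it uses, and the helper names `load_bert_vocab` and `tokens_by_id` are made up for illustration.

```python
def load_bert_vocab(model_name="bert-base-uncased"):
    """Download the tokenizer and return its token -> id mapping.

    Assumption: the `bert-base-uncased` checkpoint; swap in the one you use.
    Needs network access the first time it runs.
    """
    # Imported lazily so tokens_by_id below works without transformers installed.
    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    return tokenizer.get_vocab()


def tokens_by_id(vocab):
    """Turn a token -> id mapping into a list of tokens ordered by id."""
    return [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]


# Usage (uncomment to download the real vocabulary):
# vocab = load_bert_vocab()
# tokens = tokens_by_id(vocab)
# print(len(tokens), tokens[:5])
```

Ordering the tokens by id makes it easy to look at the vocabulary position by position, the same way you would read the vocab file line by line.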

Let us take a closer look at the vocabulary to understand what it contains.

In the vocab file, the first 999 tokens are reserved. They include special tokens such as [PAD] (padding), [CLS] (classification), [SEP] (separator), [MASK] (masking), and [UNK] (unknown).

Immediately after them come the single-character tokens. They cover all the single letters, and I also found single characters from Kannada, Tamil, Hindi, and other scripts.

Some tokens start with a double hash (##). In WordPiece, the ## prefix marks a piece that continues a word rather than starting one.

The first English word appears after position 1997, and the list goes on from there. Even though the vocabulary contains characters from other scripts, this BERT model is not ideal for multi-language classification problems.
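The groups described above can be counted with a small helper. This is only a sketch: the function names and the group labels are hypothetical, and the rules are rough heuristics (for example, anything in square brackets that is not an `[unused...]` entry is treated as a special token).

```python
from collections import Counter


def classify_token(token):
    """Put a vocabulary entry into one rough group (heuristic, not official)."""
    if token.startswith("[unused") and token.endswith("]"):
        return "unused"          # placeholder slots such as [unused0]
    if token.startswith("[") and token.endswith("]") and len(token) > 2:
        return "special"         # [PAD], [CLS], [SEP], [MASK], [UNK]
    if token.startswith("##"):
        return "continuation"    # piece that continues a word
    if len(token) == 1:
        return "single_char"     # letters, digits, punctuation, other scripts
    return "word"


def summarize_vocab(tokens):
    """Count how many tokens fall into each group."""
    return Counter(classify_token(token) for token in tokens)


sample = ["[PAD]", "[unused0]", "[CLS]", "a", "b", "##ing", "the"]
print(summarize_vocab(sample))
```

Running `summarize_vocab` on the full token list gives a quick picture of how much of the vocabulary is whole words versus continuation pieces.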

Average Word Length

So on avg, the words are seven characters.
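One way such an average could be computed is sketched below. The filtering is an assumption: bracketed entries like [PAD] and `##` continuation pieces are skipped so that only word-like tokens are counted. The post does not say how its figure was obtained, so the result on the real vocabulary may differ.

```python
def average_token_length(tokens):
    """Mean character length of word-like tokens.

    Assumption: bracketed entries such as [PAD] and '##' continuation
    pieces are excluded from the average.
    """
    words = [
        token
        for token in tokens
        if not token.startswith("##")
        and not (token.startswith("[") and token.endswith("]"))
    ]
    if not words:
        return 0.0  # avoid dividing by zero on an empty selection
    return sum(len(token) for token in words) / len(words)


print(average_token_length(["[PAD]", "the", "##ing", "hello"]))  # (3 + 5) / 2 = 4.0
```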

Out of domain Vocabulary

Thirty thousand words is a good number, but it does not cover every word in the dictionary. So how does BERT deal with this? In other words, how can the BERT model understand a word that is not in its vocabulary?

To handle this situation, BERT uses a technique called subword tokenization.

Subword Tokenization

BERT takes a word and, if the word is not in the vocabulary, splits it into smaller pieces that are.

I find this a neat way to handle almost any word in the dictionary while keeping the vocabulary size to a minimum. Clever!
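The splitting can be sketched as a greedy longest-match-first search, which is how WordPiece tokenization is usually described. The toy vocabulary and the function name `wordpiece_tokenize` are made up for illustration. The real tokenizer also lowercases, splits on punctuation, and caps word length, all of which this sketch leaves out.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary, invented for this example.
toy_vocab = {"token", "##ization", "##ize", "play", "##ing", "##s"}

print(wordpiece_tokenize("tokenization", toy_vocab))  # ['token', '##ization']
print(wordpiece_tokenize("playing", toy_vocab))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", toy_vocab))           # ['[UNK]']
```

Because every piece after the first carries the `##` prefix, the original word can be rebuilt by joining the pieces and dropping the prefixes.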

Special Token

One more critical concept in BERT tokenization is the use of special tokens. [PAD] is used for padding: if the model expects a fixed max_length such as 64 or 512, shorter sequences are filled up to that length with [PAD].
Similarly, [UNK] stands for unknown words. Yes, that can still happen even with subword tokenization, for example when a word contains a character that is not in the vocabulary.

[CLS], [SEP], and [MASK] are helpful in particular downstream tasks: [CLS] and [SEP] mark the start of the input and the boundaries between segments for tasks like classification, and [MASK] is used in masked language modeling.
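To make the role of these tokens concrete, here is a small sketch that wraps a token list in [CLS] and [SEP] and pads it to a fixed length. The function name `build_input` is hypothetical, and the attention mask (1 for real tokens, 0 for padding) is added here as an illustration of how padding is typically ignored by the model; it is not covered in the text above.

```python
def build_input(tokens, max_length, pad_token="[PAD]"):
    """Wrap tokens in [CLS] ... [SEP], truncate, and pad to max_length."""
    if max_length < 2:
        raise ValueError("max_length must leave room for [CLS] and [SEP]")
    body = tokens[: max_length - 2]            # keep space for the two markers
    sequence = ["[CLS]"] + body + ["[SEP]"]
    attention_mask = [1] * len(sequence)       # 1 = real token
    padding = max_length - len(sequence)
    sequence += [pad_token] * padding
    attention_mask += [0] * padding            # 0 = padding position
    return sequence, attention_mask


sequence, mask = build_input(["play", "##ing"], max_length=6)
print(sequence)  # ['[CLS]', 'play', '##ing', '[SEP]', '[PAD]', '[PAD]']
print(mask)      # [1, 1, 1, 1, 0, 0]
```

Inputs longer than `max_length - 2` tokens are simply cut off in this sketch; a real pipeline might choose a different truncation strategy.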