Tokenization plays an essential role in NLP because it converts text into numbers that deep learning models can process.
No deep learning model can work directly with raw text. You need to convert it into numbers, a format the model can understand.
BERT is based on the transformer architecture and is one of the most widely used models in NLP. It uses subword tokenization to split the text.
In this blog post, we will learn about the subword tokenization method and the words that BERT knows.
The original BERT model was pretrained by Google on Wikipedia and BookCorpus. This essentially means the model already knows a fixed set of words, or, more formally, a vocabulary.
Let's start by installing the Transformers library.
! pip install transformers
Let us download the vocabulary using the following code.
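One way this step can look is sketched below. It assumes the vocabulary of the `bert-base-uncased` checkpoint has already been saved locally as a plain `vocab.txt` file with one token per line, where the line position is the token id. The helper names `load_vocab` and `demo_vocab_file` are hypothetical, and the tiny stand-in file exists only so the snippet runs without a download.

```python
import tempfile
from pathlib import Path


def load_vocab(path):
    """Read a BERT-style vocab file: one token per line, id = 0-based line index."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): idx for idx, line in enumerate(f)}


def demo_vocab_file(directory):
    """Write a tiny stand-in vocab file (hypothetical content, not the real vocabulary)."""
    path = Path(directory) / "vocab.txt"
    tokens = ["[PAD]", "[unused0]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "the", "##ing"]
    path.write_text("\n".join(tokens) + "\n", encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    vocab = load_vocab(demo_vocab_file(tmp))

print(len(vocab), vocab["[PAD]"], vocab["the"])  # 8 0 6
```

With the Transformers library installed, `BertTokenizer.from_pretrained("bert-base-uncased").get_vocab()` should give the same kind of token-to-id dictionary for the real vocabulary.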
Let us start looking deep into the vocabulary to understand what it contains.
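In the vocab file, the first 999 entries are reserved. They include the special tokens [PAD] (padding), [UNK] (unknown), [CLS] (classification), [SEP] (separator), and [MASK] (masking). The rest are unused placeholder slots, presumably left free so that new domain-specific tokens can be added without changing the vocabulary size.
Immediately after these come the single-character tokens. They cover all the single letters, and I also found single characters from Kannada, Tamil, Hindi, and other scripts.
Then there are tokens prefixed with a double hash (##). I was not sure at first why the hashes were added. In WordPiece, the ## prefix marks a piece that continues a word rather than starting one.
The first whole English words appear at around position 1997 and continue from there. Even though the vocabulary contains characters from other scripts, this BERT model is not ideal for multilingual classification problems.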
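The groups described above (special tokens, reserved slots, single characters, hash-prefixed pieces, and whole words) can be counted with a few string checks. This is a rough sketch on a toy list, not the real vocabulary. The `bucket` helper and its group names are made up for illustration, and it assumes the reserved slots are named `[unused0]`, `[unused1]`, and so on.

```python
from collections import Counter

SPECIAL = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}


def bucket(token):
    """Assign a vocabulary entry to one of the groups discussed in the text."""
    if token in SPECIAL:
        return "special"
    if token.startswith("[unused") and token.endswith("]"):
        return "reserved"
    if token.startswith("##"):
        return "continuation"  # piece that continues a word
    if len(token) == 1:
        return "single_char"
    return "word"


# Toy stand-in for the real vocabulary (assumption: the real list would be passed instead).
toy_vocab = ["[PAD]", "[unused0]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
             "a", "ಕ", "த", "the", "play", "##ing", "##s"]

counts = Counter(bucket(t) for t in toy_vocab)
print(dict(counts))
```

Running the same function over the full vocabulary gives a quick picture of how much of it is whole words and how much is continuation pieces.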
Average Word Length
So, on average, the words are about seven characters long.
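One plausible way to compute such an average is sketched here. Whether bracketed tokens and ## pieces were excluded from the figure above is not stated, so the filtering is an assumption. The function name `average_token_length` and the toy list are hypothetical.

```python
from statistics import mean


def average_token_length(tokens):
    """Mean character length of whole-word entries.

    Assumption: bracketed tokens such as [PAD] and ## continuation pieces are skipped.
    """
    words = [t for t in tokens
             if not (t.startswith("[") and t.endswith("]"))
             and not t.startswith("##")]
    return mean([len(w) for w in words])


# Toy stand-in for the real vocabulary.
toy_vocab = ["[PAD]", "[CLS]", "the", "play", "token", "##ing", "vocabulary"]
print(average_token_length(toy_vocab))  # 5.5
```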
Out of domain Vocabulary
Thirty thousand words is a good start, but that's not all the words in the dictionary. So how does BERT deal with this? In other words, how can the BERT model understand a word that is not in its vocabulary?
To handle this situation, BERT uses a technique called subword tokenization (specifically, WordPiece).
BERT takes a word and, if it is not in the vocabulary, splits it into smaller pieces that are.
I find this a clever way to handle almost any word in the dictionary while keeping the vocabulary size to a minimum.
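The splitting can be illustrated with a greedy longest-match-first loop, which is the general idea behind WordPiece at inference time. This is a simplified sketch on a toy vocabulary, not the library's implementation. It skips the lowercasing and punctuation splitting that happen beforehand, and the function name `wordpiece` is my own.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single word into known pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # pieces inside a word carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no known piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (assumption: a small hand-picked set, not the real 30k entries).
toy_vocab = {"em", "##bed", "##ding", "##s", "play", "##ing"}
print(wordpiece("embeddings", toy_vocab))  # ['em', '##bed', '##ding', '##s']
print(wordpiece("playing", toy_vocab))     # ['play', '##ing']
print(wordpiece("xyz", toy_vocab))         # ['[UNK]']
```

Because the vocabulary also contains single characters, most words can be broken down this way, which is why the unknown fallback is rarely needed in practice.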
One more critical concept in BERT's tokenization is the use of special tokens. [PAD] is used for padding: if the model needs a fixed max_length of, say, 64 or 512, shorter inputs are filled with [PAD] up to that length.
Similarly, [UNK] stands for unknown tokens. Yes, that can still happen even with subword tokenization.
[CLS], [MASK], and [SEP] are used for particular tasks such as classification and masked language modeling.
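How these tokens end up in a model input can be sketched as follows. The ids in the toy vocabulary are made up and the `encode` helper is hypothetical. The attention mask (1 for real tokens, 0 for padding) is an extra that the text above does not cover, included because padded inputs are usually paired with one.

```python
def encode(tokens, vocab, max_length):
    """Wrap tokens in [CLS]/[SEP], map them to ids, and pad with [PAD] to max_length."""
    # Leave room for the two wrapper tokens (assumes max_length >= 2).
    wrapped = ["[CLS]"] + tokens[: max_length - 2] + ["[SEP]"]
    ids = [vocab.get(t, vocab["[UNK]"]) for t in wrapped]  # unknown pieces map to [UNK]
    mask = [1] * len(ids)
    pad = max_length - len(ids)
    return ids + [vocab["[PAD]"]] * pad, mask + [0] * pad


# Toy vocabulary with made-up ids (the real BERT ids are different).
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "play": 4, "##ing": 5}

ids, mask = encode(["play", "##ing", "chess"], toy_vocab, max_length=8)
print(ids)   # [2, 4, 5, 1, 3, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 0, 0, 0]
```

Here "chess" is not in the toy vocabulary, so it maps to the [UNK] id, and the three trailing zeros are the [PAD] positions.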
One Last Thing
As usual, Jay Alammar proves that a picture is worth 1000 words.
Thanks to Chris McCormick: http://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#22-tokenization