Build faster, state-of-the-art NLP models in TensorFlow 2


Inside the world of NLP, we have seen tremendous progress in the last few years, and we can easily divide it into a pre-BERT era and a post-BERT era. Transformer-based models have dominated the field for the last 3 years. There are quite a few variations of transformer-based models that address particular drawbacks, but the core idea remains the same. From 124M-parameter models to 13B-parameter models, progress has been rapid. This progress poses the challenge of how these huge models can be used in production, especially by startups and medium-tier companies. This is where optimisation and clever engineering come into the picture.

With the release of Hugging Face's transformers, things have become quite accessible and easy for ordinary users. With its aim to democratize NLP, Hugging Face is one of the greatest things to happen to NLP practitioners. Being an avid NLP user, I find Hugging Face great, but its support for TensorFlow is minimal, and that pushed me to dig deeper and find out where things were not going well with TensorFlow. I have been working on this for the last year, in my personal time.

Modifying the Hugging Face source code was not an option, because making TF models serializable while keeping them generalizable was a hard task. But with a few tricks and some compromises, tf-transformers can be used to solve almost all NLP problems.

tf-transformers is TensorFlow with a nitro boost.


[Benchmark figures — refer to GitHub for HQ images:
Greedy decode (GPT2, few samples) and greedy full comparison — output shape (batch_size, sequence_length);
Beam decode (GPT2) and beam full comparison — output shape (batch_size, sequence_length, beam_size);
Top-K-Top-P (GPT2) and Top-K-Top-P full comparison — output shape (batch_size, sequence_length, num_return_sequences).]

For full benchmark results and code, please refer to GitHub. tf-transformers surpasses Hugging Face transformers in all experiments. Compared to PyTorch, tf-transformers is faster in 179 of 220 experiments, though not by a huge margin. Similar results hold for T5 models as well. All experiments were run on a V100 GPU.


Variable Batch Decoding

tf-transformers supports variable batch_size decoding even in serialized models, which makes decoding even faster.
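tf-transformers' actual serialization machinery is more involved, but the underlying TensorFlow idea can be sketched as follows: leaving the batch (and sequence) dimension as None in the input signature lets one serialized concrete function serve any batch size. The toy embedding "decoder" below is an assumption for illustration, not the library's model.

```python
import tensorflow as tf

# Toy "decoder": embeds token ids and returns per-token logits.
embedding = tf.keras.layers.Embedding(input_dim=100, output_dim=8)
dense = tf.keras.layers.Dense(100)

# None in the TensorSpec dims means the traced (serializable) function
# accepts any batch size and sequence length at inference time.
@tf.function(input_signature=[tf.TensorSpec([None, None], tf.int32)])
def decode_step(input_ids):
    return dense(embedding(input_ids))

# The same signature serves batch sizes 1 and 4 without retracing.
out1 = decode_step(tf.constant([[1, 2, 3]], dtype=tf.int32))
out4 = decode_step(tf.constant([[1, 2, 3]] * 4, dtype=tf.int32))
print(out1.shape, out4.shape)  # (1, 3, 100) (4, 3, 100)
```

A function traced this way can be attached to a `tf.Module` and exported with `tf.saved_model.save`, keeping the variable-batch behaviour in the serialized artifact.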

Multiple Mask Mode

There are 3 mask_mode values: causal, user_defined, and prefix. By default, GPT2 uses causal masking. Just by changing it to prefix, we can use the model for text-generation tasks like summarisation, where it is always better to have bi-directional context over the input. For MLM, user_defined masking should be used. This is done by changing one argument when initializing the model.
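The three mask modes can be illustrated with a small NumPy sketch (this is not tf-transformers' internal implementation; only the mask_mode names are taken from the library):

```python
import numpy as np

def make_mask(seq_len, mask_mode, prefix_len=None, user_mask=None):
    """Build a (seq_len, seq_len) attention mask; 1 = may attend."""
    if mask_mode == "causal":
        # Each position attends only to itself and earlier positions.
        return np.tril(np.ones((seq_len, seq_len), dtype=np.int32))
    if mask_mode == "prefix":
        # Bi-directional over the first prefix_len tokens, causal after:
        # the summary input is seen fully, the generated part stays causal.
        mask = np.tril(np.ones((seq_len, seq_len), dtype=np.int32))
        mask[:, :prefix_len] = 1
        return mask
    if mask_mode == "user_defined":
        # Caller supplies the mask directly (e.g. a padding mask for MLM).
        return np.asarray(user_mask, dtype=np.int32)
    raise ValueError(f"unknown mask_mode: {mask_mode}")

print(make_mask(4, "causal"))
print(make_mask(4, "prefix", prefix_len=2))
```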

Fast Sentence Piece Alignment

The LCS method takes ~2300 seconds over the SQuAD v1 training examples, whereas fast alignment takes only ~300 seconds.
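The speed-up comes from replacing pairwise LCS matching with a single left-to-right scan that maps each piece to a character span. The function below is an illustrative sketch of that idea, not the library's code, and it assumes well-behaved input where pieces appear in order ("▁" is the SentencePiece word-boundary marker):

```python
def align_pieces(text, pieces):
    """Map each sub-word piece to a (start, end) character span in `text`
    by one left-to-right scan -- linear time, unlike LCS matching."""
    spans, cursor = [], 0
    for piece in pieces:
        token = piece.lstrip("\u2581")          # strip SentencePiece marker
        start = text.find(token, cursor)        # resume from last match
        spans.append((start, start + len(token)))
        cursor = start + len(token)
    return spans

pieces = ["\u2581Hello", "\u2581world", "s"]
print(align_pieces("Hello worlds", pieces))  # [(0, 5), (6, 11), (11, 12)]
```

Character spans like these are exactly what span-extraction tasks such as SQuAD need to map predicted token positions back to answer text.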

Encoder Decoder Models

Any encoder model (BERT, GPT2, etc.) can be converted into decoder mode, gaining extra cross-attention layers, with just a few keyword arguments. If the encoder hidden size differs from the decoder hidden size, the encoder states are automatically projected through a randomly initialized layer, which can be fine-tuned together with the rest of the model.
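The automatic projection amounts to a learned linear map from the encoder hidden size to the decoder hidden size. A minimal NumPy sketch (the 512/768 dimensions are hypothetical examples, not fixed by the library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Encoder outputs with hidden size 512; the decoder expects 768.
enc_hidden = rng.normal(size=(2, 10, 512))   # (batch, seq, enc_dim)

# Randomly initialized projection, fine-tuned with the rest of the model.
W = rng.normal(size=(512, 768)) * 0.02       # small init scale
b = np.zeros(768)

projected = enc_hidden @ W + b               # (batch, seq, dec_dim)
print(projected.shape)  # (2, 10, 768)
```

After projection, the cross-attention layers in the decoder can consume the encoder states as if both models shared a hidden size.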

Keras + model.compile2

There is a custom trainer for when we are using only a single-GPU machine.

Note: compile2 doesn't support metrics yet.

Super Fast Decoders via Serialization

Write/Read/Process TFRecords in ~5 lines

TFProcessor is useful when you want to evaluate a model's performance on a dev set, where shuffling is not required. It has full support for tf.ragged tensors. All of these have out-of-the-box auto-batching support, which works in ~90% of cases.
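For reference, the round trip the library wraps looks roughly like this in plain TensorFlow (standard `tf.io`/`tf.data` APIs; the feature name "x" and file name are arbitrary choices for the example):

```python
import tensorflow as tf

# Write a few integer features to a TFRecord file.
with tf.io.TFRecordWriter("sample.tfrecord") as writer:
    for value in [1, 2, 3]:
        feat = {"x": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[value]))}
        example = tf.train.Example(
            features=tf.train.Features(feature=feat))
        writer.write(example.SerializeToString())

# Read the records back with tf.data and parse each one.
schema = {"x": tf.io.FixedLenFeature([], tf.int64)}
dataset = tf.data.TFRecordDataset("sample.tfrecord").map(
    lambda record: tf.io.parse_single_example(record, schema))
values = [int(example["x"]) for example in dataset]
print(values)  # [1, 2, 3]
```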

HuggingFace Converters


tf-transformers supports BERT, ALBERT, RoBERTa, T5, mT5, and GPT2. There is still a wide gap between typical tutorials and industry applications in NLP. The primary focus of tf-transformers is to bridge that gap without compromising speed or ease of use.


Tutorials will follow on GitHub after the release.