How we made FastText faster

FastText is a library for efficient text classification and representation learning. Like its sibling, Word2Vec, it produces meaningful word embeddings from a given corpus of text. Unlike its sibling, FastText uses n-grams for word representations, making it great for text-classification projects like language detection, sentiment analysis, and topic modeling.  Here at GIPHY, we use FastText […]

FastText is a library for efficient text classification and representation learning. Like its sibling, Word2Vec, it produces meaningful word embeddings from a given corpus of text. Unlike its sibling, FastText uses n-grams for word representations, making it great for text-classification projects like language detection, sentiment analysis, and topic modeling. 

Here at GIPHY, we use FastText for search query analysis tasks like language prediction and query clustering. Given the tremendous volume of distinct queries we receive every day, we need these tasks to be as performant and low-latency as possible so that our analytics update within a reasonable and useful time-frame.

In this post, we break down how we optimized FastText in various ways, from compilation to training to inference. Follow along to learn how we did it so you can optimize FastText wherever you’re using it.

FastText Overview

Fastext supports both supervised and unsupervised  (cbow, skip gram) training modes, model quantization and automatic hyperparameter tuning. Facebook has published pretrained English word vectors, as well as multilingual word vectors for 157 different languages. Out of the box we can use FastText from bash, C++, and Python.

In terms of architecture, FastText is written in C++ and can be compiled with C++11 and newer versions of C++. Typical C and C++ applications have multiple translation units. This allows developers to have independent compilation of different parts of the program, but it blocks some optimizations.

What slows down C++ applications

Slow algorithms, inefficient data structures, function calls, memory allocations, and cache-misses make our applications slow. With better algorithms, we can do fewer basic operations, which leads to better performance. Polylogarithmic, linear, and quasilinear algorithms guarantee the scalability of our applications.

With efficient data structures, we can save memory and guarantee predictable memory access patterns. Predictable memory access patterns are particularly important. For modern CPUs, RAM behaves pretty much like a hard drive. L1 cache reference costs half a nanoseconds, branch misprediction is ten times more expensive, but the main memory reference is 200 times slower than L1 cache access. So, the performance is dictated by memory access patterns, pluses are free, divisions are expensive, function calls and cache-unfriendly data structures are catastrophic for the performance of our software, and algorithms with the same asymptotic complexity can have surprisingly different performance.

Performance Enhancements

We will introduce two kinds of performance improvements: build-level and code-level.

Let’s start our journey and see how deep the rabbit hole really is. We will take the first 35,653,488 bytes from English Wikipedia and perform some initial measurements. The dataset can be found on Matt Mahoney’s website, and we will do some preprocessing as suggested on the FastText website:

$ git [email protected]:facebookresearch/fastText.git
$ cd fastText
$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data
$ perl wikifil.pl data/enwik9 > data/fil9
$ head -c 35653488  data/fil9 > data/fil9.tiny

For all our experiments we will use:

– gcc (GCC) 9.2.1 20190827 (Red Hat 9.2.1-1)
– Fedora 31
– Linux 5.5.7 kernel
– Intel Core I7 7700HQ
– HyperX Impact 32GB Kit (2x16GB) 2400MHz DDR4 CL14 260-Pin SODIMM Laptop

Training

We will measure training time for a single core. This allows us to get reproducible models, because even with the same random seed it is impossible to get identical results on multiple cores. It will also reduce any uncertainty in our measurements.

Let’s start with a Makefile section, which captures useful information out of the execution process:

bm:


sync; echo 1 > sudo tee /proc/sys/vm/drop_caches
sync; echo 2 > sudo tee /proc/sys/vm/drop_caches
sync; echo 3 > sudo tee /proc/sys/vm/drop_caches
mkdir -p benchmarks /usr/bin/time -p  -o benchmarks/$(BENCHMARK_NAME).time.txt perf record -o benchmarks/$(BENCHMARK_NAME).perf.record  -g ./fasttext skipgram -input data/fil9.tiny -output
result/skipgram.fil9.tiny.1.2147483563.$(BENCHMARK_NAME) -thread 1 -seed 2147483563 -verbose 0 perf stat -o benchmarks/$(BENCHMARK_NAME).perf.stat.v1 -g ./fasttext skipgram -input data/fil9.tiny -output
result/skipgram.fil9.tiny.stat.1.2147483563.$(BENCHMARK_NAME) -thread 1 -seed 2147483563 -verbose 0 perf stat -o benchmarks/$(BENCHMARK_NAME).perf.stat.v2 -g ./fasttext skipgram -input data/fil9.tiny -output
result/skipgram.fil9.tiny.stat.1.2147483563.$(BENCHMARK_NAME) -thread 1 -seed 2147483563 -verbose 0 perf stat -o benchmarks/$(BENCHMARK_NAME).perf.stat.v3 -g ./fasttext skipgram -input data/fil9.tiny -output
result/skipgram.fil9.tiny.stat.1.2147483563.$(BENCHMARK_NAME) -thread 1 -seed 2147483563 -verbose 0 perf script -i benchmarks/$(BENCHMARK_NAME).perf.record >
benchmarks/$(BENCHMARK_NAME).perf.record.script stackcollapse-perf.pl benchmarks/$(BENCHMARK_NAME).perf.record.script >
benchmarks/$(BENCHMARK_NAME).folded flamegraph.pl benchmarks/$(BENCHMARK_NAME).folded > benchmarks/$(BENCHMARK_NAME).folded.svg
stat ./fasttext > benchmarks/$(BENCHMARK_NAME).stat
size ./fasttext > benchmarks/$(BENCHMARK_NAME).size

This script does a few useful things:

  1. Drops caches of the operating system.
  2. Records the execution process with perf.
  3. Keeps track of the performance of counters.
  4. Generates a flame graph from the information, which was recorded by perf.

Experiments

We’ve added -fno-omit-frame-pointer to CXXFLAGS. This allows us to record the call graph. Before making progress, we’ll introduce six groups of compilation flags and add them to the Makefile.

It is possible to use other optimization techniques such as AudoFDO, which is also known as Profile-guided optimization(PGO), but it is out of the scope of the article. Since the training process takes a long time, we didn’t have time to launch all the configs many times. We did ten measurements and took the median. We have not observed serious variance in the results. Let’s measure the speed of training process:

export BENCHMARK_NAME=opt__fb FASTTEXT_CONFIG=fb && make clean && make opt && make bm

This script will generate a few files:

We will generate similar files for each group of CXXFLAGS. The training process takes 312.663296 seconds. Perf had enough time to gather all the useful run-time information from our CPU.  So, we can take a look at our flame graph.

From the flame graph above we can see that most of the time we are doing linear algebra and other mathematical operations. The flame graph is a little bit noisy because Perf tries to be as friendly to our program as possible.

Another perf app, called perf-stat, tells us about task-clock, context-switches, cpu-migrations, page-faults, number of CPU cycles, instructions, branches, branch_misses, and the time of execution. We can find this information in *.perf.stat.v* files.

The simplest way to improve the performance of floating-point operations is to allow the compiler to generate SIMD instructions. In the case of gcc we can add an option -ffast-math. It encapsulates several other optimizations.

But let’s start with a safer operation: link-time optimization. This option runs the standard link-time optimizer. When invoked with source code, it generates GIMPLE (one of GCC’s internal representations) and writes it to special ELF sections in the object file. When the object files are linked together, all the function bodies are read from these ELF sections and instantiated as if they had been part of the same translation unit.

To use the link-time optimizer, -flto and optimization options should be specified at compile time and during the final link. It is recommended that you compile all the files participating in the same link with the same options and also specify those options at link time.

OBJS = args.o autotune.o matrix.o dictionary.o loss.o productquantizer.o densematrix.o quantmatrix.o vector.o model.o utils.o meter.o fasttext.o
INCLUDES = -I.
 
opt-flto: CXXFLAGS += -O3 -funroll-loops -DNDEBUG -flto
opt-flto: fasttext
 
coverage: CXXFLAGS += -O0 -fno-inline -fprofile-arcs --coverage

It is important to know the FastText team added the -DNDEBUG option, and according to our measurements, the absence of this option increases the training time 1.44X. This option discards unnecessary run-time checks, like assert calls. With -flto we won an extra 1.04% in performance, nothing impressive, but keep reading to see some more significant improvements.

Let’s add the fast-math option to the Makefile.

OBJS = args.o autotune.o matrix.o dictionary.o loss.o productquantizer.o densematrix.o quantmatrix.o vector.o model.o utils.o meter.o fasttext.o
INCLUDES = -I.
 
opt: CXXFLAGS += -O3 -funroll-loops -DNDEBUG -flto -ffasth-math
opt: fasttext
 
coverage: CXXFLAGS += -O0 -fno-inline -fprofile-arcs --coverage

With a few keystrokes we’ve achieved 1.29X performance improvement. Let’s gather the final results in a table.

We’ve  sped up the training process by 1.29X, but what about the inference stage? Can we do even better?

Inference

Here comes a hard part — because we will be dealing with microseconds and nanoseconds. With such tiny numbers, we’ll need microbenchmarks to support our results. We’ll use Google Benchmark, a library to benchmark code snippets similar to unit tests, and combine it with  FastText and Perf.  In all cases, the number of iterations for which the benchmark is run is governed by the amount of time the benchmark takes. Concretely, the number of iterations is at least one, not more than 1e9, until CPU time is greater than the minimum time, or the wallclock time is 5x minimum time. The minimum time is set per benchmark by calling MinTime on the registered benchmark object [google benchmark].

For our microbenchmarks, we will use a pretrained model for language identification and test out three functions from the FastText library: 

  1. BM_get_word_vector
  2. BM_get_nn (for this task we use an amazing library)
  3. BM_predict_line (prediction performance in supervised mode)
#include 
#include 
#include "fasttext.h"
 
fasttext::FastText& get_fasttest_model() {
	static bool is_initialized = false;
	static fasttext::FastText fasttext_model;
	if (is_initialized) { return fasttext_model;}
	const char* path_to_fasttext = getenv("FASTTEXT_MODEL");
	if (path_to_fasttext == nullptr) {
       		std::cerr << "There is no modeln";
    		std::exit(1);
	}
	fasttext_model.loadModel(path_to_fasttext);
	return fasttext_model;
}

// the code benchmark

BENCHMARK(BM_get_word_vector);
BENCHMARK(BM_get_nn);
BENCHMARK(BM_predict_line);
BENCHMARK_MAIN();

BM_get_word_vector

For this first function benchmark, we will preload the FastText model and initialize a word vector. We’ll run a loop and search the vector for related words for our given string “happy.”

static void BM_get_word_vector(benchmark::State& state) {
	const auto& fasttext_model = get_fasttest_model();
 
	fasttext::Vector word_vector(fasttext_model.getDimension());
	std::string word = "happy";
 
	for (auto _ : state) {
          	            fasttext_model.getWordVector(word_vector, word);
    		escape(&word_vector);
	}
}

Thanks to these compiler options, we’re already 23% faster without any changes to the code. Now let’s look at the code itself. The original function is full of implicit conversions and contains a constant reference to a temporary object, which forces gcc to copy the object (most of the time we don’t need to do that).

void FastText::getWordVector(Vector& vec, const std::string& word) const {
  const std::vector& ngrams = dict_->getSubwords(word); 
  vec.zero();
  for (int i = 0; i  0) {
	vec.mul(1.0 / ngrams.size());
  }
}

Let’s get rid of implicit conversions and improve the way we’re handling copying the array of ngrams:

void FastText::getWordVector(Vector& vec, const std::string& word) const {
  auto word_id = dict_->getId(word);
  vec.zero();
  if (word_id == -1) {
  	const std::vector ngrams = dict_->getSubwords(word);
  	for (int32_t ngram : ngrams) {
    		addInputVector(vec, ngram);
  	}
  	if (ngrams.size() > 0) {
    	vec.mul(1.0 / ngrams.size());
  	}
  } else {
  	const std::vector& ngrams = dict_->getSubwords(word_id);
  	for (int32_t ngram : ngrams) {
    		addInputVector(vec, ngram);
  	}
  	if (ngrams.size() > 0) {
    	vec.mul(1.0 / ngrams.size());
  	}
  }
}

This, along with the right set of compiler options, gave us 1.48X speed up.

BM_get_nn

Here is the function for this benchmark:

static void BM_get_nn(benchmark::State& state) {
	auto& fasttext_model = get_fasttest_model();
	int32_t k = 10;
	const std::string word = "happy";
	for (auto _ : state) {
    	std::vector<std::pair> results = fasttext_model.getNN(word, k);
    	escape(&results);
	}
}

Similar to the previous case, we can improve the performance without any code changes using compiler flags:

Can we do better? Of course! Method FastText::getNN takes a std::set as the last argument. We don’t need it in our scenario, so we can get 2.13X speed up instead of 1.22X:

std::vector<std::pair> getNN(
   	const DenseMatrix& wordVectors,
   	const Vector& queryVec,
   	int32_t k,
   	const std::set& banSet);

Std::set is implemented as a red-black tree. Red-black-trees searches have logarithmic complexity, and are slower than binary search and hash-table lookup. In our scenarios we don’t need any look-ups, so we can simplify the original code.

std::vector<std::pair> FastText::getNN(
	const DenseMatrix& wordVectors,
	const Vector& query,
	int32_t k,
	const std::string& word) {
  ++k;
  std::vector<std::pair> heap;
  heap.reserve(k);
  int32_t nwords = dict_->nwords();
 
  int32_t i = 0;
  for (; i < nwords && heap.size() getWord(i);
	heap.emplace_back(wordVectors.dotRow(query, i), std::move(word));
  }
  std::make_heap(heap.begin(), heap.end(), [](const auto& x, const auto& y) { return x.first > y.first; });
 
  for (; i getWord(i);
	real similarity = wordVectors.dotRow(query, i);
	if (similarity >= heap.front().first) {
    	std::pop_heap(heap.begin(), heap.end(), [](const auto& x, const auto& y) { return x.first > y.first; });
    	heap.pop_back();
    	heap.emplace_back(similarity, std::move(word));
    	std::push_heap(heap.begin(), heap.end(), [](const auto& x, const auto& y) { return x.first > y.first; });
	}
  }
  std::sort(heap.begin(), heap.end(), [](const auto& x, const auto& y) { return x.first > y.first; });
  std::remove_if(heap.begin(), heap.end(), [&](const auto& pair) { return pair.second == word; });
  heap.pop_back();
 
  real queryNorm = query.norm();
  if (std::abs(queryNorm) < 1e-8) {
	queryNorm = 1;
  }
 
  for (auto& word_similarity : heap) {
	word_similarity.first /= queryNorm;
  }
 
  return heap;
}

And the results….

Now we don’t do string comparison for every word from our vocabulary, and, instead of one complicated loop, we have multiple simpler loops. Here’s the breakdown:

  1. Build a heap.
  2. Use this heap to get top k + 1 results.
  3. Sort the heap.
  4. Remove the query from the results.
  5. Normalize similarities.

We could improve this code even further by replacing the heap with the nth_element function call, but there is a CPU vs Memory trade-off. Since the vocabulary can be huge, we decided to keep the heap-based implementation, which uses less RAM. In the next section we will see how to get a faster algorithm for partial sort.

BM_predict_line

Again, we can speed this function up just with compiler flags and achieve a 1.25X speed up:

In the code, we can change HierarchicalSoftmaxLoss::predict function to give us a speed boost:

void HierarchicalSoftmaxLoss::predict(
	int32_t k,
	real threshold,
	Predictions& heap,
	Model::State& state) const {
  dfs(k, threshold, 2 * osz_ - 2, 0.0, heap, state.hidden);
#ifdef GIPHY  
  auto middle = heap.begin() + std::min(heap.size(), size_t(k));
  std::nth_element(heap.begin(), middle, heap.end(), [](const auto& x, const auto& y) { return x.first > y.first; });
  heap.resize(middle - heap.begin());
  std::sort(heap.begin(), heap.end(), [](const auto& x, const auto& y) { return x.first > y.first; });
#else
  std::sort_heap(heap.begin(), heap.end(), comparePairs);
#endif
}

And we can get rid of the heap in dfs within the implementation of Hierarchical Softmax Loss:

void HierarchicalSoftmaxLoss::dfs(
	int32_t k,
	real threshold,
	int32_t node,
	real score,
	Predictions& heap,
	const Vector& hidden) const {
  if (score dotRow(hidden, node - osz_);
  f = 1. / (1 + std::exp(-f));
  dfs(k, threshold, tree_[node].left, score + std_log(1.0 - f), heap, hidden);
  dfs(k, threshold, tree_[node].right, score + std_log(f), heap, hidden);
}

We don’t need a heap sort to get top K predictions; it’s slow, and we don’t need to save memory. Instead, we can load the data and use std::nth_element and std::sort to implement partial sort. With a few code changes we’ve achieved 1.48X speed up instead of 1.25X.

How can you repeat the results?

For microbenchmarks:
$ # Download a model for language identification
$ wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin
$ bash scripts/run_microbenchmarks.bash
$ # Observe the results in Jupyter Notebook
$ jupyter notebook  # Open Benchmark.micro.ipynb

For training experiments:
$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data
$ unzip data/enwik9.zip -d data
$ perl wikifil.pl data/enwik9 > data/fil9
$ head -c 35653488  data/fil9 > data/fil9.tiny
$ bash scripts/run_macrobenchmarks.bash

Conclusion

As you can see, choosing the right compiler flags can have a massive impact on performance. Sometimes that’s good enough, but double-checking your algorithms can also help you squeeze out even better enhancements. Algorithms with the same asymptotic complexity can have dramatically different performance, so it is important to use cache-friendly algorithms. Our improvements allowed us to get:

– Faster Training
– Faster Inference Process
– Faster Hyperparameter tuning, which can lead to better models

Here are the final numbers:

Performance enhancements like these are crucial when dealing with huge datasets that need recurring or real-time analysis. Let us know if you were able to utilize these recommendations by reaching out to us on twitter (@giphyeng).

Source: GIPHY