Mozilla's Llamafile 0.8.2 Scores Big With New AVX2 Performance Optimizations
One of the interesting innovations out of Mozilla Ocho, the browser company's innovation and experiments group, is Llamafile, an easy way to distribute and run AI large language models (LLMs) from a single file. Out this evening is Llamafile 0.8.2 as the newest release, bringing an updated Llama.cpp and, most excitingly, some AVX2 performance optimizations.
Llamafile aims to make AI LLMs more accessible to users and developers by streamlining deployment: a single file that supports both CPU and GPU execution and works across platforms. Llamafile already leverages AVX/AVX2 for faster performance as well as AVX-512 for even greater speed-ups. With today's Llamafile 0.8.2 release there are additional AVX2 optimizations.
The Llamafile 0.8.2 release notes mention:
"This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS. This was contributed to llamafile by @ikawrakow who originally invented K quants last year: ggerganov/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant since it was the simplest and most intuitive to get working with recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) will now go the fastest (on modern x86 systems)."
Advanced Vector Extensions 2 (AVX2) has been widely supported across Intel and AMD processors for years: on the Intel side since Haswell roughly a decade ago, and on the AMD side since the Excavator cores.
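Because a single llamafile binary has to run on CPUs both with and without these extensions, the practical prerequisite for optimizations like this is checking CPU features at run-time rather than relying on compile-time flags. A minimal C++ sketch of that general mechanism using GCC/Clang builtins (illustrative only, not Llamafile's actual dispatch code):

#include <cstdio>

// Query at run-time which vector extensions the host CPU offers, so a
// single binary can select AVX2 or AVX-512 code paths where available
// and fall back to plainer code elsewhere.
int main() {
    __builtin_cpu_init();
    std::printf("AVX2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    std::printf("AVX-512: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}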
The pull request notes some exciting gains from the faster AVX2 prompt processing, with reported speed-ups in the 1.4x to 2.3x range across various quantization formats.
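To give a flavor of where such prompt-processing speed-ups come from: AVX2 with FMA lets a matrix-multiply inner loop multiply-accumulate eight single-precision floats per instruction instead of one. Here is a self-contained C++ sketch of such a dot product; the function name is hypothetical and the real kernels operate on the quantized block formats directly rather than plain floats.

#include <immintrin.h>  // AVX2/FMA intrinsics; build with -mavx2 -mfma
#include <cstddef>

// Hypothetical example: an AVX2+FMA dot product that multiply-accumulates
// eight floats per instruction, with a scalar tail for leftover elements.
float dot_avx2(const float *a, const float *b, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_fmadd_ps(va, vb, acc);  // acc += va * vb in one step
    }
    // Horizontal reduction of the eight accumulator lanes to one float.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; i < n; ++i)
        sum += a[i] * b[i];  // handle the final n % 8 elements
    return sum;
}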
Justine Tunney, who is heavily involved with Llamafile development, initially responded to that pull request:
"This is a remarkable change @ikawrakow. I'm very happy to see that the best quantized formats will now go the fastest. For prompt processing, I'm consistently seeing speedups between 1.2x - 2.0x on x86-64 machines. You even managed to make token generation go faster (which I've found much more difficult), in some cases by as much as 1.33x!"
These AVX2 prompt processing optimizations would be exciting enough on their own for Llamafile 0.8.2, but this v0.8.2 release also brings a memory bug fix, slight performance optimizations to text generation, updates against this week's Llama.cpp code, and various new flags.
Downloads and more details on the Llamafile 0.8.2 release are available via GitHub. New benchmarks of this Llamafile version are coming soon.