Mozilla's Llamafile 0.8.2 Scores Big With New AVX2 Performance Optimizations

Written by Michael Larabel in Mozilla on 9 May 2024 at 08:15 PM EDT.
One of the interesting innovations out of Mozilla Ocho, the browser company's innovation and experiments group, is Llamafile, an easy way to distribute and run AI large language models (LLMs) from a single file. Out this evening is Llamafile 0.8.2, the newest release, with an updated Llama.cpp and, most excitingly, some AVX2 performance optimizations.

Llamafile aims to make AI LLMs more accessible to users and developers by supporting streamlined deployments of large language models from a single file that can work with both CPU and GPU execution as well as across platforms. Llamafile already leverages AVX/AVX2 for faster performance and supports AVX-512 for even greater speed-ups. Today's Llamafile 0.8.2 release adds further AVX2 optimizations.

The Llamafile 0.8.2 release notes mention:
"This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS. This was contributed to llamafile by @ikawrakow who originally invented K quants last year: ggerganov/llama.cpp@99009e7. In prior releases we recommended the legacy Q4_0 quant since it was the simplest and most intuitive to get working with recent matmul optimizations. Thanks to Iwan Kawrakow's efforts, the best quants (e.g. Q5_K_M) will now go the fastest (on modern x86 systems)."

Advanced Vector Extensions 2 (AVX2) has been widely supported across Intel and AMD processors for years: most Intel CPUs of the past decade since Haswell, and AMD CPUs since Excavator.
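
For readers unsure whether their own system falls into that AVX2-capable group, here is a minimal sketch of one way to check on Linux. It assumes an x86 machine that exposes a `flags` line in `/proc/cpuinfo`. The helper names `cpu_flags` and `supports_avx2` are made up for illustration, and this is not how Llamafile itself detects CPU features.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU feature flags found in /proc/cpuinfo-style text.

    Assumption: x86 Linux kernels list features on a line starting with
    "flags". Other architectures use different keys and yield an empty set.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_avx2(cpuinfo_text: str) -> bool:
    """True if the flags line contains the exact token 'avx2'."""
    return "avx2" in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    if path.exists():
        print("AVX2 supported:", supports_avx2(path.read_text()))
    else:
        print("No /proc/cpuinfo here; this sketch only covers Linux.")
```

Matching on whole tokens rather than substrings avoids confusing `avx2` with neighbouring flags such as `avx` or the `avx512*` family.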

The pull request notes some exciting gains for faster AVX2 prompt processing. Reported speed-ups were in the 1.4~2.3x range for various quants.

[Image: Llamafile 0.8.2 AVX2 gains]

Justine Tunney, who is heavily involved with Llamafile development, initially responded to that pull request:
"This is a remarkable change @ikawrakow. I'm very happy to see that the best quantized formats will now go the fastest. For prompt processing, I'm consistently seeing speedups between 1.2x - 2.0x on x86-64 machines. You even managed to make token generation go faster (which I've found much more difficult), in some cases by as much as 1.33x!"


These AVX2 optimizations for prompt processing are exciting enough on their own for Llamafile 0.8.2. But this v0.8.2 release also brings a memory bug fix, slight performance optimizations to text generation, updates against the Llama.cpp code as of this week, and various new flags.

Downloads and more details on the Llamafile 0.8.2 release are available via GitHub. New Llamafile benchmarks against the new version are coming soon.
About The Author
Michael Larabel

Michael Larabel is the principal author of Phoronix.com and founded the site in 2004 with a focus on enriching the Linux hardware experience. Michael has written more than 20,000 articles covering the state of Linux hardware support, Linux performance, graphics drivers, and other topics. Michael is also the lead developer of the Phoronix Test Suite, Phoromatic, and OpenBenchmarking.org automated benchmarking software. He can be followed via Twitter, LinkedIn, or contacted via MichaelLarabel.com.
