Llamafile 0.8.7 Brings Fixes, Better ARM Performance & Preps For New Server

Written by Michael Larabel in Mozilla on 24 June 2024 at 11:11 AM EDT.

Llamafile has been one of the better new initiatives out of Mozilla in recent years. It makes it easy to distribute and run a large language model as a single file, supports both CPU and GPU execution, and makes LLMs much more approachable for end-users. Out today is Llamafile 0.8.7 with more performance optimizations and new features.

Recent Llamafile releases have tuned AVX performance on Intel/AMD CPUs; today's Llamafile 0.8.7 release turns to ARM. There is better performance on Arm for legacy quants and K-quants, along with optimized matrix multiplication for I-quants on AArch64.

Llamafile 0.8.7 also fixes some AMD GPU issues on Windows by now always using tinyBLAS there, improves CPU brand detection, and includes other fixes.

Moving forward, a new Llamafile server is being readied for rollout. Justine Tunney mentioned in the v0.8.7 release announcement on GitHub:

"It should be noted that, in future releases, we plan to introduce a new server for llamafile. This new server is being designed for performance and production-worthiness. It's not included in this release, since the new server currently only supports a tokenization endpoint. However the endpoint is capable of doing 2 million requests per second whereas with the current server, the most we've ever seen is a few thousand."

The patch adding the new Llamafile server notes that it is not only much faster than the current one but also designed to be crash-proof, reliable, and preempting.

Llamafile continues to look great for easily distributing and running large language models. Learn more about this open-source project via Llamafile.ai.