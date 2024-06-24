
Llamafile 0.8.7 Brings Fixes, Better ARM Performance & Preps For New Server
Following recent Llamafile releases that tuned AVX performance on Intel and AMD processors, today's Llamafile 0.8.7 release turns to ARM. Legacy and K-quants now run faster on Arm, and I-quants gain optimized matrix multiplication on AArch64.
Llamafile 0.8.7 also fixes some AMD GPU issues on Windows by now always using tinyBLAS there, improves CPU brand detection, and includes other fixes.
Looking ahead, a new Llamafile server is being prepared for rollout. Justine Tunney mentioned in the v0.8.7 release announcement on GitHub:
"It should be noted that, in future releases, we plan to introduce a new server for llamafile. This new server is being designed for performance and production-worthiness. It's not included in this release, since the new server currently only supports a tokenization endpoint. However the endpoint is capable of doing 2 million requests per second whereas with the current server, the most we've ever seen is a few thousand."
The patch adding the new Llamafile server notes that it is not only much faster than the current one but also designed to be crash-proof, reliable, and preempting.
Llamafile continues to look great for easily distributing and running large language models. Learn more about this open-source project via Llamafile.ai.