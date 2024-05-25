
Llamafile 0.8.5 Delivers Greater Performance: Tiny Models 2x Faster On Threadripper
Llamafile 0.8.5 is the newest version and delivers yet more performance tuning, on top of the recent work around AVX2 optimizations, more AMD GPU offloading, and other improvements. Justine Tunney explained the latest performance work in Llamafile 0.8.5:
"As of #435 the K quants now go consistently 2x faster than llama.cpp upstream. On big CPUs like Threadripper we've doubled the performance of tiny models, for both prompt processing and token generation for tiny models."
Doubling the performance for tiny models on AMD Ryzen Threadripper class hardware!
(Image: HP Z6 G5 A with AMD Ryzen Threadripper PRO 7000 series)
Llamafile 0.8.5 also delivers faster AVX2 matrix multiplication for MoE models and legacy quants. There are also some AMD Zen 4 performance optimizations, BF16 NVIDIA CUDA support, and other improvements.
Downloads and more details on the Llamafile 0.8.5 release are available via GitHub. I'll be working on new Llamafile benchmarks soon.