I noticed that the llamafile Prompt Processing test results always contain the same values regardless of the benchmarked hardware. For example, in the pts/llamafile-1.3.x - Model: TinyLlama-1.1B-Chat-v1.0.BF16 - Test: Prompt Processing 512 test configuration you can see:
[Attached image: pts-llamafile-prompt-processing-bug.png]
Note that the tokens-per-second value is always 8192 here. My guess is that there is a parsing problem with the llama-bench output and the number of processed tokens is being used as the test result instead of the prompt processing (prompt eval time) tokens-per-second value.
Michael, I think it's a bad idea to parse the llama-bench verbose output (-v) to gather results. You can instead use the llama-bench -o option to select the output format (for example -o json or -o csv) and parse that output; the tokens-per-second value will then always be returned in the "avg_ts" field, which should make parsing easier. The output also contains the standard deviation and the individual samples in case you want to use them. Example JSON output:
Code:
$ ./o/llama.cpp/llama-bench/llama-bench -o json -t 32 -m ~/projects/llama.cpp/models/wizardcoder-python-34b-v1.0.Q6_K.gguf -p 256 -n 0
warning: don't know how to govern your cpu temperature; consider setting the environment variables described in llamafile/govern.cpp
[
  {
    "build_commit": "a30b324",
    "build_number": 1500,
    "cuda": false,
    "opencl": false,
    "vulkan": false,
    "kompute": false,
    "metal": false,
    "sycl": false,
    "gpu_blas": false,
    "blas": false,
    "cpu_info": "AMD EPYC 9374F 32-Core Processor (znver4)",
    "gpu_info": "",
    "model_filename": "wizardcoder-python-34b-v1.0.Q6_K.gguf",
    "model_type": "llama 34B Q6_K",
    "model_size": 27683140736,
    "model_n_params": 33743986688,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 32,
    "type_k": "bf16",
    "type_v": "bf16",
    "n_gpu_layers": 0,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 256,
    "n_gen": 0,
    "test_time": "2025-01-14T20:29:20Z",
    "avg_ns": 3549106833,
    "stddev_ns": 2050444,
    "avg_ts": 72.130840,
    "stddev_ts": 0.041684,
    "samples_ns": [ 3549898633, 3550643318, 3546778548 ],
    "samples_ts": [ 72.1147, 72.0996, 72.1782 ]
  }
]
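For illustration, here is a minimal Python sketch of how that JSON output could be consumed. This is not the actual pts/llamafile test profile code; the binary path, model path, and parameters are placeholders taken from the discussion above, and the field names ("avg_ts", "stddev_ts", "samples_ts") are as shown in the example output.

Code:
import json
import subprocess

# Hypothetical invocation: run llama-bench with machine-readable JSON output
# instead of parsing the -v verbose log. Paths/parameters are placeholders.
cmd = [
    "./o/llama.cpp/llama-bench/llama-bench",
    "-o", "json",
    "-t", "32",
    "-m", "models/TinyLlama-1.1B-Chat-v1.0.BF16.gguf",
    "-p", "512",   # Prompt Processing 512 test
    "-n", "0",
]
raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# llama-bench prints a JSON array with one object per benchmark configuration.
# In case warnings (like the cpu temperature notice above) end up on stdout,
# start parsing from the first '[' to skip anything before the JSON array.
results = json.loads(raw[raw.index("["):])

for r in results:
    # avg_ts is the average tokens per second; stddev_ts and samples_ts
    # give the spread across the individual runs.
    print(f"{r['model_type']}: {r['avg_ts']:.2f} +/- {r['stddev_ts']:.2f} t/s "
          f"(samples: {r['samples_ts']})")

With output like this, the test result is just the "avg_ts" field rather than something scraped out of free-form verbose text.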