llamafile Prompt Processing tests return invalid results

  • fairydreaming
    Junior Member
    • Oct 2024
    • 9

    I noticed that the llamafile Prompt Processing test results always contain the same values regardless of the benchmarked hardware. For example, in the pts/llamafile-1.3.x - Model: TinyLlama-1.1B-Chat-v1.0.BF16 - Test: Prompt Processing 512 test configuration you can see:
    [Attachment: pts-llamafile-prompt-processing-bug.png]
    Note that the tokens per second value is always 8192 here. My guess is that there is a parsing problem with the llama-bench output, and that the number of processed tokens is being used as the test result instead of the prompt processing (prompt eval time) tokens per second value.

    Michael, I think it's a bad idea to parse llama-bench's verbose output (-v) to gather results. You can instead use the llama-bench -o option to select the output format (for example -o json or -o csv) and parse that output; the tokens per second value will then always be returned in the "avg_ts" field, which should make parsing easier. The output also includes the standard deviation and the individual samples in case you want to use them. Example JSON output (a small parsing sketch follows it):

    Code:
    $ ./o/llama.cpp/llama-bench/llama-bench -o json -t 32 -m ~/projects/llama.cpp/models/wizardcoder-python-34b-v1.0.Q6_K.gguf -p 256 -n 0
    warning: don't know how to govern your cpu temperature; consider setting the environment variables described in llamafile/govern.cpp
    [
      {
        "build_commit": "a30b324",
        "build_number": 1500,
        "cuda": false,
        "opencl": false,
        "vulkan": false,
        "kompute": false,
        "metal": false,
        "sycl": false,
        "gpu_blas": false,
        "blas": false,
        "cpu_info": "AMD EPYC 9374F 32-Core Processor (znver4)",
        "gpu_info": "",
        "model_filename": "wizardcoder-python-34b-v1.0.Q6_K.gguf",
        "model_type": "llama 34B Q6_K",
        "model_size": 27683140736,
        "model_n_params": 33743986688,
        "n_batch": 2048,
        "n_ubatch": 512,
        "n_threads": 32,
        "type_k": "bf16",
        "type_v": "bf16",
        "n_gpu_layers": 0,
        "split_mode": "layer",
        "main_gpu": 0,
        "no_kv_offload": false,
        "flash_attn": false,
        "tensor_split": "0.00",
        "use_mmap": true,
        "embeddings": false,
        "n_prompt": 256,
        "n_gen": 0,
        "test_time": "2025-01-14T20:29:20Z",
        "avg_ns": 3549106833,
        "stddev_ns": 2050444,
        "avg_ts": 72.130840,
        "stddev_ts": 0.041684,
        "samples_ns": [ 3549898633, 3550643318, 3546778548 ],
        "samples_ts": [ 72.1147, 72.0996, 72.1782 ]
      }
    ]
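
    For what it's worth, here is a minimal parsing sketch in Python, assuming the JSON array is printed to stdout and that warnings (like the cpu-governor note above) may precede it; the binary path, model path, and flags are placeholders for whatever the test profile actually runs:

    Code:
    import json
    import subprocess

    # Placeholder command; the test profile would substitute its own
    # binary location, model path, thread count and prompt size.
    cmd = [
        "./o/llama.cpp/llama-bench/llama-bench",
        "-o", "json",        # structured output instead of verbose logs
        "-t", "32",
        "-m", "model.gguf",
        "-p", "512",         # prompt processing with 512 tokens
        "-n", "0",
    ]

    raw = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    # Warnings may be printed before the JSON array, so skip ahead to the
    # first '[' before decoding.
    results = json.loads(raw[raw.index("["):])

    for entry in results:
        # avg_ts is the mean tokens per second; stddev_ts and samples_ts
        # are also available if you want to report variance.
        print(f"{entry['model_type']}: {entry['avg_ts']:.2f} t/s (+/- {entry['stddev_ts']:.2f})")

    The same fields are available with -o csv if a flat format is easier to consume on the test-suite side.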