@ggerganov
Member

ref #7254

Reduce KV cache padding from 256 to 32 when FA is disabled
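To illustrate what the change amounts to, here is a minimal sketch. It assumes that "padding" means rounding a KV cache size up to the next multiple of the padding value. The helper names `pad_up`, `kv_padding`, and `padded_kv_size` are hypothetical and chosen for illustration; they are not taken from the llama.cpp source.

```cpp
#include <cstdint>

// Hypothetical helper (not the llama.cpp API): round x up to the next
// multiple of n. Assumes n is a power of two, which holds for 32 and 256.
constexpr uint32_t pad_up(uint32_t x, uint32_t n) {
    return (x + n - 1) & ~(n - 1);
}

// Hypothetical selector following the PR description:
// keep a padding of 256 with flash attention (FA), use 32 without it.
constexpr uint32_t kv_padding(bool flash_attn) {
    return flash_attn ? 256u : 32u;
}

// Padded size for a given number of used KV cells (illustrative only).
constexpr uint32_t padded_kv_size(uint32_t n_used, bool flash_attn) {
    return pad_up(n_used, kv_padding(flash_attn));
}
```

Under these assumptions, 100 used cells are padded to 128 with FA disabled and to 256 with FA enabled, so the smaller padding wastes less of the cache on short sequences.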

@mofosyne added the labels "Review Complexity : High" (generally requires in-depth knowledge of LLMs or GPUs) and "refactoring" on May 13, 2024.
@mofosyne added the label "Review Complexity : Medium" (generally requires more time to grok, but manageable at beginner to medium expertise level) and removed the label "Review Complexity : High" on May 13, 2024.
@slaren
Member

slaren commented May 13, 2024

Minor, but n_ctx is also padded to 256. Nvm.

@ggerganov ggerganov merged commit 614d3b9 into master May 13, 2024
@ggerganov ggerganov deleted the gg/fa-pad branch May 13, 2024 14:15
@github-actions
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 544 iterations 🚀

  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8628.72ms p(95)=20522.66ms fails=, finish reason: stop=487 truncated=57
  • Prompt processing (pp): avg=92.69tk/s p(95)=373.39tk/s
  • Token generation (tg): avg=37.38tk/s p(95)=46.72tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/fa-pad commit=cbca75cb2c6ff9371f57a024390be5610fd39f28
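As a quick sanity check on the figures above, the finish-reason counts (stop=487, truncated=57) can be related to the 544 reported iterations. The sketch below uses a hypothetical `FinishReasons` struct that is not part of the benchmark tooling, and assumes every iteration ended with one of these two reasons.

```cpp
// Hypothetical container for the finish-reason counts reported by the bot.
struct FinishReasons {
    int stop;       // requests that ended on a stop condition
    int truncated;  // requests that were cut off
};

// Total number of finished requests across both reasons.
constexpr int total_requests(FinishReasons r) {
    return r.stop + r.truncated;
}

// Fraction of finished requests that were truncated.
constexpr double truncated_share(FinishReasons r) {
    return static_cast<double>(r.truncated) / total_requests(r);
}

// Values copied from the benchmark summary.
constexpr FinishReasons reported{487, 57};
```

With the reported values the two counts add up to 544, matching the iteration count, and roughly one request in ten was truncated.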

prompt_tokens_seconds

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 544 iterations"
    y-axis "llamacpp:prompt_tokens_seconds"
    x-axis "llamacpp:prompt_tokens_seconds" 1715609591 --> 1715610219
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 468.45, 468.45, 468.45, 468.45, 468.45, 870.32, 870.32, 870.32, 870.32, 870.32, 883.17, 883.17, 883.17, 883.17, 883.17, 876.46, 876.46, 876.46, 876.46, 876.46, 926.56, 926.56, 926.56, 926.56, 926.56, 914.32, 914.32, 914.32, 914.32, 914.32, 923.7, 923.7, 923.7, 923.7, 923.7, 910.22, 910.22, 910.22, 910.22, 910.22, 904.5, 904.5, 904.5, 904.5, 904.5, 916.37, 916.37, 916.37, 916.37, 916.37, 938.57, 938.57, 938.57, 938.57, 938.57, 959.33, 959.33, 959.33, 959.33, 959.33, 966.93, 966.93, 966.93, 966.93, 966.93, 979.71, 979.71, 979.71, 979.71, 979.71, 981.22, 981.22, 981.22, 981.22, 981.22, 975.63, 975.63, 975.63, 975.63, 975.63, 964.92, 964.92, 964.92, 964.92, 964.92, 959.43, 959.43, 959.43, 959.43, 959.43, 934.38, 934.38, 934.38, 934.38, 934.38, 933.58, 933.58, 933.58, 933.58, 933.58, 937.36, 937.36, 937.36, 937.36, 937.36, 939.14, 939.14, 939.14, 939.14, 939.14, 955.85, 955.85, 955.85, 955.85, 955.85, 950.32, 950.32, 950.32, 950.32, 950.32, 951.32, 951.32, 951.32, 951.32, 951.32, 962.42, 962.42, 962.42, 962.42, 962.42, 958.14, 958.14, 958.14, 958.14, 958.14, 955.91, 955.91, 955.91, 955.91, 955.91, 953.64, 953.64, 953.64, 953.64, 953.64, 955.78, 955.78, 955.78, 955.78, 955.78, 955.05, 955.05, 955.05, 955.05, 955.05, 952.97, 952.97, 952.97, 952.97, 952.97, 949.77, 949.77, 949.77, 949.77, 949.77, 947.52, 947.52, 947.52, 947.52, 947.52, 939.74, 939.74, 939.74, 939.74, 939.74, 943.12, 943.12, 943.12, 943.12, 943.12, 944.49, 944.49, 944.49, 944.49, 944.49, 941.73, 941.73, 941.73, 941.73, 941.73, 939.8, 939.8, 939.8, 939.8, 939.8, 940.02, 940.02, 940.02, 940.02, 940.02, 940.58, 940.58, 940.58, 940.58, 940.58, 940.35, 940.35, 940.35, 940.35, 940.35, 948.64, 948.64, 948.64, 948.64, 948.64, 911.91, 911.91, 911.91, 911.91, 911.91, 850.98, 850.98, 850.98, 850.98, 850.98, 850.19, 850.19, 850.19, 850.19, 850.19, 847.99, 847.99, 847.99, 847.99, 847.99, 852.53, 852.53, 852.53, 852.53, 852.53, 853.35, 853.35, 853.35, 853.35, 
853.35, 852.27, 852.27, 852.27, 852.27, 852.27, 850.41, 850.41, 850.41, 850.41, 850.41, 850.61, 850.61, 850.61, 850.61, 850.61, 853.12, 853.12, 853.12, 853.12, 853.12, 854.09, 854.09, 854.09, 854.09, 854.09, 855.8, 855.8, 855.8, 855.8, 855.8, 859.7, 859.7, 859.7, 859.7, 859.7, 860.42, 860.42, 860.42, 860.42, 860.42, 859.19, 859.19, 859.19, 859.19, 859.19, 859.65, 859.65, 859.65, 859.65, 859.65, 861.47, 861.47, 861.47, 861.47, 861.47, 864.08, 864.08, 864.08, 864.08, 864.08]
                    
predicted_tokens_seconds
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 544 iterations"
    y-axis "llamacpp:predicted_tokens_seconds"
    x-axis "llamacpp:predicted_tokens_seconds" 1715609591 --> 1715610219
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 42.26, 42.26, 42.26, 42.26, 42.26, 39.8, 39.8, 39.8, 39.8, 39.8, 30.69, 30.69, 30.69, 30.69, 30.69, 33.76, 33.76, 33.76, 33.76, 33.76, 34.51, 34.51, 34.51, 34.51, 34.51, 34.59, 34.59, 34.59, 34.59, 34.59, 35.11, 35.11, 35.11, 35.11, 35.11, 35.31, 35.31, 35.31, 35.31, 35.31, 35.31, 35.31, 35.31, 35.31, 35.31, 34.91, 34.91, 34.91, 34.91, 34.91, 34.56, 34.56, 34.56, 34.56, 34.56, 33.63, 33.63, 33.63, 33.63, 33.63, 32.96, 32.96, 32.96, 32.96, 32.96, 32.96, 32.96, 32.96, 32.96, 32.96, 32.21, 32.21, 32.21, 32.21, 32.21, 30.72, 30.72, 30.72, 30.72, 30.72, 29.91, 29.91, 29.91, 29.91, 29.91, 30.11, 30.11, 30.11, 30.11, 30.11, 30.33, 30.33, 30.33, 30.33, 30.33, 30.06, 30.06, 30.06, 30.06, 30.06, 30.17, 30.17, 30.17, 30.17, 30.17, 30.42, 30.42, 30.42, 30.42, 30.42, 30.63, 30.63, 30.63, 30.63, 30.63, 30.49, 30.49, 30.49, 30.49, 30.49, 30.68, 30.68, 30.68, 30.68, 30.68, 30.92, 30.92, 30.92, 30.92, 30.92, 30.65, 30.65, 30.65, 30.65, 30.65, 30.42, 30.42, 30.42, 30.42, 30.42, 30.48, 30.48, 30.48, 30.48, 30.48, 30.73, 30.73, 30.73, 30.73, 30.73, 30.82, 30.82, 30.82, 30.82, 30.82, 30.84, 30.84, 30.84, 30.84, 30.84, 30.87, 30.87, 30.87, 30.87, 30.87, 30.93, 30.93, 30.93, 30.93, 30.93, 30.96, 30.96, 30.96, 30.96, 30.96, 30.86, 30.86, 30.86, 30.86, 30.86, 30.63, 30.63, 30.63, 30.63, 30.63, 30.5, 30.5, 30.5, 30.5, 30.5, 30.46, 30.46, 30.46, 30.46, 30.46, 30.53, 30.53, 30.53, 30.53, 30.53, 30.6, 30.6, 30.6, 30.6, 30.6, 30.68, 30.68, 30.68, 30.68, 30.68, 30.6, 30.6, 30.6, 30.6, 30.6, 30.6, 30.6, 30.6, 30.6, 30.6, 30.26, 30.26, 30.26, 30.26, 30.26, 30.21, 30.21, 30.21, 30.21, 30.21, 29.44, 29.44, 29.44, 29.44, 29.44, 29.39, 29.39, 29.39, 29.39, 29.39, 29.36, 29.36, 29.36, 29.36, 29.36, 29.3, 29.3, 29.3, 29.3, 29.3, 29.34, 29.34, 29.34, 29.34, 29.34, 29.41, 29.41, 29.41, 29.41, 29.41, 29.48, 29.48, 29.48, 29.48, 29.48, 29.48, 29.48, 29.48, 29.48, 29.48, 29.41, 29.41, 29.41, 29.41, 29.41, 29.37, 29.37, 29.37, 29.37, 29.37, 29.47, 
29.47, 29.47, 29.47, 29.47, 29.67, 29.67, 29.67, 29.67, 29.67, 29.78, 29.78, 29.78, 29.78, 29.78, 29.9, 29.9, 29.9, 29.9, 29.9, 29.96, 29.96, 29.96, 29.96, 29.96]
                    

kv_cache_usage_ratio

---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 544 iterations"
    y-axis "llamacpp:kv_cache_usage_ratio"
    x-axis "llamacpp:kv_cache_usage_ratio" 1715609591 --> 1715610219
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.13, 0.13, 0.13, 0.13, 0.13, 0.32, 0.32, 0.32, 0.32, 0.32, 0.15, 0.15, 0.15, 0.15, 0.15, 0.17, 0.17, 0.17, 0.17, 0.17, 0.21, 0.21, 0.21, 0.21, 0.21, 0.24, 0.24, 0.24, 0.24, 0.24, 0.18, 0.18, 0.18, 0.18, 0.18, 0.17, 0.17, 0.17, 0.17, 0.17, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.15, 0.26, 0.26, 0.26, 0.26, 0.26, 0.32, 0.32, 0.32, 0.32, 0.32, 0.18, 0.18, 0.18, 0.18, 0.18, 0.31, 0.31, 0.31, 0.31, 0.31, 0.41, 0.41, 0.41, 0.41, 0.41, 0.27, 0.27, 0.27, 0.27, 0.27, 0.16, 0.16, 0.16, 0.16, 0.16, 0.07, 0.07, 0.07, 0.07, 0.07, 0.31, 0.31, 0.31, 0.31, 0.31, 0.23, 0.23, 0.23, 0.23, 0.23, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.33, 0.33, 0.33, 0.33, 0.33, 0.07, 0.07, 0.07, 0.07, 0.07, 0.13, 0.13, 0.13, 0.13, 0.13, 0.31, 0.31, 0.31, 0.31, 0.31, 0.23, 0.23, 0.23, 0.23, 0.23, 0.18, 0.18, 0.18, 0.18, 0.18, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.22, 0.22, 0.22, 0.22, 0.22, 0.21, 0.21, 0.21, 0.21, 0.21, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.16, 0.29, 0.29, 0.29, 0.29, 0.29, 0.26, 0.26, 0.26, 0.26, 0.26, 0.33, 0.33, 0.33, 0.33, 0.33, 0.2, 0.2, 0.2, 0.2, 0.2, 0.24, 0.24, 0.24, 0.24, 0.24, 0.15, 0.15, 0.15, 0.15, 0.15, 0.13, 0.13, 0.13, 0.13, 0.13, 0.19, 0.19, 0.19, 0.19, 0.19, 0.27, 0.27, 0.27, 0.27, 0.27, 0.5, 0.5, 0.5, 0.5, 0.5, 0.46, 0.46, 0.46, 0.46, 0.46, 0.49, 0.49, 0.49, 0.49, 0.49, 0.11, 0.11, 0.11, 0.11, 0.11, 0.27, 0.27, 0.27, 0.27, 0.27, 0.3, 0.3, 0.3, 0.3, 0.3, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.11, 0.15, 0.15, 0.15, 0.15, 0.15, 0.24, 0.24, 0.24, 0.24, 0.24, 0.25, 0.25, 0.25, 0.25, 0.25, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.17, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.08, 0.08, 0.08, 0.08, 0.08, 0.15, 0.15, 0.15, 0.15, 0.15, 0.18, 0.18, 0.18, 0.18, 0.18]
                    
requests_processing
---
config:
    xyChart:
        titleFontSize: 12
        width: 900
        height: 600
    themeVariables:
        xyChart:
            titleColor: "#000000"
---
xychart-beta
    title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 544 iterations"
    y-axis "llamacpp:requests_processing"
    x-axis "llamacpp:requests_processing" 1715609591 --> 1715610219
    line [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 2.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 4.0, 4.0, 4.0, 4.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 2.0, 2.0, 2.0, 2.0, 2.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 8.0, 8.0, 8.0, 8.0, 8.0, 7.0, 7.0, 7.0, 7.0, 7.0, 6.0, 6.0, 6.0, 6.0, 6.0, 7.0, 7.0, 7.0, 7.0, 7.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 7.0, 7.0, 7.0, 7.0, 7.0, 8.0, 8.0, 8.0, 8.0, 8.0, 5.0, 5.0, 5.0, 5.0, 5.0, 8.0, 8.0, 8.0, 8.0, 8.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0, 2.0, 2.0, 6.0, 6.0, 6.0, 6.0, 6.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 6.0, 6.0, 6.0, 6.0, 6.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 6.0, 6.0, 6.0, 6.0, 6.0, 3.0, 3.0, 3.0, 3.0, 3.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 7.0, 3.0, 3.0, 3.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0, 4.0]
                    

Labels

refactoring, Review Complexity : Medium


4 participants