@ibehnam (Contributor) commented Mar 28, 2024

Commit 68e210b enabled continuous batching by default, but the server still accepted -cb | --cont-batching, which only sets continuous batching to true. I replaced those arguments with -nocb | --no-cont-batching so that continuous batching can be disabled in the server.
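
A minimal sketch of the inverted flag logic described above, written in Python for brevity (the server itself is C++). The `ServerParams` class and `parse_args` function are hypothetical names used for illustration and are not llama.cpp code; the only details taken from this PR are the default of true and the flag spellings -nocb and --no-cont-batching.

```python
from dataclasses import dataclass


@dataclass
class ServerParams:
    # Hypothetical stand-in for the server's parameter struct.
    # Continuous batching is on by default, as after commit 68e210b.
    cont_batching: bool = True


def parse_args(argv):
    """Return ServerParams for a list of command-line arguments.

    Only the opt-out flag is handled here; every other argument is
    ignored in this sketch.
    """
    params = ServerParams()
    for arg in argv:
        if arg in ("-nocb", "--no-cont-batching"):
            # The flag now disables a feature that is enabled by default.
            params.cont_batching = False
    return params


if __name__ == "__main__":
    print(parse_args([]).cont_batching)
    print(parse_args(["--no-cont-batching"]).cont_batching)
```

With this shape, omitting the flag leaves continuous batching on, and passing either spelling turns it off.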

@phymbert (Collaborator)

What is the motivation to disable continuous batching?

@phymbert (Collaborator) left a comment

Please update the README and the related tests.

@mofosyne added the "Review Complexity : Medium" and "enhancement" labels May 10, 2024

@github-actions (Contributor)

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 557 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8399.19ms p(95)=20545.17ms fails=, finish reason: stop=495 truncated=62
  • Prompt processing (pp): avg=97.95tk/s p(95)=454.09tk/s
  • Token generation (tg): avg=33.27tk/s p(95)=48.25tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=master commit=472a9b8be53dba6864411327c64f5fdd636c2196

prompt_tokens_seconds

[Chart: llamacpp:prompt_tokens_seconds over time (x-axis 1715474936 --> 1715475558); title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 557 iterations"]

predicted_tokens_seconds

[Chart: llamacpp:predicted_tokens_seconds over time (x-axis 1715474936 --> 1715475558); title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 557 iterations"]

Details

kv_cache_usage_ratio

[Chart: llamacpp:kv_cache_usage_ratio over time (x-axis 1715474936 --> 1715475558); title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 557 iterations"]

requests_processing

[Chart: llamacpp:requests_processing over time (x-axis 1715474936 --> 1715475558); title "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3 duration=10m 557 iterations"]
