@ggerganov
Member

The Persimmon arch does not seem to work correctly and is implemented in a convoluted way that does not fit the existing patterns. It's better to reimplement it from scratch.

@github-actions github-actions bot added the python python script changes label May 20, 2024
@mofosyne mofosyne added the Review Complexity : Low Trivial changes to code that most beginner devs (or those who want a break) can tackle. e.g. UI fix label May 20, 2024
@github-actions
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 536 iterations 🚀

Expand details for performance related PR only
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8703.72ms p(95)=21184.22ms fails=, finish reason: stop=480 truncated=56
  • Prompt processing (pp): avg=101.29tk/s p(95)=463.69tk/s
  • Token generation (tg): avg=47.82tk/s p(95)=46.93tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=gg/remove-persimmon commit=5d777e9c22d370bd5944c9002771b2f52da18637
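
As a quick sanity check on the summary above, the `key=value` pairs can be parsed and the finish reasons added up. This is only a minimal sketch: `parse_pairs` is a hypothetical helper (not part of llama.cpp or its bench scripts), and the two strings are copied by hand from the bullets above. It assumes that `stop` and `truncated` are the only finish reasons, in which case their sum should match the 536 iterations reported in the header.

```python
import re

# Summary lines copied by hand from the benchmark comment above.
REQUEST_LINE = ("HTTP request : avg=8703.72ms p(95)=21184.22ms fails=, "
                "finish reason: stop=480 truncated=56")
CONFIG_LINE = ("parallel=8 ctx-size=16384 ngl=33 batch-size=2048 "
               "ubatch-size=256 pp=1024 pp+tg=2048")


def parse_pairs(line):
    """Hypothetical helper: collect key=value pairs from a summary line.

    Keys may contain word characters, parentheses, '+' and '-'.
    A key with an empty value (such as 'fails=') is skipped, not guessed.
    """
    return dict(re.findall(r"([\w()+-]+)=([\w.]+)", line))


request = parse_pairs(REQUEST_LINE)
config = parse_pairs(CONFIG_LINE)

stop = int(request["stop"])
truncated = int(request["truncated"])
total = stop + truncated               # assumed to equal the iteration count
truncated_share = truncated / total    # fraction of requests cut off

print(f"{total} finished requests, {truncated_share:.1%} truncated")
print(f"ctx-size={config['ctx-size']} parallel={config['parallel']}")
```

With the figures above, the sum of the two finish reasons is 536, which agrees with the iteration count in the comment header, and roughly one request in ten was truncated.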

prompt_tokens_seconds

[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 536 iterations"; y-axis llamacpp:prompt_tokens_seconds; x-axis time, 1716200155 to 1716200777]

predicted_tokens_seconds

[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 536 iterations"; y-axis llamacpp:predicted_tokens_seconds; x-axis time, 1716200155 to 1716200777]

Details

kv_cache_usage_ratio

[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 536 iterations"; y-axis llamacpp:kv_cache_usage_ratio; x-axis time, 1716200155 to 1716200777]

requests_processing

[Chart: "llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m 536 iterations"; y-axis llamacpp:requests_processing; x-axis time, 1716200155 to 1716200777]

@mofosyne mofosyne merged commit fabf30b into master May 20, 2024
@mofosyne mofosyne deleted the gg/remove-persimmon branch May 20, 2024 16:35
4 participants