Code Llama 70b triton crashes with XQA #1256
I was looking into the XQA prepare and dispatch functions here https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/decoderMaskedMultiheadAttention/decoderXQARunner.h#L185
@jdemouth-nvidia @byshiue @ncomly-nvidia @nekorobov This is a major issue impeding our ability to make progress on serving our models more efficiently. Do you have any updates on when XQA will actually be working? Thanks!
I will create an MR to remove the mPreparedCalled check for now.
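For readers following along, here is a minimal sketch (hypothetical class and method names, modeled only on the `mPreparedCalled` flag mentioned above, not on the actual `decoderXQARunner.h` internals) of the prepare/dispatch contract being discussed: `dispatch` guards on a flag that `prepare` sets, so callers that skip `prepare` trip the check.

```cpp
#include <cassert>

// Hypothetical sketch of the prepare-before-dispatch contract.
// Real code lives in decoderXQARunner.h and asserts/aborts rather
// than returning a status; names here are illustrative only.
class XQARunnerSketch
{
public:
    void prepare()
    {
        // The real prepare() would set up workspace/tuning state;
        // for the affected path it is effectively a no-op, so only
        // the flag matters.
        mPreparedCalled = true;
    }

    bool dispatch()
    {
        // The guard that fires when callers skip prepare():
        if (!mPreparedCalled)
        {
            return false;
        }
        mPreparedCalled = false; // consume the preparation
        return true;
    }

private:
    bool mPreparedCalled = false;
};
```

This illustrates why removing the check is a safe stopgap when `prepare` does no real work: the flag is the only state the guard protects.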
Thanks @ming-wei. June |
Thanks @ming-wei and @juney-nvidia. Since it is a no-op, I tried uncommenting the lines below and rebuilt 0.8.0 from source.
The model (CodeLlama-70B with XQA, compiled as above) seems to run and serve requests! However, there are now repeats in the output. Here are samples of the outputs given
To clarify I used the instructions here: https://github.com/triton-inference-server/tensorrtllm_backend
Even with only 4 concurrent requests and
This is resolved: it looks like enabling strongly-typed mode and disabling multi-block mode worked.
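For anyone hitting the same crash, the fix described above would look roughly like the following at engine-build time. Note this is a sketch: the exact flag names (`--strongly_typed`, `--multi_block_mode`) and their spellings vary across TensorRT-LLM versions, so verify them against `trtllm-build --help` for your installed release; the checkpoint and output paths are placeholders.

```shell
# Hypothetical build invocation (flag names assumed from the 0.8-era CLI;
# check `trtllm-build --help` on your version before copying):
trtllm-build \
    --checkpoint_dir ./codellama_70b_ckpt \
    --output_dir ./codellama_70b_engine \
    --strongly_typed \
    --multi_block_mode disable
```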
System Info
- x86
- 8x H100 80GB
- v0.8.0
Who can help?
@Tracin
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I then use Triton to serve the model and send a few inference requests.
It works for the first few inferences, especially when there are only 1-2 at a time. But after a few inferences, or if we send 4 at a time, it stops working.
Expected behavior
It runs successfully and does not crash for batch sizes up to 32!
Actual behavior
It crashes with an XQA error after a few inferences
codellama_70b_xqa.log
Additional notes
.