Update TensorRT-LLM (#1530)
kaiyux authored Apr 30, 2024
1 parent 66ef1df commit 06c0e9b
Showing 200 changed files with 8,814 additions and 2,695 deletions.
25 changes: 25 additions & 0 deletions .github/workflows/auto_close_inactive_issues.yml
@@ -0,0 +1,25 @@
# Ref: https://docs.github.com/en/actions/managing-issues-and-pull-requests/closing-inactive-issues
name: Close inactive issues
on:
  schedule:
    - cron: "30 1 * * *"  # runs daily at 01:30 UTC

jobs:
  stale:
    runs-on: ubuntu-latest
    permissions:
      issues: write
      pull-requests: write
    steps:
      - uses: actions/stale@v9
        with:
          days-before-issue-stale: 30
          days-before-issue-close: 15
          stale-issue-label: "stale"
          exempt-issue-labels: ""
          stale-issue-message: "This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days."
          close-issue-message: "This issue was closed because it has been stalled for 15 days with no activity."
          days-before-pr-stale: -1  # never mark pull requests stale
          days-before-pr-close: -1  # never close pull requests
          repo-token: ${{ secrets.GITHUB_TOKEN }}
          debug-only: true  # dry-run mode: logs what would happen without closing issues
8 changes: 3 additions & 5 deletions benchmarks/cpp/README.md
@@ -67,7 +67,7 @@ If you want to get the logits, you could run gptSessionBenchmark with `--print_a

#### Prepare dataset

-Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input token ids, output tokens length and time delays* to control request rate by gptManagerBenchmark.
+Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.
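For orientation, here is a minimal sketch of what one entry in that json could look like. The field names (`input_ids`, `input_len`, `output_len`) are illustrative assumptions mirroring the three properties listed above; prepare_dataset.py defines the authoritative schema.

```python
# Hypothetical shape of a single processed-dataset entry; field names are
# assumed for illustration and may differ from prepare_dataset.py's actual
# output schema.
import json

sample = {
    "input_ids": [2061, 318, 262, 3139, 286, 4881, 30],  # input token ids
    "input_len": 7,                                      # input tokens length
    "output_len": 16,                                    # output tokens length
}
print(json.dumps({"samples": [sample]}, indent=2))
```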

This tool can be used in 2 different modes of traffic generation.

@@ -79,8 +79,6 @@ The tool will tokenize the words and instruct the model to generate a specified
python3 prepare_dataset.py \
    --tokenizer <path/to/tokenizer> \
    --output preprocessed_dataset.json \
-   [--request-rate 10] \
-   [--time-delay-dist exponential_dist] \
    dataset \
    --dataset-name <name of the dataset> \
    --dataset-split <split of the dataset to use> \
@@ -118,8 +116,6 @@ For example, setting mean=100 and std dev=10 would generate requests where 95.4% of the token lengths fall within 20 tokens (two standard deviations) of the mean.
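That coverage figure checks out: ±20 tokens is two standard deviations, and a normal distribution places about 95.45% of its mass within ±2σ. A quick standalone sketch (independent of the benchmark tooling) confirms it empirically; the prepare_dataset invocation follows below.

```python
# Sanity check: ~95.4% of N(mean=100, std=10) samples fall within
# 20 tokens (two standard deviations) of the mean.
import numpy as np

rng = np.random.default_rng(0)
lengths = rng.normal(loc=100, scale=10, size=100_000)
coverage = np.mean(np.abs(lengths - 100) <= 20)
print(f"coverage within +/-20 tokens: {coverage:.3f}")  # ~0.954
```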
```
python prepare_dataset.py \
    --output token-norm-dist.json \
-   --request-rate 10 \
-   --time-delay-dist constant \
    --tokenizer <path/to/tokenizer> \
    token-norm-dist \
    --num-requests 100 \
@@ -148,6 +144,7 @@ Take GPT-350M as an example for single GPU V1 batching
./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
    --type V1 \
+   --request_rate 10 \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```
@@ -157,6 +154,7 @@ Take GPT-350M as an example for 2-GPU inflight batching
mpirun -n 2 ./benchmarks/gptManagerBenchmark \
    --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
    --type IFB \
+   --request_rate 10 \
    --dataset ../../benchmarks/cpp/preprocessed_dataset.json \
    --max_num_samples 500
```
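Note that `--request_rate` takes over the role of the `--request-rate`/`--time-delay-dist` options this commit removes from prepare_dataset.py: request timing is now controlled by gptManagerBenchmark itself. As a rough illustration of what a fixed rate implies (assuming exponentially distributed inter-arrival gaps, a common Poisson-process model; the benchmark's actual scheduler may differ), 500 samples at 10 requests/s are submitted over roughly 50 seconds:

```python
# Illustration only: model request submission at an average rate of
# 10 req/s using exponential inter-arrival gaps. This is an assumed
# model, not gptManagerBenchmark's actual scheduling code.
import numpy as np

rng = np.random.default_rng(0)
gaps = rng.exponential(scale=1 / 10, size=500)  # mean gap = 0.1 s
print(f"total submission window: ~{gaps.sum():.0f} s")  # ~50 s
```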