Use slow tokenizer for LLaMA by WoosukKwon · Pull Request #84 · vllm-project/vllm

WoosukKwon · 2023-05-08T01:08:40Z

Fixes #80
Should be merged after #82

This PR changes the frontends to stop using the LLaMA fast tokenizer, which triggers a protobuf bug. The frontends should use the slow (non-fast) tokenizer instead.
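A minimal sketch of the idea, not the code from this PR: it assumes the tokenizer is loaded through Hugging Face `transformers` and that LLaMA models can be recognized by name. The helper names `tokenizer_kwargs` and `get_tokenizer`, and the name-matching rule, are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict


def tokenizer_kwargs(model_name: str) -> Dict[str, Any]:
    """Return keyword arguments for AutoTokenizer.from_pretrained.

    Hypothetical helper: requests the slow tokenizer for model names
    that contain "llama" and keeps the library default otherwise.
    """
    if "llama" in model_name.lower():
        # use_fast=False asks transformers for the Python (slow) tokenizer.
        return {"use_fast": False}
    return {}


def get_tokenizer(model_name: str):
    """Load a tokenizer, avoiding the fast implementation for LLaMA."""
    # Imported lazily so tokenizer_kwargs works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_name, **tokenizer_kwargs(model_name))
```

Keeping the decision in a small pure function makes it easy to check without downloading a model, and gives one place to remove the special case once the upstream issue is resolved.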

### What this PR does / why we need it? Changed default block_size in platform.py from 16 to 128, as Ascend Devices have a better affinity for block size 128. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? CI passed Signed-off-by: hzji210@gmail.com <hzji210@gmail.com>

* Milestone 1 of Internal Process-level Fault Tolerance * Milestone 1 of Internal Process-level Fault Tolerance (vllm-project#61) * feat(fault-tolerance): add class skeletons for fault tolerance Signed-off-by: fangyuchu <fangyuchu@qq.com> * config: add configuration options for fault tolerance Signed-off-by: fangyuchu <fangyuchu@qq.com> * 增加generate_identity和generate_identitys函数 Generate a unique identity for ZMQ ROUTER node * add service startup configuradtion fault report addr * add init WorkerGuard * add engine_core_cmd_addr、fault_report_addr、client_cmd_addr、engine_core_identitys in EngineZmqAddresses init engine_core_cmd_addr、fault_report_addr、client_cmd_addr in launch_core_engines func add _report_engine_dead func in CoreEngineProcManager * init ClientGuard init EngineZmqAddresses engine_core_identitys * init EngineCoreGuard * change generate_identitys to generate_identity_group * code typesetting is optimized * code typesetting is optimized * changed code format ensure every line < 88 chars * changed code format ensure every line < 88 chars fix error Value of type "dict[Any, Any] | None" is not indexable [index] * fix bug Error: vllm/v1/engine/utils.py:122:89: E501 Line too long (117 > 88) Error: vllm/v1/engine/utils.py:1059:9: F402 Import `uuid` from line 6 shadowed by loop variable * fix Error: vllm/v1/engine/utils.py:1045: error: Need type annotation for "uuids" (hint: "uuids: set[<type>] = ...") [var-annotated] * fix error: Value of type "dict[Any, Any] | None" is not indexable [index] * fix error: Value of type "dict[Any, Any] | None" is not indexable [index] Signed-off-by: a798347923 <2645302020@qq.com> * add _send_msg in EngineCoreGuard Signed-off-by: a798347923 <2645302020@qq.com> * add import torch.cuda * add _recv_cmd function docstring that clearly explains the meaning of the return value. 
* changed recv_fault_msg to recv_msg add ClientGuard __init__ func parameter types * add engine monitor Signed-off-by: TianZhuo <2770730562@qq.com> * Delete requirements/test.txt~ Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Delete vllm/v1/engine/core_client.py~ Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * simply _send_msg and _recv_cmd in EngineCoreGuard * simply recv_msg in ClientGuard * engine: add fault tolerance features for EngineCore. Signed-off-by: fangyuchu <fangyuchu@qq.com> * engine: add timeout mechanism in retry. Signed-off-by: fangyuchu <fangyuchu@qq.com> * add engine monitor * Delete vllm/v1/engine/exceptions.py~ Signed-off-by: 205150940 <112750056+205150940@users.noreply.github.com> * updata actor_index * updata enginedead flag * handle fault and report exception Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix engine_actor * fix engine_actor fault_info * handle fault and report exception Signed-off-by: w00689259 <wangzhuo66@huawei.com> * delete num_identity * changed try expect * fix debug error * fix one bug. Signed-off-by: fangyuchu <fangyuchu@qq.com> * add fault_report_addr in FaultToleranceConfig * add handle fault&get_fault_info api Signed-off-by: w00689259 <wangzhuo66@huawei.com> * remove fault_report_address in CoreEngineActorManager __init__ Signed-off-by: a798347923 <2645302020@qq.com> * ruff format Signed-off-by: a798347923 <2645302020@qq.com> * add handle fault&get_fault_info api Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix one bug. 
Signed-off-by: fangyuchu <fangyuchu@qq.com> * add fault_report_port in FaultToleranceConfig Signed-off-by: a798347923 <2645302020@qq.com> * add zmq_addr concatenate with fault_report_addr and fault_report_port Signed-off-by: a798347923 <2645302020@qq.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix some bug * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fault reporter bug fix Signed-off-by: w00689259 <wangzhuo66@huawei.com> * remove fault_report_addr in FaultToleranceConfig Signed-off-by: a798347923 <2645302020@qq.com> * refactor: relocate method serialization functions to serial_util.py Signed-off-by: fangyuchu <fangyuchu@qq.com> * fix actor bug * fix actor bug * add engine_core_cmd_addr in FaultToleranceConfig Signed-off-by: a798347923 <2645302020@qq.com> * add and use _stop_worker_execution in EngineCoreGuard Signed-off-by: a798347923 <2645302020@qq.com> * add and use run in WorkerGuard Signed-off-by: a798347923 <2645302020@qq.com> * fix actor bug * fix bug * fix sentinel * fix bug vllm/v1/engine/core.py:847: error: Missing positional argument "tp_size" in call to "EngineCoreGuard" Signed-off-by: a798347923 <2645302020@qq.com> * fix bug error: Missing positional arguments "length", "byteorder" in call to "to_bytes" of "int" Signed-off-by: a798347923 <2645302020@qq.com> * fix bug in fault tolerance mode Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix bug in fault tolerance mode Signed-off-by: w00689259 <wangzhuo66@huawei.com> * change fault_report_port to internal_fault_report_port add external_fault_notify_port Signed-off-by: 
a798347923 <2645302020@qq.com> * change fault_report_port to internal_fault_report_port add external_fault_notify_port Signed-off-by: a798347923 <2645302020@qq.com> * add _recv_cmd func use deserialize_method_call and run_method in run func Signed-off-by: a798347923 <2645302020@qq.com> * Update core.py fix bug error: Need type annotation for "kwargs" (hint: "kwargs: dict[<type>, <type>] = ...") Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * add self.ctx.term() in shutdown() Signed-off-by: a798347923 <2645302020@qq.com> * changed import deserialize_method_call,serialize_method_call Signed-off-by: a798347923 <2645302020@qq.com> * changed init worker_guard in init_device Signed-off-by: a798347923 <2645302020@qq.com> * Update core.py add import serialize_method_call Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Update gpu_worker.py changed init WorkerGuard in init_device Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Update gpu_worker.py FIX BUG self.worker_guard: WorkerGuard|None = None Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Update gpu_worker.py fix bug error: Argument 1 to "deserialize_method_call" has incompatible type "str | None"; expected "str" [arg-type] Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Update gpu_worker.py ruff format Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Update core.py ruff-format Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * actively send exception information Signed-off-by: w00689259 <wangzhuo66@huawei.com> * actively send exception information Signed-off-by: w00689259 <wangzhuo66@huawei.com> * actively send exception information Signed-off-by: w00689259 <wangzhuo66@huawei.com> * change engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses Signed-off-by: a798347923 <2645302020@qq.com> * change 
engine_core_cmd_addr(str) to engine_core_cmd_addrs(list[str]) in EngineZmqAddresses Signed-off-by: a798347923 <2645302020@qq.com> * Update utils.py delete engine_core_cmd_addr in EngineZmqAddresses Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> * Remove redundant configuration: fault-pub-port Signed-off-by: fangyuchu <fangyuchu@qq.com> * Send pause instructions after receiving fault info in ClientGuard Signed-off-by: fangyuchu <fangyuchu@qq.com> * change engine_core_guard_identities from dict[int, bytes] to list[bytes] Signed-off-by: a798347923 <2645302020@qq.com> * fix bug "only the worker guard of engine core 0 can receive messages sent from engine core guard Signed-off-by: a798347923 <2645302020@qq.com> * change local_rank to rank_in_group in WorkerGuard Signed-off-by: a798347923 <2645302020@qq.com> * changed del self.client_cmd_registry[int(unhealthy_engine.engine_id)] Signed-off-by: a798347923 <2645302020@qq.com> * add gloo communication timeout * fix some bug * add stateless_process_group gloo_comm_timeout * reconstruct fault receiver&fault handler Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix some bug * reconstruct fault receiver&fault handler Signed-off-by: w00689259 <wangzhuo66@huawei.com> * reconstruct fault receiver&fault handler Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix return format Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix return format Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix return format Signed-off-by: w00689259 <wangzhuo66@huawei.com> * add abort request * fix some bug * fix some bug * fix some bug * add dt for client guard Signed-off-by: w00689259 <wangzhuo66@huawei.com> * add dt for client guard Signed-off-by: w00689259 <wangzhuo66@huawei.com> * add dt for client guard Signed-off-by: w00689259 <wangzhuo66@huawei.com> * Implementation of two types of pause: a soft one by using flag signals and a hard one by aborting nccl communicators. 
Signed-off-by: fangyuchu <fangyuchu@qq.com> * Refine certain log forms and fix a minor bug in pause function. Signed-off-by: fangyuchu <fangyuchu@qq.com> * Refactor and abstract the recv_msg logic in CG,ECG,WG. Signed-off-by: fangyuchu <fangyuchu@qq.com> * Add and check method uuid when sending commands and receiving results. Signed-off-by: fangyuchu <fangyuchu@qq.com> * Abstract the logic of sending instructions and waiting responses from FaultHandler Signed-off-by: fangyuchu <fangyuchu@qq.com> * Add options in EngineCoreGuard to recv execution results from WorkerGuard Signed-off-by: fangyuchu <fangyuchu@qq.com> * Support worker reinitialization after hard pause; add task queue in FaultHandler to ensure sequential task execution Signed-off-by: fangyuchu <fangyuchu@qq.com> * resolve conflicts Signed-off-by: w00689259 <wangzhuo66@huawei.com> * resolve conflicts Signed-off-by: w00689259 <wangzhuo66@huawei.com> * resolve conflicts Signed-off-by: w00689259 <wangzhuo66@huawei.com> * resolve conflicts Signed-off-by: w00689259 <wangzhuo66@huawei.com> * resolve conflicts Signed-off-by: w00689259 <wangzhuo66@huawei.com> * resolve conflicts Signed-off-by: w00689259 <wangzhuo66@huawei.com> * add engine core ut Signed-off-by: w00689259 <wangzhuo66@huawei.com> * add engine core ut Signed-off-by: w00689259 <wangzhuo66@huawei.com> * Ensure WorkerGuard command execution returns result; fix missing set_device when TP>1 Signed-off-by: fangyuchu <fangyuchu@qq.com> * rename& format logger Signed-off-by: w00689259 <wangzhuo66@huawei.com> * rename& format logger Signed-off-by: w00689259 <wangzhuo66@huawei.com> * feat(nccl): enable non-blocking NCCL communicators to support ncclCommAbort Signed-off-by: fangyuchu <fangyuchu@qq.com> * reinit dp_group * fix bug * fix bug * fix bug * fix bug (vllm-project#54) * Move requests to waiting queue instead of abandoing them directly. 
Signed-off-by: fangyuchu <fangyuchu@qq.com> * add annotation Signed-off-by: w00689259 <wangzhuo66@huawei.com> * fix typos Signed-off-by: fangyuchu <fangyuchu@qq.com> --------- Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: a798347923 <2645302020@qq.com> Signed-off-by: TianZhuo <2770730562@qq.com> Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> Signed-off-by: 205150940 <112750056+205150940@users.noreply.github.com> Signed-off-by: w00689259 <wangzhuo66@huawei.com> Signed-off-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com> Co-authored-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com> Co-authored-by: a798347923 <2645302020@qq.com> Co-authored-by: TianZhuo <2770730562@qq.com> Co-authored-by: 205150940 <112750056+205150940@users.noreply.github.com> Co-authored-by: a798347923 <39047817+a798347923@users.noreply.github.com> Co-authored-by: w00689259 <wangzhuo66@huawei.com> * Fix DT and zmq socket closing issues, updated names per feedback and reinitialize dp_group with new port Signed-off-by: fangyuchu <fangyuchu@qq.com> * Improve documentation and logging in API server Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fix hanging issue in DT; fix hang when aborting communicators from Python side; use queue.Queue for engine_exception_q Signed-off-by: fangyuchu <fangyuchu@qq.com> * Refactor fault tolerance modules by renaming classes to Sentinel and converting engine_registry to a dict Signed-off-by: fangyuchu <fangyuchu@qq.com> * reject requests when engine is in fault status Signed-off-by: fangyuchu <fangyuchu@qq.com> * clear batch_queue for async scheduling Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fix incorrect initialization of worker_cmd_socket in multi-node setups Signed-off-by: fangyuchu <fangyuchu@qq.com> * Switch from field to Field Signed-off-by: fangyuchu <fangyuchu@qq.com> * Unify start_engine_core_monitor in MPClient and CoreEngineProcManager to reduce duplication Signed-off-by: fangyuchu <fangyuchu@qq.com> * 
refactor(Sentinel): Abstract and refactor class to standardize fault … (vllm-project#84) * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> * refactor(Sentinel): Abstract and refactor class to standardize fault tolerance logic Signed-off-by: w00689259 <wangzhuo66@huawei.com> --------- Signed-off-by: w00689259 <wangzhuo66@huawei.com> Co-authored-by: w00689259 <wangzhuo66@huawei.com> Signed-off-by: zWaNg3 <389750525@qq.com> * fix bug in tests Signed-off-by: w00689259 <wangzhuo66@huawei.com> Signed-off-by: zWaNg3 <389750525@qq.com> * fix bug in tests 
Signed-off-by: w00689259 <wangzhuo66@huawei.com> Signed-off-by: zWaNg3 <389750525@qq.com> * refactor: improve naming and add comments for readability Signed-off-by: fangyuchu <fangyuchu@qq.com> * Pass fault_tolerance_config through process group creation for future extensibility Signed-off-by: fangyuchu <fangyuchu@qq.com> * Switch to native preempt_request implementation Signed-off-by: fangyuchu <fangyuchu@qq.com> * Rename base_sentinel.py Signed-off-by: fangyuchu <fangyuchu@qq.com> * refactor(api_server): Split fault_tolerance interfaces into standalone files Signed-off-by: fangyuchu <fangyuchu@qq.com> * Use zmq poll for socket receive in Sentinel DT to avoid hanging Signed-off-by: fangyuchu <fangyuchu@qq.com> * Add shutdown-on-fault-tolerance-failure config option Signed-off-by: fangyuchu <fangyuchu@qq.com> * ClientSentinel: add extra check to prevent repeated pause commands on error Signed-off-by: fangyuchu <fangyuchu@qq.com> * feat(pause): apply pause with target index Signed-off-by: zWaNg3 <389750525@qq.com> * Add middleware for fault tolerance Signed-off-by: fangyuchu <fangyuchu@qq.com> * fix engine_actor monitoring function bug Signed-off-by: TianZhuo <2770730562@qq.com> * fix engine_actor monitoring bug Signed-off-by: TianZhuo <2770730562@qq.com> * logger output format Signed-off-by: TianZhuo <2770730562@qq.com> * refactor(client_sentinel): support ClientSentinel-Client communication; refactor internal socket logic Signed-off-by: zWaNg3 <389750525@qq.com> * feat: add FaultToleranceRequest and FaultToleranceResult Signed-off-by: fangyuchu <fangyuchu@qq.com> * feat: add EngineStatusType enum and support paused state Signed-off-by: fangyuchu <fangyuchu@qq.com> * Unify the logic of engine monitor for engine process manager and engine actor manager Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fix the hanging issue of ClientSentinel in the shutdown Signed-off-by: fangyuchu <fangyuchu@qq.com> * refactor(client_sentinel): rename process_ft_requests_loop function 
and run function Signed-off-by: zWaNg3 <389750525@qq.com> * Use VllmConfig as the input of Sentinel Modules Signed-off-by: fangyuchu <fangyuchu@qq.com> * Remove redundant @DataClass from FaultToleranceConfig Signed-off-by: fangyuchu <fangyuchu@qq.com> * Move hardcoded vllm_fault topic string into FaultToleranceConfig Signed-off-by: fangyuchu <fangyuchu@qq.com> * Update corresponding tests to new ClientSentinel design. Signed-off-by: fangyuchu <fangyuchu@qq.com> * Update engine core sentinel tests. Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fix incorrect device settings in the pause of worker sentinel. Signed-off-by: fangyuchu <fangyuchu@qq.com> * Code cleanup and readability improvements Signed-off-by: fangyuchu <fangyuchu@qq.com> * Simplify FaultInfo and improve the readability Signed-off-by: fangyuchu <fangyuchu@qq.com> * Move sentinels into one file Signed-off-by: fangyuchu <fangyuchu@qq.com> * Remove recv_router_dealer_message Signed-off-by: fangyuchu <fangyuchu@qq.com> * Simplify the code in BaseSentinel Signed-off-by: fangyuchu <fangyuchu@qq.com> * Simplify the code in EngineCoreSentinel Signed-off-by: fangyuchu <fangyuchu@qq.com> * Introduce fault_tolerance utils and address dataclass Signed-off-by: fangyuchu <fangyuchu@qq.com> * refactor: split different sentinels into separate files Signed-off-by: fangyuchu <fangyuchu@qq.com> * refactor: split worker sentinel into v1/worker/sentinel for better plugin support and hardware adaptation Signed-off-by: fangyuchu <fangyuchu@qq.com> * remove ThreadSafeDict Signed-off-by: fangyuchu <fangyuchu@qq.com> * Simplify the communication between client, client sentinel and engine core sentinel (vllm-project#137) * refactor(client_sentinel): use core_client input_socket to broadcast ft_requst Signed-off-by: zWaNg3 <389750525@qq.com> * refactor(client_sentinel): use core_client input_socket to broadcast ft_requst Signed-off-by: zWaNg3 <389750525@qq.com> * refactor(client_sentinel): use core_client input_socket to broadcast 
ft_requst Signed-off-by: zWaNg3 <389750525@qq.com> * add _send_utility_result in ClientSentinel Signed-off-by: fangyuchu <fangyuchu@qq.com> * Processes fault-tolerant requests and forwards them to output. Signed-off-by: yzchang-plus <1078477584@qq.com> * replace uncertain code with TODO Signed-off-by: yzchang-plus <1078477584@qq.com> * add monitoring logic in client sentinel and implement thread-safe pause in monitoring. Signed-off-by: fangyuchu <fangyuchu@qq.com> * refactor(client_sentinel): send ft request using input_address Signed-off-by: zWaNg3 <389750525@qq.com> * refactor(client_sentinel): return ft result to client Signed-off-by: zWaNg3 <389750525@qq.com> * Use call_utility_async for interactions between client, client_sentinel and engine core sentinel Signed-off-by: fangyuchu <fangyuchu@qq.com> * Rename engine recovery timeout config Signed-off-by: fangyuchu <fangyuchu@qq.com> * Remove upstream and downstream concept from the base sentinel Signed-off-by: fangyuchu <fangyuchu@qq.com> * Support passing stateless dp port to retry Signed-off-by: fangyuchu <fangyuchu@qq.com> * Improve the shutdown of client sentinel Signed-off-by: fangyuchu <fangyuchu@qq.com> * Add try except for handle_fault in engine core Signed-off-by: fangyuchu <fangyuchu@qq.com> --------- Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: yzchang-plus <1078477584@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: yzchang-plus <1078477584@qq.com> --------- Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: a798347923 <2645302020@qq.com> Signed-off-by: TianZhuo <2770730562@qq.com> Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> Signed-off-by: 205150940 <112750056+205150940@users.noreply.github.com> Signed-off-by: w00689259 <wangzhuo66@huawei.com> Signed-off-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com> Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: yzchang-plus 
<1078477584@qq.com> Co-authored-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com> Co-authored-by: a798347923 <2645302020@qq.com> Co-authored-by: TianZhuo <2770730562@qq.com> Co-authored-by: 205150940 <112750056+205150940@users.noreply.github.com> Co-authored-by: a798347923 <39047817+a798347923@users.noreply.github.com> Co-authored-by: w00689259 <wangzhuo66@huawei.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: yzchang-plus <1078477584@qq.com> Signed-off-by: fangyuchu <fangyuchu@qq.com> * refactor(dt tests of sentinels): add dt tests for sentinels Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: fangyuchu <fangyuchu@qq.com> * Remove torch.cuda API call (vllm-project#148) * Remove torch.cuda API call Signed-off-by: fangyuchu <fangyuchu@qq.com> * Remove unwanted shutdown Signed-off-by: fangyuchu <fangyuchu@qq.com> --------- Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fault Tolerant EP: Implement fault-report Signed-off-by: fangyuchu <fangyuchu@qq.com> * merge engine monitor codes Signed-off-by: fangyuchu <fangyuchu@qq.com> * Move FT router attachment point and simplify FaultInfo initialization logic Signed-off-by: fangyuchu <fangyuchu@qq.com> * Revise DT for Fault Report Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fix incorrect count of engine core index Signed-off-by: fangyuchu <fangyuchu@qq.com> * Update engine process monitoring codes Signed-off-by: fangyuchu <fangyuchu@qq.com> * [Bugfix] revise engine monitor logic on account of dead processes Signed-off-by: fangyuchu <fangyuchu@qq.com> * Improve the format of the fault report json Signed-off-by: fangyuchu <fangyuchu@qq.com> * Fix incorrect shutdown of engine manager Signed-off-by: fangyuchu <fangyuchu@qq.com> * Avoid error logging in normal shutdown Signed-off-by: fangyuchu <fangyuchu@qq.com> * handle zmq error Signed-off-by: fangyuchu <fangyuchu@qq.com> --------- Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: a798347923 <2645302020@qq.com> Signed-off-by: TianZhuo 
<2770730562@qq.com> Signed-off-by: a798347923 <39047817+a798347923@users.noreply.github.com> Signed-off-by: 205150940 <112750056+205150940@users.noreply.github.com> Signed-off-by: w00689259 <wangzhuo66@huawei.com> Signed-off-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com> Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: yzchang-plus <1078477584@qq.com> Co-authored-by: zWaNg3 <37772915+zWaNg3@users.noreply.github.com> Co-authored-by: a798347923 <2645302020@qq.com> Co-authored-by: TianZhuo <2770730562@qq.com> Co-authored-by: 205150940 <112750056+205150940@users.noreply.github.com> Co-authored-by: a798347923 <39047817+a798347923@users.noreply.github.com> Co-authored-by: w00689259 <wangzhuo66@huawei.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: yzchang-plus <1078477584@qq.com> Signed-off-by: fangyuchu <fangyuchu@qq.com>

`https://pytorch.org/docs/stable/objects.inv` started returning 404 because pytorch/docs PR vllm-project#84 (merged 2026-04-29 23:09 UTC) replaced the previous filesystem symlinks under `stable/` with per-page HTML redirect stubs. Non-HTML build artifacts such as `objects.inv`, `searchindex.js` and `_static/version_switcher.json` were not stubbed or copied over, so any tool that fetches them under `/stable/` 404s. Every Read the Docs build of `latest` since 2026-04-29 23:59 UTC aborts with: ERROR - mkdocstrings: Couldn't load inventory https://pytorch.org/docs/stable/objects.inv through handler 'python': HTTP Error 404: Not Found Aborted with 1 errors, 20 warnings in strict mode\! Temporary fix: point at `pytorch.org/docs/2.11/objects.inv`, which is what `/stable/` aliases to today (verified: `docs.pytorch.org/docs/stable/index.html` is a JS redirect to `../2.11/index.html`, and 2.11 is the latest PyTorch release on PyPI). This needs a one-line bump on the next stable release. Revert to `/stable/` once the upstream stub script is fixed to carry non-HTML assets through. Signed-off-by: Stefano Castagnetta <scastagnetta@nvidia.com>

WoosukKwon added 27 commits May 6, 2023 22:11

Move

6c9799a

http_frontend -> frontend

bacf49c

Move

a16c1f3

Move controller

9b868d1

Minor

b2ff569

Fix import errors

38b946b

Move controller back to worker

62c7175

Rename

26aafdc

mv

8419df6

Add __init__.py

d9520b4

Minor

61bf05f

Move set_random_seeds

799ce53

Fix imports

b6f6d4c

Extract out initialize_dummy_weights

a25b37d

Minor

e2ea5cc

Minor

de47b95

Fix import errors on parallel utils

724dc90

Add __init__.py

fd2647f

Fix parallel_utils

7755a7a

Minor

a95bd42

Fix weight loading

4dc8e9e

Annotate types

e6ffa80

Fix type

da591aa

sample -> sampler

0ae70da

Minor

f1d2700

Merge branch 'main' into refactor-arch

338b2f4

Do not use fast llama tokenizer

3842987

WoosukKwon changed the title ~~Do not use LLaMA fast tokenizer~~ Use slow tokenizer for LLaMA May 8, 2023

WoosukKwon added 2 commits May 9, 2023 22:52

Merge branch 'main' into tokenizer

acb2855

Fix merge errors

be3f6c7

WoosukKwon added 2 commits May 9, 2023 22:56

Add a tracking issue in comment

729e14b

Minor refactoring

9fe9fbf

WoosukKwon merged commit 85eb631 into main May 9, 2023

WoosukKwon deleted the tokenizer branch May 9, 2023 23:03

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024

Use slow tokenizer for LLaMA (vllm-project#84)

cee6d9e

dllehr-amd pushed a commit to dllehr-amd/vllm that referenced this pull request Jul 22, 2024

add TP=1 moe tuning for mixtral-8x7B (vllm-project#84)

cddc83f

jikunshang pushed a commit to jikunshang/vllm that referenced this pull request Aug 15, 2024

Introduce delayed sampling mechanism (vllm-project#84)

77e1ab8

Co-authored-by: Krzysztof Laskowski <klaskowski@habana.ai>

JackChuang mentioned this pull request Mar 17, 2025

[Feature]: Introduce a Triton-only Transformer Execution Path in SGLang bytedance-iaas/vllm#82

Closed

dtrifiro pushed a commit to dtrifiro/vllm that referenced this pull request Apr 17, 2025

Merge pull request vllm-project#84 from moulalis/rename_220

754ea37

rename files

iwooook pushed a commit to moreh-dev/vllm that referenced this pull request Nov 29, 2025

Fix for latest tt-metal - replace ttnn get_devices calls with get_num…

e222ec8

…_devices (vllm-project#84) Signed-off-by: Salar <skhorasgani@tenstorrent.com> (cherry picked from commit 5999673)

ivnle mentioned this pull request Jan 8, 2026

Use runtime profiling to replace manual memory analyzers #81

Merged

tjtanaa pushed a commit to tjtanaa/vllm that referenced this pull request Jan 29, 2026

[Bugfix] Fix removal of old logs when stats are enabled (vllm-project#84

f4633fe

) Signed-off-by: syedmba <syedmba7@connect.hku.hk>

WoosukKwon merged 31 commits into main from tokenizer

1 participant

Uh oh!

Conversation

WoosukKwon commented May 8, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant