The project started out by forking an older version of MineRL to get human interaction with the agent working, which was broken in the original JarvisVLA MineStudio implementation. I then noticed JarvisVLA failing at simple tasks like mining oak logs in MineRL 0.3.7's Minecraft env, while succeeding in the original JarvisVLA repo's Minecraft env.
The embedded JarvisVLA-oak is a fork of JarvisVLA that runs on my machine (Ubuntu 22.04.5 LTS, RTX 3090 Ti with CUDA 13) without conda. Follow the install instructions in the README and run "./run_oak_log_10.sh" to test JarvisVLA in the oak log gathering task for 10 iterations.
```bash
./full_install.sh
source agent_env/bin/activate
pip install -r requirements_agent.txt
python -m venv vllm_env
source vllm_env/bin/activate
pip install -r requirements_vllm.txt
```

Activate the agent venv:

```bash
source agent_env/bin/activate
```

In another terminal, run the VLLM server (hosted on port 3000 by default):
```bash
source vllm_env/bin/activate
./vllm.sh
```

Run the agent on your task (in a terminal with agent_env active):
```bash
python agent.py --task "<your task prompt>" --craft <item id>
```

Here `--task` is the text prompt sent to the VLA, and `--craft` is the programmatic name of a Minecraft item id that triggers environment completion when it is detected in the inventory.
For example, to prompt the agent to get oak logs:

```bash
python agent.py --task "harvest oak logs from the tree" --craft oak_log
```

This launches the Minecraft environment, and the agent takes actions until max_steps is reached or oak_log is obtained.
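For reference, this is roughly the kind of inventory check that ends the episode once the `--craft` item appears. It is a minimal sketch assuming MineRL-style observations with an `inventory` dict of item counts; the function name is illustrative, not taken from agent.py.

```python
# Sketch of a completion check on the --craft item (illustrative, not agent.py's code).
def episode_done(obs: dict, craft_item: str, step: int, max_steps: int) -> bool:
    inventory = obs.get("inventory", {})       # MineRL-style item counts
    if int(inventory.get(craft_item, 0)) > 0:  # target item obtained
        return True
    return step >= max_steps                   # otherwise stop at the step budget
```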
To interact with the agent during inference, open another terminal while the agent is still taking actions in the environment and run:

```bash
python -m minerl.interactor 6666
```

This automatically spawns a new Minecraft client and connects you to the agent's server.
JarvisVLA is very brittle. I tried several variations of the prompt "hit tree and get logs," and across all of them the success rate was low (<10%). In the successful cases, JarvisVLA spawned with the tree trunk in the center of the screen, within hitting range or only one or two blocks outside it.
The two main failure modes were tree detection and attack spamming. In the first case, the agent would start hitting objects that are not trees (dirt blocks or leaf blocks). In the second, it spent all of its steps attacking and doing nothing else.
I suspect the failures are caused by running the agent in a different Minecraft version (1.12.1) and with different graphics settings relative to its training data.
So I tested with the official JarvisVLA repo:

```bash
cd JarvisVLA-oak
./run_oak_log_10.sh
```

Result videos are saved in JarvisVLA-oak/logs/.
Both qualitatively and quantitatively, the agent performs MUCH better: a 70% success rate over 10 tries, and more complex behavior like moving around to actively find trees!
To replicate the JarvisVLA-oak environment exactly, I copied over its options.txt file while removing newer features. The primary changes needed to match MineStudio were switching MineRL's old FOV (130, Quake Pro mode) to 70 (MineStudio's default), setting gamma (brightness) to 2.0, and doubling the particle quality.
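For illustration, here is a minimal sketch of patching those values into options.txt before launch. The key names follow the vanilla options.txt format, but the stored value encodings (especially for FOV) and the file path depend on the MineRL/Malmo build, so treat them as assumptions.

```python
from pathlib import Path

# Assumed overrides mirroring MineStudio's settings; value encodings may differ per build
# (e.g. vanilla stores FOV as (degrees - 70) / 40, so 0.0 corresponds to 70 degrees).
OVERRIDES = {"fov": "0.0", "gamma": "2.0", "particles": "0"}

def patch_options(options_path: str) -> None:
    """Rewrite matching key:value lines in a Minecraft options.txt file."""
    path = Path(options_path)
    patched = []
    for line in path.read_text().splitlines():
        key = line.split(":", 1)[0]
        patched.append(f"{key}:{OVERRIDES[key]}" if key in OVERRIDES else line)
    path.write_text("\n".join(patched) + "\n")

# Path is illustrative; point it at the options.txt your MineRL build actually reads.
patch_options("path/to/minecraft/run/options.txt")
```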
But I was still seeing large differences in the agent's behavior, which made me suspect something else was wrong. Specifically, the agent was getting stuck in loops where it would output just the attack token over and over again.
Example log:

```
[Step 728] Getting action from agent...
task: Mine the oak log
[VLLM] Calling with 5 messages
[VLLM] Response: <|reserved_special_token_178|><|reserved_special_token_204|><|reserved_special_token_219|><|reserved...
[VLLM] Time: 410.6ms
[VLLM] Extracted 5 special tokens: [151835, 151861, 151876, 151897, 151836]...
wall clock time: 458.30
Action: forward=0, jump=0, attack=1, camera=[0. 0.]
```

I rechecked the special token → MineRL action mapping to confirm it is correct, as well as the message formatting being sent to VLLM.
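To make that check concrete, here is a minimal sketch of the decoding step in question. The token ids and their groupings below are hypothetical placeholders, not JarvisVLA's actual mapping; the action dict follows MineRL's conventions.

```python
import numpy as np

# Hypothetical token tables; JarvisVLA's real special-token ids and groupings differ.
BUTTON_TOKENS = {151001: "forward", 151002: "jump", 151003: "attack"}
CAMERA_TOKENS = {151101: np.array([0.0, 0.0]), 151102: np.array([0.0, 10.0])}

def tokens_to_action(token_ids):
    """Decode the VLA's special tokens into a MineRL-style action dict."""
    action = {"forward": 0, "jump": 0, "attack": 0, "camera": np.zeros(2)}
    for tid in token_ids:
        if tid in BUTTON_TOKENS:
            action[BUTTON_TOKENS[tid]] = 1   # binary button press
        elif tid in CAMERA_TOKENS:
            action["camera"] = CAMERA_TOKENS[tid]  # pitch/yaw delta in degrees
    return action
```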
I don't know the exact cause without exhaustively ablating the working JarvisVLA-oak implementation to reproduce the behavior seen in MineRL 0.3.7.
But given that the OpenAI-format message sent to the VLLM server hosting JarvisVLA is exactly the same in both repositories, I can say the cause is some Minecraft environment mismatch.
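For context, here is a minimal sketch of the kind of OpenAI-format chat-completion request both repositories send to the VLLM server. The model name, prompt, and image payload are placeholders; the port matches the default mentioned above.

```python
import base64
import requests

def query_vla(frame_png: bytes, task: str, port: int = 3000) -> dict:
    """POST one frame plus the task prompt to a vLLM OpenAI-compatible endpoint."""
    image_b64 = base64.b64encode(frame_png).decode()
    payload = {
        "model": "jarvis-vla",  # placeholder; use whatever model name vllm.sh serves
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                {"type": "text", "text": task},
            ],
        }],
        "max_tokens": 16,
    }
    resp = requests.post(f"http://localhost:{port}/v1/chat/completions", json=payload)
    resp.raise_for_status()
    return resp.json()
```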
This was my first venture into getting a "VLA" model to work, and it immediately ran into brittleness issues from small environment changes. It gave me a taste of how subtle shifts in the environment can break models that are strong on paper.
Changes made to the forked MineRL/Malmo code to get it running (the Python-side fixes are sketched below):

- `spaces.py` - Removed `self.shape = ()`
- `core.py` - Fixed `collections.Mapping` → `collections.abc.Mapping`
- `observables.py` - Fixed `np.int` → `int`
- `MalmoEnvServer.java` - Added UUID generation (50+ lines)
- `build.gradle` - Configured for local MixinGradle
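The Python-side entries are small compatibility shims for newer Python and NumPy versions. A sketch of the direction of those fixes (the surrounding MineRL code is omitted and the function names are illustrative):

```python
from collections.abc import Mapping  # collections.Mapping was removed in Python 3.10
import numpy as np

def is_mapping(obs) -> bool:
    # core.py-style fix: isinstance checks must use collections.abc.Mapping.
    return isinstance(obs, Mapping)

def as_index(value) -> int:
    # observables.py-style fix: np.int was removed in NumPy 1.24; cast with the
    # builtin int (or np.int64 when a fixed-width dtype is required).
    return int(np.asarray(value).item())
```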