Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
68 changes: 7 additions & 61 deletions resources_servers/comp_coding/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,9 @@

### Overview
Verifies competitive programming solutions by executing submitted code against unit tests. The server consumes agent trajectories and returns a reward based on whether the assistant's code produces the correct outputs for given test inputs.
Model registry link: https://gitlab-master.nvidia.com/bxyu/nemo-gym/-/ml/models/53#/
Model registry link:
- Train: https://gitlab-master.nvidia.com/bxyu/nemo-gym/-/ml/models/80#/
- Validation: https://gitlab-master.nvidia.com/bxyu/nemo-gym/-/ml/models/76#/

### Input schema
- `responses_create_params`: OpenAI Responses create params
Expand All @@ -11,64 +13,21 @@ Model registry link: https://gitlab-master.nvidia.com/bxyu/nemo-gym/-/ml/models/
- `unit_tests` (required): dict with `inputs` and `outputs` arrays containing test cases.
- `inputs`: list of strings representing stdin input for each test case
- `outputs`: list of strings representing expected stdout output for each test case
- `problem_id` (optional): unique identifier for the problem

**Notes**
- All test cases must pass for a solution to receive a reward of 1.0
- Failed test cases result in a reward of 0.0 with detailed error information

### Test execution (for now)
- Code is executed using Python's `exec()` function in a controlled environment
- Each test case runs with redirected stdin/stdout:
- `stdin` is populated with the test input
- `stdout` is captured for comparison with expected output
- Available built-ins include common functions: `input`, `print`, `range`, `len`, `int`, `str`, `list`, etc.
- Newlines in test data are properly handled (converts `\\n` to actual newlines)

### Example dataset row
```json
{
"responses_create_params": {
"input": [
{
"role": "user",
"content": "You are an expert competitive programmer. You will be given a problem statement and must output a complete Python solution that reads from stdin and writes to stdout.\n\nPolycarp has $n$ different binary words. A word called binary if it contains only characters '0' and '1'. For example, these words are binary: \"0001\", \"11\", \"0\" and \"0011100\".\n\nPolycarp wants to offer his set of $n$ binary words to play a game \"words\". In this game, players name words and each next word (starting from the second) must start with the last character of the previous word. The first word can be any. For example, these sequence of words can be named during the game: \"0101\", \"1\", \"10\", \"00\", \"00001\".\n\nWord reversal is the operation of reversing the order of the characters. For example, the word \"0111\" after the reversal becomes \"1110\", the word \"11010\" after the reversal becomes \"01011\".\n\nProbably, Polycarp has such a set of words that there is no way to put them in the order correspondent to the game rules. In this situation, he wants to reverse some words from his set so that: the final set of $n$ words still contains different words (i.e. all words are unique); there is a way to put all words of the final set of words in the order so that the final sequence of $n$ words is consistent with the game rules. \n\nPolycarp wants to reverse minimal number of words. Please, help him.\n\n\n-----Input-----\n\nThe first line of the input contains one integer $t$ ($1 \\le t \\le 10^4$) — the number of test cases in the input. Then $t$ test cases follow.\n\nThe first line of a test case contains one integer $n$ ($1 \\le n \\le 2\\cdot10^5$) — the number of words in the Polycarp's set. Next $n$ lines contain these words. All of $n$ words aren't empty and contains only characters '0' and '1'. The sum of word lengths doesn't exceed $4\\cdot10^6$. All words are different.\n\nGuaranteed, that the sum of $n$ for all test cases in the input doesn't exceed $2\\cdot10^5$. Also, guaranteed that the sum of word lengths for all test cases in the input doesn't exceed $4\\cdot10^6$.\n\n\n-----Output-----\n\nPrint answer for all of $t$ test cases in the order they appear.\n\nIf there is no answer for the test case, print -1. Otherwise, the first line of the output should contain $k$ ($0 \\le k \\le n$) — the minimal number of words in the set which should be reversed. The second line of the output should contain $k$ distinct integers — the indexes of the words in the set which should be reversed. Words are numerated from $1$ to $n$ in the order they appear. If $k=0$ you can skip this line (or you can print an empty line). If there are many answers you can print any of them.\n\n\n-----Example-----\nInput\n4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n\nOutput\n1\n3 \n-1\n0\n\n2\n1 2"
}
]
},
"verifier_metadata": {
"problem_id": "c69268d8bdb4da0685d7b187c88296c1",
"unit_tests": {
"inputs": ["4\n4\n0001\n1000\n0011\n0111\n3\n010\n101\n0\n2\n00000\n00001\n4\n01\n001\n0001\n00001\n"],
"outputs": ["1\n3 \n-1\n0\n\n2\n1 2 \n"]
}
}
}
```
We use the LiveCodeBench execution code.

### Example of rollouts and usage

```bash
config_paths="responses_api_agents/simple_agent/configs/simple_agent.yaml,\
responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/comp_coding/configs/comp_coding.yaml"

# Running the server
ng_run "+config_paths=[$config_paths]" \
+simple_agent.responses_api_agents.simple_agent.resources_server.name=comp_coding

# Prepare example data for validation
ng_prepare_data "+config_paths=[$config_paths]" \
+output_dirpath=resources_servers/comp_coding/data/ \
+mode=example_validation

# Download train data from gitlab model registry
ng_download_dataset_from_gitlab \
+dataset_name=comp_coding \
+version=0.1.1 \
+run_id=5a1167ef-3533-486f-9c0e-49d1e97fc887 \
+artifact_fpath=train.jsonl \
+output_fpath=resources_servers/comp_coding/data/train.jsonl
config_paths="responses_api_models/openai_model/configs/openai_model.yaml,\
resources_servers/comp_coding/configs/comp_coding.yaml"
ng_run "+config_paths=[$config_paths]"

# Collect rollouts from example problems
ng_collect_rollouts +agent_name=comp_coding_simple_agent \
Expand All @@ -77,18 +36,5 @@ ng_collect_rollouts +agent_name=comp_coding_simple_agent \
+limit=null
```

### Optional data preperation/validation scripts

```bash
# Build training dataset from collected examples
uv run python resources_servers/comp_coding/scripts/build_examples.py \
--out resources_servers/comp_coding/data/train.jsonl \
--split train[:5000]

# Validate and pre-process train dataset
uv run python resources_servers/comp_coding/scripts/validate_dataset.py \
--in data/comp_coding/train.jsonl --fail-fast
```

## Licensing information
Apache 2.0
20 changes: 16 additions & 4 deletions resources_servers/comp_coding/lcb_integration/testing_util.py
Original file line number Diff line number Diff line change
Expand Up @@ -440,7 +440,7 @@ def run_test(sample, test=None, debug=False, timeout=6):

# Disable functionalities that can make destructive changes to the test.
# max memory is set to 4GB
reliability_guard()
reliability_guard(4 * 1024**3)

if debug:
print(f"start = {datetime.now().time()}")
Expand Down Expand Up @@ -511,6 +511,17 @@ def run_test(sample, test=None, debug=False, timeout=6):
signal.alarm(0)


def _set_resource_limit(resource_type: str, target_soft_limit):
import resource

current_soft, current_hard = resource.getrlimit(resource_type)
if current_soft < target_soft_limit:
try:
resource.setrlimit(resource_type, (target_soft_limit, current_hard))
except ValueError as e:
print(repr(e))


def reliability_guard(maximum_memory_bytes=None):
"""
This disables various destructive functions and prevents the generated code
Expand All @@ -526,10 +537,11 @@ def reliability_guard(maximum_memory_bytes=None):
if maximum_memory_bytes is not None:
import resource

resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, maximum_memory_bytes))
resource.setrlimit(resource.RLIMIT_DATA, (maximum_memory_bytes, maximum_memory_bytes))
_set_resource_limit(resource.RLIMIT_AS, maximum_memory_bytes)
_set_resource_limit(resource.RLIMIT_DATA, maximum_memory_bytes)

if not platform.uname().system == "Darwin":
resource.setrlimit(resource.RLIMIT_STACK, (maximum_memory_bytes, maximum_memory_bytes))
_set_resource_limit(resource.RLIMIT_STACK, maximum_memory_bytes)

faulthandler.disable()

Expand Down
Loading