Add --download option to run_regression.py (#1893)
Option automatically downloads the corpus from our servers.
lintool authored Jun 3, 2022
1 parent dd5f2e8 commit 236b386
Showing 75 changed files with 1,060 additions and 346 deletions.
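
The commit message describes `--download` as one more stage switch alongside the existing `--index`, `--verify`, and `--search` flags. As a rough illustration of how such independent switches can be combined, here is a minimal sketch. It is not the actual `run_regression.py`: the flag names come from the commands in this commit, while `build_parser`, `plan_stages`, and the stage ordering are assumptions made for the example.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser exposing the flags that appear in the commands in this commit."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch only; not the real run_regression.py."
    )
    parser.add_argument("--regression", required=True,
                        help="regression name, e.g. dl19-passage-unicoil")
    parser.add_argument("--corpus-path", default=None,
                        help="path to a corpus that was downloaded manually")
    # Each stage is an independent on/off switch.
    for stage in ("download", "index", "verify", "search"):
        parser.add_argument(f"--{stage}", action="store_true",
                            help=f"run the {stage} stage")
    return parser


def plan_stages(args: argparse.Namespace) -> List[str]:
    """Return the requested stages in an assumed order: download, index, verify, search."""
    order = ("download", "index", "verify", "search")
    return [stage for stage in order if getattr(args, stage)]


# Mirrors the flags of the end-to-end command added to the docs in this commit.
args = build_parser().parse_args(
    ["--download", "--index", "--verify", "--search",
     "--regression", "dl19-passage-unicoil"]
)
print(plan_stages(args))  # ['download', 'index', 'verify', 'search']
```

Keeping each stage behind its own boolean flag means a run can skip the download (for example, when `--corpus-path` points at a manually fetched corpus) without any change to the other stages.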
16 changes: 8 additions & 8 deletions README.md
@@ -62,8 +62,8 @@ See individual pages for details!
| doc2query | [+](docs/regressions-msmarco-passage-doc2query.md) |
| doc2query-T5 | [+](docs/regressions-msmarco-passage-docTTTTTquery.md) | [+](docs/regressions-dl19-passage-docTTTTTquery.md) | [+](docs/regressions-dl20-passage-docTTTTTquery.md) |
| **Learned sparse lexical (uniCOIL family)** |
| uniCOIL noexp | [+](docs/regressions-msmarco-passage-unicoil-noexp.md) | [+](docs/regressions-dl19-passage-unicoil-noexp.md) | [+](docs/regressions-dl20-passage-unicoil-noexp.md) |
| uniCOIL with doc2query-T5 | [+](docs/regressions-msmarco-passage-unicoil.md) | [+](docs/regressions-dl19-passage-unicoil.md) | [+](docs/regressions-dl20-passage-unicoil.md) |
| uniCOIL noexp | [](docs/regressions-msmarco-passage-unicoil-noexp.md) | [](docs/regressions-dl19-passage-unicoil-noexp.md) | [](docs/regressions-dl20-passage-unicoil-noexp.md) |
| uniCOIL with doc2query-T5 | [](docs/regressions-msmarco-passage-unicoil.md) | [](docs/regressions-dl19-passage-unicoil.md) | [](docs/regressions-dl20-passage-unicoil.md) |
| uniCOIL with TILDE | [+](docs/regressions-msmarco-passage-unicoil-tilde-expansion.md) |
| **Learned sparse lexical (other)** |
| DeepImpact | [+](docs/regressions-msmarco-passage-deepimpact.md) |
@@ -83,8 +83,8 @@ See individual pages for details!
| WP baselines | [+](docs/regressions-msmarco-doc-segmented-wp.md) | [+](docs/regressions-dl19-doc-segmented-wp.md) | [+](docs/regressions-dl20-doc-segmented-wp.md) |
| doc2query-T5 | [+](docs/regressions-msmarco-doc-segmented-docTTTTTquery.md) | [+](docs/regressions-dl19-doc-segmented-docTTTTTquery.md) | [+](docs/regressions-dl20-doc-segmented-docTTTTTquery.md) |
| **Learned sparse lexical** |
| uniCOIL noexp | [+](docs/regressions-msmarco-doc-segmented-unicoil-noexp.md) | [+](docs/regressions-dl19-doc-segmented-unicoil-noexp.md) | [+](docs/regressions-dl20-doc-segmented-unicoil-noexp.md) |
| uniCOIL with doc2query-T5 | [+](docs/regressions-msmarco-doc-segmented-unicoil.md) | [+](docs/regressions-dl19-doc-segmented-unicoil.md) | [+](docs/regressions-dl20-doc-segmented-unicoil.md) |
| uniCOIL noexp | [](docs/regressions-msmarco-doc-segmented-unicoil-noexp.md) | [](docs/regressions-dl19-doc-segmented-unicoil-noexp.md) | [](docs/regressions-dl20-doc-segmented-unicoil-noexp.md) |
| uniCOIL with doc2query-T5 | [](docs/regressions-msmarco-doc-segmented-unicoil.md) | [](docs/regressions-dl19-doc-segmented-unicoil.md) | [](docs/regressions-dl20-doc-segmented-unicoil.md) |

### MS MARCO V2 Passage Corpus

@@ -97,8 +97,8 @@ See individual pages for details!
| baselines | [+](docs/regressions-msmarco-v2-passage-augmented.md) | [+](docs/regressions-dl21-passage-augmented.md) |
| doc2query-T5 | [+](docs/regressions-msmarco-v2-passage-augmented-d2q-t5.md) | [+](docs/regressions-dl21-passage-augmented-d2q-t5.md) |
| **Learned sparse lexical** |
| uniCOIL noexp zero-shot | [+](docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md) | [+](docs/regressions-dl21-passage-unicoil-noexp-0shot.md) |
| uniCOIL with doc2query-T5 zero-shot | [+](docs/regressions-msmarco-v2-passage-unicoil-0shot.md) | [+](docs/regressions-dl21-passage-unicoil-0shot.md) |
| uniCOIL noexp zero-shot | [](docs/regressions-msmarco-v2-passage-unicoil-noexp-0shot.md) | [](docs/regressions-dl21-passage-unicoil-noexp-0shot.md) |
| uniCOIL with doc2query-T5 zero-shot | [](docs/regressions-msmarco-v2-passage-unicoil-0shot.md) | [](docs/regressions-dl21-passage-unicoil-0shot.md) |

### MS MARCO V2 Document Corpus

@@ -111,8 +111,8 @@ See individual pages for details!
| baselines | [+](docs/regressions-msmarco-v2-doc-segmented.md) | [+](docs/regressions-dl21-doc-segmented.md) |
| doc2query-T5 | [+](docs/regressions-msmarco-v2-doc-segmented-d2q-t5.md) | [+](docs/regressions-dl21-doc-segmented-d2q-t5.md) |
| **Learned sparse lexical** |
| uniCOIL noexp zero-shot | [+](docs/regressions-msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2.md) | [+](docs/regressions-dl21-doc-segmented-unicoil-noexp-0shot-v2.md) |
| uniCOIL with doc2query-T5 zero-shot | [+](docs/regressions-msmarco-v2-doc-segmented-unicoil-0shot-v2.md) | [+](docs/regressions-dl21-doc-segmented-unicoil-0shot-v2.md) |
| uniCOIL noexp zero-shot | [](docs/regressions-msmarco-v2-doc-segmented-unicoil-noexp-0shot-v2.md) | [](docs/regressions-dl21-doc-segmented-unicoil-noexp-0shot-v2.md) |
| uniCOIL with doc2query-T5 zero-shot | [](docs/regressions-msmarco-v2-doc-segmented-unicoil-0shot-v2.md) | [](docs/regressions-dl21-doc-segmented-unicoil-0shot-v2.md) |

### Regressions for BEIR (v1.0.0)

20 changes: 12 additions & 8 deletions docs/regressions-dl19-doc-segmented-unicoil-noexp.md
@@ -20,11 +20,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc-segmented-unicoil-noexp
```

## Corpus Download

We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO document corpus that has already been processed with uniCOIL, i.e., we have performed model inference on every document and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl19-doc-segmented-unicoil-noexp
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -34,16 +41,13 @@ tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
```

To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.
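
The checksum comparison described above can also be scripted. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of_file` and `matches_checksum` are hypothetical, and the file is read in chunks because the tarball is too large to load into memory comfortably.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()


# File name and checksum as stated in the documentation above.
tar_path = Path("collections/msmarco-doc-segmented-unicoil-noexp.tar")
if tar_path.exists():
    ok = matches_checksum(tar_path, "11b226e1cacd9c8ae0a660fd14cdd710")
    print("checksum OK" if ok else "checksum mismatch")
```

A mismatch usually points to an interrupted or corrupted download, in which case fetching the tarball again is the simplest remedy.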

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc-segmented-unicoil-noexp \
--corpus-path collections/msmarco-doc-segmented-unicoil-noexp
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

20 changes: 12 additions & 8 deletions docs/regressions-dl19-doc-segmented-unicoil.md
@@ -20,11 +20,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc-segmented-unicoil
```

## Corpus Download

We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO document corpus that has already been processed with uniCOIL, i.e., we have applied doc2query-T5 expansions, performed model inference on every document, and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl19-doc-segmented-unicoil
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -34,16 +41,13 @@ tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
```

To confirm, `msmarco-doc-segmented-unicoil.tar` is 19 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-doc-segmented-unicoil \
--corpus-path collections/msmarco-doc-segmented-unicoil
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

20 changes: 12 additions & 8 deletions docs/regressions-dl19-passage-unicoil-noexp.md
@@ -22,11 +22,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-unicoil-noexp
```

## Corpus Download

We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., we have performed model inference on every document and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage-unicoil-noexp
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -36,16 +43,13 @@ tar xvf collections/msmarco-passage-unicoil-noexp.tar -C collections/
```

To confirm, `msmarco-passage-unicoil-noexp.tar` is 2.7 GB and has MD5 checksum `f17ddd8c7c00ff121c3c3b147d2e17d8`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-unicoil-noexp \
--corpus-path collections/msmarco-passage-unicoil-noexp
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

20 changes: 12 additions & 8 deletions docs/regressions-dl19-passage-unicoil.md
@@ -22,11 +22,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-unicoil
```

## Corpus Download

We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., we have applied doc2query-T5 expansions, performed model inference on every document, and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl19-passage-unicoil
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -36,16 +43,13 @@ tar xvf collections/msmarco-passage-unicoil.tar -C collections/
```

To confirm, `msmarco-passage-unicoil.tar` is 3.4 GB and has MD5 checksum `78eef752c78c8691f7d61600ceed306f`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl19-passage-unicoil \
--corpus-path collections/msmarco-passage-unicoil
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

20 changes: 12 additions & 8 deletions docs/regressions-dl20-doc-segmented-unicoil-noexp.md
@@ -20,11 +20,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc-segmented-unicoil-noexp
```

## Corpus Download

We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO document corpus that has already been processed with uniCOIL, i.e., we have performed model inference on every document and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl20-doc-segmented-unicoil-noexp
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -34,16 +41,13 @@ tar xvf collections/msmarco-doc-segmented-unicoil-noexp.tar -C collections/
```

To confirm, `msmarco-doc-segmented-unicoil-noexp.tar` is 11 GB and has MD5 checksum `11b226e1cacd9c8ae0a660fd14cdd710`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc-segmented-unicoil-noexp \
--corpus-path collections/msmarco-doc-segmented-unicoil-noexp
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

20 changes: 12 additions & 8 deletions docs/regressions-dl20-doc-segmented-unicoil.md
@@ -20,11 +20,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc-segmented-unicoil
```

## Corpus Download

We make available a version of the MS MARCO segmented document corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO document corpus that has already been processed with uniCOIL, i.e., we have applied doc2query-T5 expansions, performed model inference on every document, and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl20-doc-segmented-unicoil
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -34,16 +41,13 @@ tar xvf collections/msmarco-doc-segmented-unicoil.tar -C collections/
```

To confirm, `msmarco-doc-segmented-unicoil.tar` is 19 GB and has MD5 checksum `6a00e2c0c375cb1e52c83ae5ac377ebb`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl20-doc-segmented-unicoil \
--corpus-path collections/msmarco-doc-segmented-unicoil
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command:

20 changes: 12 additions & 8 deletions docs/regressions-dl20-passage-unicoil-noexp.md
@@ -22,11 +22,18 @@ From one of our Waterloo servers (e.g., `orca`), the following command will perf
python src/main/python/run_regression.py --index --verify --search --regression dl20-passage-unicoil-noexp
```

## Corpus Download

We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., gone through document expansion and term reweighting.
We make available a version of the MS MARCO passage corpus that has already been processed with uniCOIL, i.e., we have performed model inference on every document and stored the output sparse vectors.
Thus, no neural inference is involved.
For details on how to train uniCOIL and perform inference, please see [this guide](https://github.com/luyug/COIL/tree/main/uniCOIL).

From any machine, the following command will download the corpus and perform the complete regression, end to end:

```bash
python src/main/python/run_regression.py --download --index --verify --search --regression dl20-passage-unicoil-noexp
```

The `run_regression.py` script automates the following steps, but if you want to perform each step manually, simply copy/paste from the commands below and you'll obtain the same regression results.

## Corpus Download

Download the corpus and unpack into `collections/`:

@@ -36,16 +43,13 @@ tar xvf collections/msmarco-passage-unicoil-noexp.tar -C collections/
```

To confirm, `msmarco-passage-unicoil-noexp.tar` is 2.7 GB and has MD5 checksum `f17ddd8c7c00ff121c3c3b147d2e17d8`.

With the corpus downloaded, the following command will perform the complete regression, end to end, on any machine:
With the corpus downloaded, the following command will perform the remaining steps below:

```bash
python src/main/python/run_regression.py --index --verify --search --regression dl20-passage-unicoil-noexp \
--corpus-path collections/msmarco-passage-unicoil-noexp
```

Alternatively, you can simply copy/paste from the commands below and obtain the same results.

## Indexing

Sample indexing command: