feat: add cli annotation subcommand #2539

Merged
merged 63 commits into from
Jul 29, 2022
1cfd810
feat: skeleton code for a proposed annotate cli subcmd
atolopko-czi May 12, 2022
9dcd230
format: fmt, lint
atolopko-czi May 12, 2022
39a74a0
add basic cell type annotation logic
atolopko-czi May 18, 2022
cfcfd00
enable cli annotate subcmd
atolopko-czi May 19, 2022
22b008f
add cli annotate tests
atolopko-czi May 19, 2022
2fdcfad
more cli annotate tests
atolopko-czi May 20, 2022
e8e3b21
make query and ref dataset genes congruent
atolopko-czi May 24, 2022
3acbb76
rm todo
atolopko-czi May 24, 2022
cf577c5
cli tweaks, test fix
atolopko-czi May 25, 2022
958055c
feat: add annotate cmd --min-common-gene-pct option
atolopko-czi May 25, 2022
e103533
add progress output to annotate cmd
atolopko-czi May 25, 2022
6f445ff
todo
atolopko-czi May 25, 2022
d4696c3
write annotate prediction metadata to h5ad
atolopko-czi Jun 1, 2022
9e462d2
add test
atolopko-czi Jun 1, 2022
dfb2372
add caching of downloaded model files
atolopko-czi Jun 1, 2022
040f333
Merge branch 'main' of github.com:chanzuckerberg/cellxgene into atolo…
atolopko-czi Jun 8, 2022
1c5be30
overhaul for use with new trained models
atolopko-czi Jun 9, 2022
4ab4cf4
refactoring
atolopko-czi Jun 9, 2022
019550f
add umap for query dataset in ref latent space
atolopko-czi Jun 9, 2022
24d4a7f
grammar
atolopko-czi Jun 9, 2022
92bca62
cleanup
atolopko-czi Jun 9, 2022
5969e27
fix remote model file retrieval
atolopko-czi Jun 14, 2022
b10e86d
pkg fixes
atolopko-czi Jun 14, 2022
f512b5c
pip requirements updates for annotate
atolopko-czi Jun 15, 2022
d90bd34
pip requirements fixes for annotate
atolopko-czi Jun 15, 2022
aca107f
add pycharm run config for annotation cmd
atolopko-czi Jun 24, 2022
baa12af
annotate --use-gpu option
atolopko-czi Jun 24, 2022
9f49958
pip pkg updates, M1 mac support
atolopko-czi Jun 28, 2022
4227e35
Merge branch 'main' of github.com:chanzuckerberg/cellxgene into atolo…
atolopko-czi Jul 19, 2022
ee581a5
retrieve and cache mlflow model as archive file
atolopko-czi Jul 20, 2022
f4e0015
mlflow prediction via cli, for managed python env
atolopko-czi Jul 21, 2022
a22f8c7
tail mlflow process output; use csv for arg passing
atolopko-czi Jul 21, 2022
9d55a4b
annotate cleanup and minor subprocess fix
atolopko-czi Jul 22, 2022
8a91543
cleanup, revert unnecessary changes, fix build
atolopko-czi Jul 22, 2022
055d4fd
simplified cli annotate testing to a single "happy path" test
atolopko-czi Jul 22, 2022
7944081
TODOs, launch_and_open script
atolopko-czi Jul 25, 2022
0657b51
rename
atolopko-czi Jul 25, 2022
11acfa6
fmt, lint
atolopko-czi Jul 25, 2022
ab54218
install python via pyenv, for mlflow
atolopko-czi Jul 25, 2022
993c109
install python via pyenv, for mlflow
atolopko-czi Jul 25, 2022
a53b28d
install python via pyenv, for mlflow
atolopko-czi Jul 25, 2022
95e6234
Merge branch 'atolopko/2518-cell-type-prediction-cli-cmd' of github.c…
atolopko-czi Jul 25, 2022
47d0844
install virtualenv, for mlflow
atolopko-czi Jul 25, 2022
a6208a9
debug GHA unit test failures
atolopko-czi Jul 25, 2022
d4402b8
debug GHA unit test failures
atolopko-czi Jul 25, 2022
9874acd
skip failing tests for favicon
atolopko-czi Jul 25, 2022
c7d3829
reinstate gha jobs
atolopko-czi Jul 25, 2022
6a7bf42
longer timeout for annotation smoke tests
atolopko-czi Jul 26, 2022
37e5e40
fix embedding selection
atolopko-czi Jul 26, 2022
23db9a9
cleanup & tweak test code
atolopko-czi Jul 27, 2022
bac80b9
use conda for mlflow python env
atolopko-czi Jul 27, 2022
dc50cc1
add annotate cli tests
atolopko-czi Jul 27, 2022
c1d7ee7
add scanpy pkg for annotate subcmd
atolopko-czi Jul 28, 2022
a556b8a
test fix
atolopko-czi Jul 28, 2022
1f7469b
support conda env for mlflow, fix classifer opt default value
atolopko-czi Jul 28, 2022
60f86fe
cli fixes
atolopko-czi Jul 28, 2022
5103137
friendlier output on cli annotate model failure
atolopko-czi Jul 28, 2022
0d5b12f
Merge branch 'atolopko/2535-fix-embedding-selection' into atolopko/25…
atolopko-czi Jul 28, 2022
7370cbd
Merge branch 'main' into atolopko/2518-cell-type-prediction-cli-cmd
atolopko-czi Jul 28, 2022
d835ca9
fmt
atolopko-czi Jul 28, 2022
038c3f9
Merge branch 'main' of github.com:chanzuckerberg/cellxgene into atolo…
atolopko-czi Jul 28, 2022
0118cfa
Merge branch 'main' into atolopko/2518-cell-type-prediction-cli-cmd
atolopko-czi Jul 29, 2022
bcd42f3
lint
atolopko-czi Jul 29, 2022
8 changes: 5 additions & 3 deletions .github/workflows/push_tests.yml
@@ -46,10 +46,12 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.7
uses: actions/setup-python@v4
- name: Set up Python 3.7 (pyenv) # pyenv needed for mlflow in cli annotate tests
uses: gabrielfalcao/pyenv-action@v9
with:
python-version: 3.7
default: 3.7
command: pip install -U pip # upgrade pip after installing python
- run: pip install virtualenv # virtualenv needed for mlflow in cli annotate tests
- name: Python cache
uses: actions/cache@v1
with:
3 changes: 3 additions & 0 deletions .gitignore
@@ -54,3 +54,6 @@ client/.eslintcache

# E2E Testing
ignoreE2E*

# annotate subcmd
.models_cache
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -3,5 +3,6 @@ recursive-include server/common/web/static *

include server/requirements.txt
include server/requirements-prepare.txt
include server/requirements-annotate.txt
include server/converters/schema/hgnc_complete_set.txt.gz
include server/converters/schema/schema_definitions/*
16 changes: 16 additions & 0 deletions scripts/launch_and_open
@@ -0,0 +1,16 @@
#!/usr/bin/expect -f

# Mac only! (depends upon `open` command)

set h5ad [lindex $argv 0]
puts "$h5ad"

spawn cellxgene launch $h5ad

set timeout 10
expect -indices -re "Please go to (http:\/\/localhost:\[0-9\]+)" {
set url $expect_out(1,string)
exec >@stdout 2>@stderr open $url
}

interact
Empty file added server/annotate/__init__.py
Empty file.
5 changes: 5 additions & 0 deletions server/annotate/annotation_types.py
@@ -0,0 +1,5 @@
from enum import Enum


class AnnotationType(Enum):
    CELL_TYPE = "cell_type"
231 changes: 231 additions & 0 deletions server/cli/annotate.py
@@ -0,0 +1,231 @@
import functools
import json
import os.path
import shlex
import shutil
import subprocess
import sys
from subprocess import STDOUT, PIPE
from tempfile import NamedTemporaryFile

import click
import pandas as pd
from click import BadParameter

from server.annotate.annotation_types import AnnotationType
from server.common.utils.data_locator import DataLocator
from server.common.utils.utils import sort_options


def annotate_args(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)

    return wrapper


@sort_options
@click.command(
short_help="Annotate H5AD file columns. Run `cellxgene annotate --help` for more information.",
options_metavar="<options>",
)
@click.option(
"-i",
"--input-h5ad-file",
required=True,
type=str,
help="The input H5AD file containing the missing annotations.",
)
@click.option(
"-m",
"--model-url",
required=True,
help="The URL of the model used to predict annotation labels. May be a local filesystem directory "
"or S3 path (s3://)",
)
@click.option(
"-l",
"--counts-layer",
help="If specified, raw counts will be read from the AnnData layer of the specified name. If unspecified, "
Contributor

Let's say users have raw counts in raw.X and processed counts in X, and users wish to specify the processed counts. Is X available as an entry in adata.layers?

Contributor Author

adata.X will be used per this logic

"raw counts will be read from `X` matrix, unless 'raw.X' exists, in which case that will be used.",
)
@click.option(
"-g",
"--gene-column-name",
help="The name of the `var` column that contains gene identifiers. The values in this column will be used to match "
"genes between the query and reference datasets. If not specified, the gene identifiers are expected to exist "
"in `var.index`.",
)
# TODO: Useful if we want to support discoverability of models
# @click.option(
# "-r",
# "--model-repository",
# help="The base URL of the model repository. May be a local filesystem directory or S3 path (s3://)"
# )
# TODO: Useful if we want to support other, future annotation types, beyond "Cell Type". Currently hidden
@click.option(
"-a",
"--annotation-type",
type=click.Choice([t.value for t in AnnotationType]),
default=AnnotationType.CELL_TYPE.value,
show_default=True,
hidden=True, # Remove if we add support for more annotation types
help="The type of annotation to perform. The model to be used will be inferred from the annotation type.",
)
@click.option(
"-c",
"--annotation-prefix",
type=str,
default="cxg",
show_default=True,
help="An optional prefix used to form the names of: 1) new `obs` annotation columns that will store the predicted "
"annotation values and confidence scores, 2) `obsm` embeddings (reference and umap embedding), and "
"3) `uns` metadata for the prediction operation",
)
@click.option(
"-n",
"--run-name",
type=str,
help="An optional run name that will be used as a suffix to form the names of new `obs` annotation columns that "
"will store the predicted annotation values and confidence scores. This can be used to allow multiple "
"annotation predictions to be run on a single AnnData object.",
)
@click.option(
"-u",
"--update-h5ad-file",
is_flag=True,
help="Flag indicating whether to update the input h5ad file with annotation values. This option is mutually "
"exclusive with --output-h5ad-file.",
)
@click.option(
"-o",
"--output-h5ad-file",
help="The output H5AD file that will contain the generated annotation values. This option is mutually "
"exclusive with --update-h5ad-file.",
)
@click.option("--use-model-cache/--no-use-model-cache", default=True)
@click.option(
"--use-gpu/--no-use-gpu",
default=True,
help="Whether to use a GPU for annotation operations (highly recommended, if available).",
)
# TODO: This is a cell type model-specific arg, so not ideal to specify here as a hardcoded option
@click.option(
"--classifier",
default="default",
help="For cell type annotation, the classifier level to use. The classifier is model-dependent, so refer to "
"documentation for the specified model for valid values.",
)
# TODO: This is a cell type model-specific arg, so not ideal to specify here as a hardcoded option
@click.option(
"--organism",
type=click.Choice(["Homo sapiens", "Mus musculus"], case_sensitive=True),
default="Homo sapiens",
help="For cell type annotation, the organism of the dataset. Used to normalize gene names to HGNC conventions when "
"an annotation model has been trained using data from a different organism.",
)
@click.option(
"--model-cache-dir",
default=".models_cache",
help="Local directory used to store model files that are retrieved from a remote location. Model files will "
"be read from this directory first, if they exist, to avoid repeating large downloads.",
)
@click.option(
"--mlflow-env-manager",
type=click.Choice(["virtualenv", "conda", "local"]),
default="virtualenv",
help="Annotation model prediction will be installed and executed in the specified type of environment. MacOS users "
"on Apple Silicon (arm64, M1, M2, etc.) are recommended to use 'conda' to avoid Python package installation "
"errors. If 'conda' is specified then cellxgene must also have been installed within a conda environment",
)
@click.help_option("--help", "-h", help="Show this message and exit.")
def annotate(**cli_args):
    _validate_options(cli_args)

    print(f"Reading query dataset {cli_args['input_h5ad_file']}...")

    annotation_prefix = "_".join(
        filter(None, [cli_args.get("annotation_prefix"), cli_args.get("annotation_type"), cli_args.get("run_name")])
    )

    output_h5ad_file = cli_args["input_h5ad_file"] if cli_args["update_h5ad_file"] else cli_args["output_h5ad_file"]

    model_url = cli_args.get("model_url")
    local_model_path = _retrieve_model(cli_args.get("model_cache_dir"), model_url, cli_args.get("use_model_cache"))

    print(f"Annotating {cli_args.get('input_h5ad_file')} with {cli_args.get('annotation_type')}...")

    if cli_args["annotation_type"] == AnnotationType.CELL_TYPE.value:
        predict_args = dict(
            query_dataset_h5ad_path=cli_args.get("input_h5ad_file"),
            output_h5ad_path=output_h5ad_file,
            annotation_prefix=annotation_prefix,
            counts_layer=cli_args.get("counts_layer"),
            gene_column_name=cli_args.get("gene_column_name"),
            classifier=cli_args.get("classifier"),
            organism=cli_args.get("organism"),
            use_gpu=cli_args.get("use_gpu"),
        )
        # Drop args that have values of `None` as these will cause problems when passing into MLflow predict, since it
        # ultimately gets converted into 1-row Pandas DataFrame (None is interpreted as a float type column!)
        predict_args = dict([(k, v) for k, v in predict_args.items() if v is not None])

        # Invoke prediction using MLflow cli, as a separate process.
        # This fully prepares the Python environment that is needed for executing the model.
        # The Python environment will be reused after it is set up once.
        with NamedTemporaryFile(buffering=0) as predict_args_file:
            # write the mlflow predict arguments to a csv file, which will be passed to mlflow cmd
            pd.DataFrame([json.dumps(predict_args)]).to_csv(predict_args_file, index=None)
            predict_args_file.seek(0)

            # run mlflow prediction in subprocess
            predict_cmd = (
                f"mlflow models predict "
                f"--env-manager {cli_args['mlflow_env_manager']} "
                f"--model-uri {local_model_path} "
                f"--content-type csv --input-path {predict_args_file.name}"
            )
            p = subprocess.Popen(
                args=shlex.split(predict_cmd), stdin=predict_args_file, text=True, bufsize=0, stdout=PIPE, stderr=STDOUT
            )

            # display mlflow process output as it runs
            for line in p.stdout:
                print(line.rstrip())

            p.wait()
            if p.returncode == 0:
                # report the resolved output path; with --update-h5ad-file, `output_h5ad_file` in cli_args is None
                print(f"Wrote annotations to {output_h5ad_file}")
Contributor

Where is the h5ad file actually getting written? Does mlflow models predict support h5ad files natively?

Contributor Author

The model itself is writing out the h5ad file, so the output is ignored. For those curious the h5ad writing is done here.

            else:
                print("Annotation failed!")
    else:
        raise BadParameter(f"unknown annotation type {cli_args['annotation_type']}")


def _retrieve_model(model_cache_dir, model_url, use_cache=True):
    local_cache_model_path = os.path.join(model_cache_dir, os.path.splitext(os.path.basename(model_url))[0])
    if not os.path.exists(local_cache_model_path) or not use_cache:
        print(f"Retrieving model from {model_url}")
        # download from remote source
        with DataLocator(model_url).local_handle() as model_archive_local_path:
            # unpack archive to local cache dir
            shutil.unpack_archive(model_archive_local_path, local_cache_model_path)
    else:
        print(f"Using cached model at {local_cache_model_path}")

    return local_cache_model_path
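
The cache location computed in `_retrieve_model` can be illustrated in isolation. This is a minimal sketch that only mirrors the path expression from the diff; the helper name `cache_path_for` and the example URLs are hypothetical. It suggests that `os.path.splitext` strips just the final extension, so a `.tar.gz` archive would be cached under a directory name ending in `.tar`.

```python
import os.path


def cache_path_for(model_cache_dir: str, model_url: str) -> str:
    """Hypothetical helper mirroring the cache path expression in _retrieve_model."""
    archive_name = os.path.basename(model_url)  # e.g. "cell_type_v1.zip"
    model_name = os.path.splitext(archive_name)[0]  # strips only the last extension
    return os.path.join(model_cache_dir, model_name)


# Example URLs are made up for illustration.
print(cache_path_for(".models_cache", "s3://example-bucket/models/cell_type_v1.zip"))
print(cache_path_for(".models_cache", "/tmp/models/cell_type_v1.tar.gz"))
```

Because the cache key is derived from the archive file name alone, two different URLs that end in the same file name would presumably resolve to the same cache directory.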


def _validate_options(cli_args):
    # TODO(atolopko): Use cloup library for this logic
    if cli_args["update_h5ad_file"] and cli_args["output_h5ad_file"]:
        click.echo("--update-h5ad-file and --output-h5ad-file are mutually exclusive")
        sys.exit(1)
    if not (cli_args["update_h5ad_file"] or cli_args["output_h5ad_file"]):
        click.echo("--update-h5ad-file or --output-h5ad-file must be specified")
        sys.exit(1)


if __name__ == "__main__":
    annotate()
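
How the annotation prefix and the output path are derived from the CLI options can be sketched outside of click. The function names `build_annotation_prefix` and `resolve_output_path` are hypothetical; the logic is a restatement of the two expressions at the top of `annotate()`, not a separate API in this PR.

```python
from typing import Optional


def build_annotation_prefix(prefix: Optional[str], annotation_type: Optional[str], run_name: Optional[str]) -> str:
    """Join the non-empty parts with underscores, as annotate() does."""
    # filter(None, ...) drops None and empty strings before joining
    return "_".join(filter(None, [prefix, annotation_type, run_name]))


def resolve_output_path(input_h5ad: str, update_in_place: bool, output_h5ad: Optional[str]) -> Optional[str]:
    """With --update-h5ad-file the input file is overwritten; otherwise --output-h5ad-file is used."""
    return input_h5ad if update_in_place else output_h5ad


print(build_annotation_prefix("cxg", "cell_type", None))
print(build_annotation_prefix("cxg", "cell_type", "run1"))
print(resolve_output_path("query.h5ad", True, None))
```

With the default options, the prefix would be `cxg_cell_type`, and passing `--run-name run1` would yield `cxg_cell_type_run1`, which is what allows several prediction runs to coexist in one AnnData object.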
2 changes: 2 additions & 0 deletions server/cli/cli.py
@@ -1,5 +1,6 @@
import click

from .annotate import annotate
from .launch import launch
from .prepare import prepare
from .upgrade import log_upgrade_check
@@ -31,4 +32,5 @@ def cli(upgrade_check):


cli.add_command(launch)
cli.add_command(annotate)
cli.add_command(prepare)
2 changes: 2 additions & 0 deletions server/requirements-annotate.txt
@@ -0,0 +1,2 @@
mlflow
scanpy
1 change: 1 addition & 0 deletions server/requirements-dev.txt
@@ -7,3 +7,4 @@ python-jose>=3.2.0
twine>=1.12.1
-r requirements.txt
-r requirements-prepare.txt
-r requirements-annotate.txt
2 changes: 1 addition & 1 deletion server/requirements.txt
@@ -15,7 +15,7 @@ fsspec>=0.4.4,<0.8.0
gunicorn>=20.0.4
h5py>=3.0.0
numba>=0.51.2
numpy>=1.17.5
numpy>=1.17.5,<=1.22
packaging>=20.0
pandas>=1.0,!=1.1 # pandas 1.1 breaks tests, https://github.com/pandas-dev/pandas/issues/35446
PyYAML>=5.4 # CVE-2020-14343
5 changes: 4 additions & 1 deletion setup.py
@@ -9,6 +9,9 @@
with open("server/requirements-prepare.txt") as fh:
requirements_prepare = fh.read().splitlines()

with open("server/requirements-annotate.txt") as fh:
requirements_annotate = fh.read().splitlines()

setup(
name="cellxgene",
version="1.0.1",
@@ -40,5 +43,5 @@
"Topic :: Scientific/Engineering :: Bio-Informatics",
],
entry_points={"console_scripts": ["cellxgene = server.cli.cli:cli"]},
extras_require=dict(prepare=requirements_prepare),
extras_require=dict(prepare=requirements_prepare, annotate=requirements_annotate),
)
20 changes: 20 additions & 0 deletions test/unit/cli/mlflow_model_fixture.py
@@ -0,0 +1,20 @@
import shutil
from tempfile import TemporaryDirectory, mkstemp
from typing import Optional

import mlflow


def write_model(model) -> str:
    with TemporaryDirectory() as mlflow_model_dir:
        mlflow.pyfunc.save_model(mlflow_model_dir, python_model=model)
        return shutil.make_archive(mkstemp()[1], "zip", mlflow_model_dir)


class FakeModel(mlflow.pyfunc.PythonModel):
    def __init__(self, input_to_output: Optional[dict] = None):
        # use None as the default to avoid sharing one mutable dict across instances
        self.input_to_output = input_to_output if input_to_output is not None else {}

    def predict(self, context, model_input) -> None:
        # this stdout output is useful for validating the input in a test, noting that this model will be invoked in a
        # subprocess, so stdout is one means of communicating information back to the test code
        print(f"__MODEL_INPUT__={model_input.iloc[0][0]}")
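
A test could recover the model input from the captured subprocess output roughly as follows. This is a sketch under assumptions: the helper `extract_model_input` is hypothetical, and it assumes the marker line carries the JSON string that `annotate` writes with `json.dumps(predict_args)` and that the line starts with the marker.

```python
import json
from typing import Optional

MODEL_INPUT_MARKER = "__MODEL_INPUT__="


def extract_model_input(captured_output: str) -> Optional[dict]:
    """Return the decoded arguments from the first marker line, or None if no marker is present."""
    for line in captured_output.splitlines():
        if line.startswith(MODEL_INPUT_MARKER):
            # everything after the marker is assumed to be a JSON object
            return json.loads(line[len(MODEL_INPUT_MARKER):])
    return None


# Made-up output standing in for what the CLI would echo from the mlflow subprocess.
sample_output = 'setup log\n__MODEL_INPUT__={"classifier": "default", "use_gpu": false}\ndone\n'
print(extract_model_input(sample_output))
```

Parsing a marker line keeps the test independent of whatever other log lines the mlflow process prints around it.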