Reimplement model config storage backend, download and install modules #4252
Conversation
```python
max_cache_size=config.ram,
max_vram_cache_size=config.vram,
```
These two config values might be 'auto' and not a float
Looks like I never implemented any special handling for auto, nor am I sure what auto was supposed to do. I have reverted these config fields back to plain floats.
```python
class ModelInstallJob(DownloadJobBase):
    """This is a version of DownloadJobBase that has an additional slot for the model key and probe info."""

    model_key: Optional[str] = Field(
        description="After model installation, this field will hold its primary key", default=None
    )
    probe_override: Optional[Dict[str, Any]] = Field(
        description="Keys in this dict will override like-named attributes in the automatic probe info",
        default=None,
    )
```
There are numerous references to the `bytes` and `total_bytes` attributes of this class, but it does not appear to have them (nor does the parent class).
I'm using classes like Java interfaces here. The actual job that is inserted into the queue is one of ModelInstallURLJob, ModelInstallRepoIDJob or ModelInstallPathJob. Each of these has a dual inheritance with ModelInstallJob and one of DownloadJobWithMetadata, DownloadJobRepoID and DownloadJobPath. The bytes and total_bytes fields come from the latter.
I use assertions and isinstance() to make sure that I'm operating on the type of job I think I should be, and mypy isn't complaining.
Here is part of the class hierarchy:
- `DownloadJobBase`: defines the basic queueing stuff, like priority
- `DownloadJobRemoteSource(DownloadJobBase)`: adds information returned by a remote source, including `bytes` and `total_bytes`
- `DownloadJobWithMetadata(DownloadJobRemoteSource)`: adds a `metadata` field for remote sources that provide metadata about the downloaded object
- `ModelInstallJob(DownloadJobBase)`: defines fields for model probing and the model key
- `ModelInstallURLJob(DownloadJobWithMetadata, ModelInstallJob)`: combines the two classes to get metadata, `bytes` and `total_bytes`, plus the model probing and key fields
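For readers skimming the thread, here is a minimal sketch of the mixin arrangement described above; the field names follow the hierarchy listed here, while types and defaults are illustrative rather than copied from the PR:

```python
from typing import Any, Dict, Optional
from pydantic import BaseModel, Field


class DownloadJobBase(BaseModel):
    """Basic queueing fields."""
    id: int = Field(description="Numeric ID of this job")
    priority: int = Field(default=10, description="Queue priority; lower numbers run sooner")


class DownloadJobRemoteSource(DownloadJobBase):
    """Adds progress information reported by a remote source."""
    bytes: int = Field(default=0, description="Bytes downloaded so far")
    total_bytes: int = Field(default=0, description="Total size reported by the remote source")


class DownloadJobWithMetadata(DownloadJobRemoteSource):
    """Adds a metadata field for sources that describe the downloaded object."""
    metadata: Dict[str, Any] = Field(default_factory=dict)


class ModelInstallJob(DownloadJobBase):
    """Adds the model probe and primary-key fields."""
    model_key: Optional[str] = Field(default=None)
    probe_override: Optional[Dict[str, Any]] = Field(default=None)


class ModelInstallURLJob(DownloadJobWithMetadata, ModelInstallJob):
    """URL install job: inherits metadata, bytes/total_bytes, and the model key fields."""
```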
invokeai/app/cli/commands.py
```python
try:
    command_parser = subparsers.add_parser(cmd_name, help=command.__doc__)
except argparse.ArgumentError:
    continue
```
Why do we allow ArgumentErrors?
This was me attempting to fix the CLI, which died at some point. It's nonfunctional in main as well. I got halfway through and decided it wasn't worth it. I shouldn't have been doing this inside the models PR and will revert to the original broken state.
invokeai/app/api/routers/models.py
```python
elif operation == JobControlOperation.CHANGE_PRIORITY and priority_delta is not None:
    job_mgr.change_job_priority(job_id, priority_delta)
else:
    raise ValueError(f"Unknown operation {operation.value}")
```
If called with operation = JobControlOperation.CHANGE_PRIORITY and priority_delta = None, the returned error will be "Unknown operation...", which is incorrect. We should return an error that says something along the lines of: "priority_delta must be set for the CHANGE_PRIORITY operation."
Fixed.
```python
# only tally tensor files
if not file.endswith((".ckpt", ".safetensors", ".bin", ".pt", ".pth")):
    continue
```
Got it.
Thinking aloud: In the future, we may want to use model hashes for the purpose of image provenance traceability. For example, to prove to an auditor that an image was generated with a particular set of models. The current hash implementation does not solve that use case, but that's ok. This hash implementation is designed to determine whether models are likely to be the same.
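As an illustration of the "likely the same model" use case, here is a simplified sketch of a hash that tallies only the tensor files; this is not the PR's `FastModelHash`, just the general idea:

```python
import hashlib
from pathlib import Path

TENSOR_SUFFIXES = {".ckpt", ".safetensors", ".bin", ".pt", ".pth"}


def fast_model_hash(model_dir: Path) -> str:
    """Hash only the tensor files under model_dir, visited in a stable order."""
    digest = hashlib.md5()
    for file in sorted(model_dir.rglob("*")):
        # only tally tensor files; a real implementation would stream rather than read whole files
        if file.suffix not in TENSOR_SUFFIXES:
            continue
        digest.update(file.read_bytes())
    return digest.hexdigest()
```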
RyanJDick
left a comment
Got through another chunk today.
invokeai/app/api_app.py
```python
if app_config.version:
    print(f"InvokeAI version {__version__}")
    return
```
Now that this is here, we can remove the duplicate logic under if __name__ == "__main__": in this file.
nit: Also, I think it makes sense to move this short-circuit to the top of this invoke_api() function - before the unnecessary import and function initialization.
Done.
```python
# something went wrong, so don't leave dangling diffusers model in directory or it will cause a duplicate model error!
if new_diffusers_path:
    rmtree(new_diffusers_path)
```
We should also clean up old_diffusers_path.
```python
# new values to write in
update = info.dict()
update.pop("config")
update["model_format"] = "diffusers"
```
It would be better to access this value via the ModelFormat Enum, rather than a hard-coded string.
```python
# If the internet is not available, then the repo_id tests are skipped, but the single
# URL tests are still run.

session = requests.Session()
```
Currently, we construct this single global mock session for all of the tests. I think it would be better to set up a separate session for each test. This way, when a test fails, it is very clear what setup code is actually relevant.
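A sketch of the per-test setup being suggested, using a pytest fixture so each test builds and configures its own session (the fixture and test names here are illustrative):

```python
import pytest
import requests


@pytest.fixture
def mock_session() -> requests.Session:
    sess = requests.Session()
    # mount whatever mock adapters this particular test needs here,
    # e.g. sess.mount("http://www.civitai.com/", some_test_adapter)
    return sess


def test_handles_404(mock_session, tmp_path):
    # each test receives a freshly built session, so its setup is local and obvious
    ...
```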
```python
with tempfile.TemporaryDirectory() as tmpdir:
    # do we handle 404 and other HTTP errors?
    job = queue.create_download_job(source="http://www.civitai.com/models/broken", destdir=tmpdir)
    queue.join()
    assert job.status == "error"
    assert isinstance(job.error, HTTPError)
    assert str(job.error) == "NOT FOUND"

    # Do we handle missing content length field?
    job = queue.create_download_job(source="http://www.civitai.com/models/missing", destdir=tmpdir)
    queue.join()
    assert job.status == "completed"
    assert job.total_bytes == 0
    assert job.bytes > 0
    assert job.bytes == Path(tmpdir, "missing.txt").stat().st_size

    # Don't let the URL specify a filename with slashes or double dots... (e.g. '../../etc/passwd')
    job = queue.create_download_job(source="http://www.civitai.com/models/malicious", destdir=tmpdir)
    queue.join()
    assert job.status == "completed"
    assert job.destination == Path(tmpdir, "malicious")
    assert Path(tmpdir, "malicious").exists()

    # Nor a destination that would exceed the maximum filename or path length
    job = queue.create_download_job(source="http://www.civitai.com/models/long", destdir=tmpdir)
    queue.join()
    assert job.status == "completed"
    assert job.destination == Path(tmpdir, "long")
    assert Path(tmpdir, "long").exists()

    # create a foreign job which will be invalid for the queue
    bad_job = DownloadJobBase(id=999, source="mock", destination="mock")
    try:
        queue.start_job(bad_job)  # this should fail
        succeeded = True
    except UnknownJobIDException:
        succeeded = False
    assert not succeeded
```
nit: These failure cases could all be separate tests to keep unit test surface area small and facilitate debugging failed tests.
```python
try:
    self._lock.acquire()
    assert isinstance(self._jobs[job.id], DownloadJobBase)
    job.priority += delta
```
I think I might be confused about the intended use pattern.

Is the idea that we would `create_download_job(...)`, then possibly change its priority, then `start_job(...)`? If this is the case, then I don't understand why we would ever want to do this; you could just wait to create the job once you know the desired priority.

If the idea is that we would possibly change its priority after `start_job(...)`, then that is not covered by the current tests, and I don't think the `PriorityQueue` will handle that.
I was under the mistaken impression that changing the sort order of an item that had already been enqueued in a priority queue would change the order in which it is dequeued. However, I just tested this and ended up with the items coming out in a shuffled order.
So I'm going to remove the priority-setting feature entirely. I thought it would be nice to let users move items up and down in the queue, but it isn't needed.
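For the record, the behavior is easy to demonstrate: mutating the sort key of an item already sitting inside a `queue.PriorityQueue` does not re-heapify it, so the dequeue order does not follow the new priorities. A small self-contained experiment (not code from this PR):

```python
import queue
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int
    name: str = field(compare=False)


q: "queue.PriorityQueue[Job]" = queue.PriorityQueue()
jobs = [Job(10, "a"), Job(20, "b"), Job(30, "c")]
for j in jobs:
    q.put(j)

# Try to move "c" to the front after it has already been enqueued...
jobs[2].priority = 1

# ...but the heap was ordered using the old keys, so "a" still comes out first.
print([q.get().name for _ in range(3)])
```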
```python
    queue.prune_jobs()
    assert len(queue.list_jobs()) == 0

    def test_pause_cancel_repo_id():  # this one is tricky because of potential race conditions
```
Fix the indentation here. This test will never run.
fixed
```python
import requests
from requests import HTTPError
from requests_testadapter import TestAdapter
```
I don't love that we're bringing in an unmaintained dependency (hasn't been updated in 10 years, and the Github project has been archived). I think the same thing could be achieved relatively easily with unittest.mock.
I dunno. It was trivial to get requests_testadapter working, but with mock I'm going to have to implement stuff like HTTP header handling.
requests_testadapter is a trivial thing - just 86 non-blank lines. It's still working after 10 years and if it gets broken it will be super-easy to reimplement.
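For context, this is roughly how requests_testadapter is wired up; the URL, payload and headers below are illustrative, not taken from the test suite:

```python
import requests
from requests_testadapter import TestAdapter

session = requests.Session()
# The adapter answers any request whose URL starts with the mounted prefix,
# returning the canned body, status and headers.
session.mount(
    "http://www.civitai.com/models/12345",
    TestAdapter(b"fake model weights", status=200, headers={"Content-Length": "18"}),
)

resp = session.get("http://www.civitai.com/models/12345")
assert resp.status_code == 200
assert resp.content == b"fake model weights"
```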
I've finished reviewing the ModelRecordServiceBase and the ModelLoadServiceBase sections of the documentation. Overall, I'm finding this documentation really helpful and so far I'm liking the service break-down.
I'll try to get to the rest of it soon.
library. If the model is altered within InvokeAI (typically by converting a checkpoint to a diffusers model) the key will remain the same. The `hash` field holds the current hash of the model. It starts out being the same as `key`, but may diverge.
Discussed in-person, but commenting for reference:
I think we should use a random unique ID as the key. I don't want anyone to be tempted to treat the key as a hash. I still think it could be worthwhile to store the original and current hash as metadata properties.
All model metadata classes inherit from this pydantic class. It provides the following fields:

| **Field Name** | **Type** | **Description** |
As discussed in-person, for the initial implementation we will avoid storing any metadata that can later be retrieved from the original source.
Based on this, the new proposed set of metadata fields is:
- key
- name
- model_type
- model_format
- base_model
- path
- source
- original_hash (could be omitted in initial implementation)
- current_hash (could be omitted in initial implementation)
(Let me know if I got this wrong.)
All model metadata classes inherit from this pydantic class. It provides the following fields:

| **Field Name** | **Type** | **Description** |
Most of the proposed metadata fields would apply to any new ML model that we add in the future, with the exception of two: base_model and model_format. I think it's worth thinking through how we plan to handle these long-term (even if we don't decide to change how they're handled in the short term):
base_model: BaseModelType
You described the issues with this field under "Limitations of the Data Model". Long term, I can imagine a wider range of model referencing requirements. E.g.:
- Independent models with no associated base model.
- Models that reference multiple types of other models. Such as IP-Adapters, which need to reference both a CLIP Vision model and a base SD model.
- Models that reference multiple models of the same type.
- Models that reference a specific model that they are compatible with (not just a model type). Again, IP-Adapter is an example.
model_format: BaseModelFormat
Currently, I find there to be a weird relationship between ModelType and ModelFormat. From one perspective, each ModelType should have its own set of ModelFormats. But, we sometimes rely on the fact that 'diffusers' and 'checkpoint' formats are common across multiple ModelTypes.
I've also noticed that our LoRA models don't fully leverage this field. I think this is because LoRAs have a more complicated format hierarchy involving both the file organization format and the state_dict naming format.
Generally, as we add more models I anticipate some confusion around when we should create a new ModelType and when we should create a new ModelFormat.
Proposal
This is one possible direction, but I'm definitely open to other ideas.
- Allow each `ModelType` to define its own custom metadata. This allows each model to decide how to handle model-specific intricacies around model referencing, format handling, etc.
- Store the model-specific metadata in an unstructured json blob in the DB.
- Querying for models based on info in the unstructured blob would get more complicated, but I think can be solved.
- I'd have to think through how this would work with model conversions some more...
Again, this would be a large effort. I just want to make sure we're thinking about it now. It would probably only make sense as a follow-up to all of the changes proposed in this PR.
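To make the proposal concrete, here is a rough illustration of the "unstructured JSON blob" idea; the schema and field names are invented for the example, not proposed columns:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE model_config (
        id TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        base_model TEXT,
        model_type TEXT NOT NULL,
        config_json TEXT NOT NULL  -- full, possibly type-specific, config blob
    )
    """
)
conn.execute(
    "INSERT INTO model_config VALUES (?, ?, ?, ?, ?)",
    (
        "abc123",
        "my-ip-adapter",
        None,  # an IP-Adapter might reference several models rather than one base
        "ip_adapter",
        json.dumps({"clip_vision_model": "def456", "compatible_base": "sd-1"}),
    ),
)
conn.commit()
```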
You are absolutely correct about the weird relationship of these fields. In addition, there is ModelVariantType which is used to distinguish between normal, inpaint and depth models. The challenge is to be fully generalizable without creating a cross-product explosion of bases, types, formats and variants, and requires a fundamental rethinking of how we handle models.
Here are some of the metadata dimensions we need to capture:
- What is the model's role in the image generation process? e.g. "latents denoising", "image classification"
- What compatibility group(s) does the model interoperate with? e.g. "Any", "SD-1"
- How do I load this model? e.g. location of model weights, config and other components
- How do I run this model? e.g. how to patch a LoRA's statedict into the encoder and denoiser,
I'd like to leave the current system be at the current time and return to this question when we start discussing a normalized models database.
Agreed, leave as-is for now. Just wanted to raise it so that it's in the back of our minds as we make design decisions.
or from elsewhere in the code by accessing `ApiDependencies.invoker.services.model_record_store`.

### Creating a `ModelRecordService`
I think it's great that we are abstracting the underlying storage. I also think that the proposed class hierarchy and initialization logic could be simplified significantly via composition + dependency injection rather than inheritance. (I think it would also make it easier to write unit tests.)
I'm happy to write some pseudo code to explain what I mean if that would be helpful.
I've not seen any examples of how to do dependency injection in Python. Some pseudo code would be appreciated.
```python
# ---- base_model_config_store.py ----
from abc import ABC, abstractmethod


class BaseModelConfigStore(ABC):
    @abstractmethod
    def add_model(self, key, config):
        pass

    @abstractmethod
    def del_model(self, key):
        pass

    @abstractmethod
    def update_model(self, key, config):
        pass

    @abstractmethod
    def get_model(self, key):
        pass

    # ...


# ---- model_config_store_sql.py ----
class ModelConfigStoreSQL(BaseModelConfigStore):
    def __init__(self, db_conn):
        self._db_conn = db_conn

    def add_model(self, key, config):
        # TODO: implement
        ...

    # ...


# ---- model_config_store_yaml.py ----
class ModelConfigStoreYAML(BaseModelConfigStore):
    def __init__(self, file_path):
        self._file_path = file_path

    def add_model(self, key, config):
        # TODO: implement
        ...

    # ...


# ---- model_record_service.py ----
class ModelRecordService:
    def __init__(self, model_config_store):
        self._model_config_store = model_config_store

    def add_model(self, key, config):
        result = self._model_config_store.add_model(key, config)
        # Shared logic such as logging would happen at this layer. E.g.:
        self._logger.debug(f"Added model {key}")
        return result

    def rename_model(self, key, new_name):
        return self._model_config_store.update_model(key, {"name": new_name})


# ---- model_config_store.py ----
def init_model_config_store(db_conn, yaml_file_path) -> ModelRecordService:
    # This is where the dependency injection comes in: we inject the ModelConfigStore* dependency.
    if db_conn is not None:
        return ModelRecordService(model_config_store=ModelConfigStoreSQL(db_conn))
    elif yaml_file_path is not None:
        return ModelRecordService(model_config_store=ModelConfigStoreYAML(yaml_file_path))
    else:
        raise ValueError("either db_conn or yaml_file_path must be provided")
```

The improvements are minor, but significant:
- Simplified from having 2 class hierarchies that have to handle both SQL and YAML to having 1 class hierarchy that worries about this.
- Better separation of concerns. The ModelConfigStoreSQL is not responsible for knowing how to connect to the DB. The ModelRecordService does not need to know anything about the underlying storage.

When writing tests for this:
- We can run the same test suite against both the ModelConfigStoreSQL and ModelConfigStoreYAML.
- We can test the ModelRecordService by injecting a mock model_config_store (though it's a very thin layer, so not a lot to test).

configuration to choose the implementation of `ModelRecordServiceBase`.

### get_model(key, [submodel_type], [context]) -> ModelInfo:
This is a bit of a nit-pick, but I'd propose changing this API to `def get_model(self, key, context=None, **kwargs) -> ModelInfo:`. `submodel_type` is only relevant for a small subset of models, so it seems wrong to include it in the service-level API. It can be passed forward as a kwarg, and then model-specific `get_model(...)` implementations can expect it as a required argument.
The returned `ModelInfo` object shares some fields in common with `ModelConfigBase`, but is otherwise a completely different beast:
Is there a reason for returning these metadata fields other than simply: "they are readily available so might as well return them"?
These are the subset of metadata fields needed to load and run the model.
I'd rather return a ModelInfo object that contains a field for the model's ModelConfigBase rather than duplicating those fields, but the change would involve multiple detailed changes in Stalker's working model loading code.
| `base_model` | BaseModelType | Base model for this model |
| `type` | ModelType or SubModelType | Either the model type (non-main) or the submodel type (main models) |
| `location` | Path or str | Location of the model on the filesystem |
| `precision` | torch.dtype | The torch.precision to use for inference |
What is the precision intended to be used for at this point? We have access to the model itself via context to check the current dtype, and that seems like a more accurate source of information for the use cases I can think of.
I think this can be removed.
The Model Manager is responsible for organizing the various machine learning models used by InvokeAI. It consists of a series of interdependent services that together handle the full lifecycle of a model. These are the:
So far, I'm really liking the improved modularity of this design.
One thing I'm wondering is if we should more clearly define the model API that invocations are intended to interact with. Specifically, these services expose a bunch of useful endpoints, but it seems like only a small subset should probably be used by invocations.
Separately, do we have a definition somewhere of what a "Service" is intended to be? There is a lot more interdependency that I would have expected between these services. I think those interdependencies are necessary - I'm more so questioning if they should all be defined on equal footing as "Services".
The context object given to invocations is far too powerful. I'm hoping that 3.4 will include a restricted wrapper for this. Though this has been planned for some time, I've only just now made the issue for it: #4887
I'm thinking of "Services" as modules that have some specific purpose, which may be called by any other "service" and whose API is represented by an ABC. I don't think that exactly addresses your question, except to say that "service" currently refers more to the system design than scope of functionality.
Good points. From the point of view of the MM refactor, is this actionable?
What I'd like guidance on is how to split code among invokeai/backend/model_manager and invokeai/app/services. Right now there's a bit of mirroring going on in which code that goes into backend is anything that doesn't need to know anything about the invocation context and event bus, while code in services builds on top of the backend code using either composition or inheritance. Is this the best thing to do or should everything get stuffed into services?
I think the split that you described makes sense for now.
Things in backend will tend to be more "library"-like, and more easily covered by unit tests. The services are like a layer on top of that and the testing style would probably be different (though I haven't fully thought through this yet).
psychedelicious
left a comment
I forgot to click submit on my comments - I've removed the ones that were the same as Ryan's comments but there may be some overlap that I missed.
With the exception of the install service, each of these is a thin shell around a corresponding implementation located in `invokeai/backend/model_manager`. The main difference between the modules found in app services and those in the backend folder is that the former add support for event reporting and are more tied to the needs of the InvokeAI API.
There are some potential challenges around interdependent services. For example, access to the database outside of the app should be handled with great care.
needed. There is also a Union of all ModelConfig classes, called `AnyModelConfig`, that can be imported from the same file.

### Limitations of the Data Model
I think this is something we should iterate on. If this changes in the future it's a major, potentially user-impacting change (model records may need to be changed in a db migration, metadata affected possibly, etc).
Sure. What are your suggestions for alternative ways to manage the hierarchy of configuration classes?
The `ModelRecordService` provides the ability to retrieve model configuration records from SQL or YAML databases, update them, and write them back.
I'm still not very enthusiastic about allowing both YAML and SQLite here. That's a lot of extra surface area
To expand on this: If we believe there's value in supporting multiple storage backends, we could definitely manage it. But...
- I think it would be crucial that we have good unit test coverage of the storage interface (i.e. all of the supported queries get run against both backends). If the behavior of the storage backends were ever allowed to diverge, that could become very painful.
- We must recognize that some queries will be easier to write in one backend than the other. This may start to feel like a limitation at some point.
The `ModelRecordServiceBase.open()` method is specifically designed for use in the InvokeAI web server and to maintain compatibility with earlier iterations of
Why do we need to maintain compatibility? Aren't we totally replacing this code?
I appreciate wanting to maintain existing service APIs, but I don't think we are really beholden to anybody except ourselves at this point.
If you folks think that the open() API is itself ok, then I can just remove the remark about maintaining compatibility. All I meant by this is that if the user's invokeai.yaml config file specifies using configs/models.yaml, then the YAML storage implementation will be chosen so that the user has access to all their previous models. This handles legacy installations.
New installations will default to using the sqlite backend.
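A rough sketch of the selection behavior described above; the class names are placeholders rather than the PR's actual classes:

```python
from pathlib import Path


def open_model_record_store(models_conf_path: Path):
    # Placeholder classes: substitute whatever the PR's YAML and SQL implementations are called.
    if models_conf_path.suffix in (".yaml", ".yml"):
        return ModelRecordServiceYAML(models_conf_path)  # legacy installs keep their models.yaml
    return ModelRecordServiceSQL(models_conf_path)  # new installs default to the sqlite backend
```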
Gotcha. I'm partial to a single backend implementation, where if we see a reference to models.yaml, we migrate the app to use SQLite instead of retaining both sets of logic. But there is another thread to discuss that.
In case of an error, the Exception that caused the error will be placed in the `error` field, and the job's status will be set to `DownloadJobStatus.ERROR`.
What happens if the app is restarted with a download in progress? Does it clean up on next startup?
The code currently uses a tempfile.TemporaryDirectory to store downloads in progress. If the app is shutdown cleanly, then the directory and its contents will be deleted. If the app crashes or is killed in a way that doesn't let Python run its context exit routine, then the temporary directory will still be there on restart but ignored.
Here are a few alternatives of increasing complexity. Which do you prefer?
- On startup, the system scans for dangling temporary directories and deletes them and their contents.
- Instead of storing the temporary directory under the randomly-assigned name that `TemporaryDirectory` gives it, the name is assigned deterministically from the download source URL and the directory doesn't get deleted. This allows the user to restart the download, and it will resume where it left off using the partial files. (This is actually an easy code change, since restarting partial downloads is already implemented; see the sketch after this list.)
- Some state information about downloads in progress is written into the invokeai database, allowing all downloads to be resumed automatically at startup time.
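For illustration, a minimal sketch of the second option, deriving a stable scratch-directory name from the source URL (the naming scheme here is made up, not taken from the PR):

```python
import hashlib
import tempfile
from pathlib import Path


def scratch_dir_for(source_url: str) -> Path:
    """Return a per-URL scratch directory that survives restarts."""
    digest = hashlib.sha256(source_url.encode("utf-8")).hexdigest()[:16]
    path = Path(tempfile.gettempdir()) / f"invokeai-download-{digest}"
    path.mkdir(parents=True, exist_ok=True)
    return path
```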
I think the current behavior that you described is ok for now. If python is shut down in a way that prevents temp directory cleanup (should be rare), then the temp directory will be deleted by the OS on the next reboot. (That's how it works on linux at least - unconfirmed for Windows).
## Get on line: The Download Queue

InvokeAI can download arbitrary files using a multithreaded background
Does threading allow downloads to continue during generation, or do they need to wait until the process has free time? I'm not familiar with threads so you'll need to ELI5
Downloads shouldn't interfere with generation. Downloads spend most of their time in I/O wait state either fetching the next chunk of model over the SSL connection or writing the chunk out to disk. During this time the CPU is free to run other threads. In addition, operations in the GPU are independent of CPU activity.
I haven't actually been able to put this to the test, but I can say that I've interacted with the python shell while big downloads were going on in the same process, and there wasn't any loss of interactivity.
RyanJDick
left a comment
Finished reviewing the docs!
| `job_started` | float | | Timestamp for when the job started running |
| `job_ended` | float | | Timestamp for when the job completed or errored out |
| `job_sequence` | int | | A counter that is incremented each time a model is dequeued |
| `preserve_partial_downloads` | bool | False | Resume partial downloads when relaunched. |
This seems like it would be hard to achieve in a robust way. Maybe leave as a future feature?
preserve_partial_downloads? It actually isn't correctly implemented right now anyway, because the partial downloads are in a temporary directory that gets cleaned up when the app goes down. See the earlier comment thread about cleaning up partial downloads.
It should either be removed, or a robust resume system be implemented. I think I'll remove it for now.
colors as they are progressively specialized for particular download task.

The basic job is the `DownloadJobBase`, a pydantic object with the
This is an implementation detail, but I think we should consider splitting this into:
- `MultithreadedQueueJob`: Fields related to job queue management
- `DownloadJobConfig`: Fields needed to execute a download job
Names are open for refinement.
Sounds good. How's this?
```python
from __future__ import annotations


class MultithreadedQueueJob:
    priority: int
    id: int
    status: DownloadJobStatus
    error: Optional[Exception]
    job_started: Optional[float]
    job_ended: Optional[float]
    job_sequence: int
    job_config: DownloadJobConfig


class DownloadJobConfig:
    source: str
    destination: Path
    event_handlers: List[DownloadEventHandler]
```
I think that looks good 👍
After an error occurs, any partially downloaded files will be deleted from disk, unless `preserve_partial_downloads` was set to True at job creation time (or set to True any time before the error occurred). Note that since most InvokeAI model install operations involve downloading files to a temporary directory that has a limited lifetime, this flag is not used by the model installer.
most InvokeAI model install operations involve downloading files to a temporary directory
What are the cases when we don't use a temp directory?
I'd suggest that we should always use a temporary directory. It makes cleanup/recovery from failure much simpler.
The doc is incorrect. All model install operations use a temporary directory. I'll fix the doc. See above comments about resume capability
It is possible for a caller to pause download temporarily, in which case the events may look something like this:
What is the use case for pausing download jobs?
It does introduce some complexity, so I'm wondering if it's important in the first version.
It was a major pain to implement too. Unless @psychedelicious thinks pausing a download is important, I can remove this feature.
Don't think pausing is important
- enqueued
- running
- running
- paused
- running
- completed
@psychedelicious Have you thought about how you would handle this on the frontend? What happens if you miss a paused or completed event? Is it better to publish state at a fixed interval? Or have the frontend request state when needed?
There's a FastAPI route that lists all jobs and their status, so the frontend can request state when needed.
If desired, I can add a route that will retrieve the status of a job given its ID.
Yeah, that seems fine - use events, but also poll the route as needed.
on the InvokeAI event bus in order to monitor the progress of the requested installs.

The full list of arguments to `model_install()` is as follows:
Mixing arguments for various install sources makes this API confusing. I think it would be better to have separate methods for each source. Of course, at some layer we need to select between the different source types, but I think that can be done at a layer with a shorter argument list.
Agreed. How's this?
```python
model_install_from_path()   # Existing model on disk
model_install_from_url()    # Remote file, such as at civitai
model_install_from_repo()   # hugging face repository
```
Each method will accept some arguments that are shared (source, priority) and others that are specific (inplace, subfolder).
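A speculative sketch of what those signatures might look like, borrowing argument names from the table below; this is not the final API:

```python
from pathlib import Path
from typing import Any, Dict, Optional


def model_install_from_path(path: Path, *, inplace: bool = False,
                            probe_override: Optional[Dict[str, Any]] = None) -> "ModelInstallJob": ...


def model_install_from_url(url: str, *, priority: int = 10,
                           probe_override: Optional[Dict[str, Any]] = None) -> "ModelInstallJob": ...


def model_install_from_repo(repo_id: str, *, variant: Optional[str] = None,
                            subfolder: Optional[str] = None) -> "ModelInstallJob": ...
```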
That looks much clearer to me 👍
| `variant` | str | None | Desired variant, such as 'fp16' or 'onnx' (HuggingFace only) |
| `subfolder` | str | None | Repository subfolder (HuggingFace only) |
| `probe_override` | Dict[str, Any] | None | Override all or a portion of model's probed attributes |
| `metadata` | ModelSourceMetadata | None | Provide metadata that will be added to model's config |
Given that we support modifying metadata after installation via the ModelRecordServiceBase, is it necessary to support modification at install-time?
Maybe this is the most natural time to modify it 🤷♂️
Just intended as a convenience. I was thinking about a user who wants to add a set of tags to their models automatically, but since tags are going away....
@psychedelicious Any thoughts on this?
When you create an install job with `model_install()`, events will be passed to the list of `DownloadEventHandlers` provided at installer initialization time. Event handlers can also be added to individual model install jobs by calling their `add_handler()` method as described earlier for the `DownloadQueueService`.
We will have to be careful about multi-threading in the installer. The download step should definitely be parallelized, but the other installation steps (moving the model into place, updating the model record store, etc.) should perhaps be done on a single thread (or at least with appropriate locking) to avoid conflicts / corrupt models.
This is an implementation detail that can be discussed later, but just popped into my head when I saw that the DownloadQueueService might make callbacks to the installer.
Agree. I think that updating the model config store is ok for now because it uses a thread lock. There's no locking done when the model is being moved from the temporary directory to its final location, however, and I guess this could be a problem?
I'm wondering if the best way to implement this is to have the model installer class create a queue for handling the completion routines. When a download is finished, the download thread will place a completion routine on the queue, and then a single worker thread would dequeue and execute it. This would force all completion routines to run in a single-threaded manner.
Should all event callbacks run in a single dedicated thread, or should some run in the same thread as the download?
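To make the single-worker idea concrete, here is a bare-bones sketch (names are illustrative, not from the PR): download threads enqueue a completion callable, and one dedicated thread executes them serially so install steps never run concurrently.

```python
import queue
import threading
from typing import Callable

_completions: "queue.Queue[Callable[[], None]]" = queue.Queue()


def _completion_worker() -> None:
    while True:
        routine = _completions.get()
        try:
            routine()  # move the model into place, update the record store, etc.
        finally:
            _completions.task_done()


threading.Thread(target=_completion_worker, daemon=True).start()

# A download thread would then do something like:
# _completions.put(lambda: finish_install(job))
```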
Good questions ... I'm going to have to give this some more thought. I feel like there's a way to simplify the download queue while achieving all of the same goals, just need to find some time to dig a little deeper. Shouldn't hold up the first PR to switch to using model IDs at least.
For example, to temporarily pause all pending installations, you can do this:

```
installer.queue.pause_all_jobs()
```
This feels weird: we create an install job with the installer, but then have to reach into its members to control that job later.
I think it would be better to make the queue a private internal implementation detail of the installer (i.e. the installer would be the only thing that ever interacts directly with its queue). Can anyone think of a case where another service/object would have to interact directly with the download queue?
I can make the queue a private attribute.
directory to those in the `ModelConfigRecordService` database. New models are registered and orphan models are unregistered.

#### hash=installer.hash(model_path)
Is this a member of the installer for any reason? Or should it just be a standalone utility?
Just a convenience. There's a standalone class in the backend called FastModelHash that does the actual work. I'll just remove this from the code and docs.
Closing this now. Superseded by #5694
See #4071 and #4043
Description
NOTE: Do not merge yet! This contains API-breaking changes that will prevent the web app from working.
This is part of an implementation of the model manager refactors proposed in #4071 and #4043, which will together allow model configurations to be fetched and stored in a modular backend (file or DBMS). For the time being, I am placing the code in `invokeai/backend/model_management2` to avoid breaking the current code.

This PR implements file- and SQLite3 DB-based storage for model configuration stanzas. See the tests for the API.
NEW: Documentation can be found in the file docs/contributing/MODEL_MANAGER.md
Tasks to implement:
- `ModelConfigStorageABC`
- `ModelConfigStorageYAML`
- `ModelConfigStorageSQL`
- `ModelInstall`

Model downloading, type probing and installation will be in a separate PR.
Instantiating the model from the config is a third PR.
Installation
This PR implements a new format for `models.yaml` that is incompatible with the old one. Therefore a one-time conversion step is needed:

```
python scripts/convert_models_config_to_3.2.py --root <root> --outfile <outfile>
```

The `--root` argument points to the InvokeAI root containing the `models.yaml` you wish to read from. Point at the root directory and not at the `models.yaml` file. `--outfile` should be the path to a new or existing `.yaml` or `.db` file. If a yaml file is specified, then the YAML storage backend is used. If a `.db` file is specified, then a SQLite3 backend will be used. If no root is specified, the default will be chosen in the usual way. If no `--outfile` is specified, then the file `models-3.2.yaml` in the current directory will be created by default.

A reasonable location for the converted file is `~/invokeai/configs/models-3.2.yaml`. Edit `invokeai.yaml` to point to the new models config file; the `conf_path` option points to it. The output can also be written into `databases/invokeai.db`, but it will be harder to back out the changes.

After this, you can launch `invokeai-web` and use the Swagger API at `localhost:9090/docs` to list, search, install, delete and update models. The frontend is not currently working due to metadata changes.

Testing
Example Usage - Installing models, high-level interface
This example uses `wait_for_installs()` to block until all the download jobs complete. When used from the app, the downloads will generate a series of `model_event` events on the event bus, and provide a `job` object that includes metainformation about the download job, including the number of bytes downloaded and the amount still remaining.

One can also install an event handler to be called whenever the status of a job changes. You do this by passing a customized `DownloadQueue` object to the installer.

The queue can be manipulated using `DownloadQueue` methods. The installer's `install()` method returns a `ModelInstallJob` object (a subclass of `DownloadJobBase`), which can be passed to several queue manipulation methods.

Example Usage - Config and Storage
Example Usage - Synchronous registration of local models
Example Usage - Downloading some models to a directory
There is a similar interface in `invokeai.app.services.download_manager.py`, which additionally sends `download_job_event` events to the system event bus. This version of the download queue gets added to the `InvocationServices` API dependencies in a slot named `download_manager`.

Example Metadata Storage
When a model is downloaded and installed from HuggingFace or Civitai, its tags, author, license terms and other metadata are captured and stored in the configuration backend. Here is an example of what that looks like for a Civitai and a HuggingFace model:
Design Decisions
Here are design choices I made that I'm not 100% certain about.
New base config fields
I thought this would be a good time to add more metadata about models. So the base config looks like this:
During model installation, the installer will hash the model's contents (using something similar to PR #4180) and use it as its unique `id` field when the config is written to database or disk. The `get_model()` method uses this ID as its key to return the corresponding model config. The combination of name, type and base is no longer required to be unique.

All model config definitions are now in one file

The definitions live in `invokeai/backend/model_management2/model_config.py`. The current model manager defines a series of model-specific modules, e.g. `lora.py`, which know how to parse their configuration options, load their model type from disk, convert the model from checkpoint to diffusers format, etc. This kept all model-specific logic together, but made the code somewhat dense. In the current design, all the configurations are kept in a single place and there is a factory method, `ModelConfigFactory.make_config()`, that turns a raw dict of model configurations into the appropriate class based on field discriminators.

Storing a JSON blob in the sqlite3 database
In the sqlite3 backing store, I broke out separate columns for the model `id`, `name`, `base_model`, `model_type` and `tags` fields for use in search and retrieval. All the other fields are in a serialized JSON string. For simplicity in serialization/deserialization, I do not remove these fields before serialization. As a result, the same data is in both the broken-out fields and the JSON blob. However, I think I do handle updates correctly, and so the only issue is redundant storage.
I did not break the
pathout as a separate field in the sql implementation. This is likely something that needs to be changed, because I can see use cases in which we need to know whether two models share the same path.Downloading repo_ids from HuggingFace
The code for this is a bit baroque because it initiates multiple sub-download jobs. It works like this:
model_index.json. If one is there, we download its contents in the current thread, parse it, and figure out what subdirectories need to be downloaded.pytorch_model.safetensorsis preferred overpytorch_model.binif both are present;pytorch_model.fp16.safetensorsis preferred overpytorch_model.safetensorsif thefp16variant is requested.related issues