If you want to follow along with this tutorial and try MLEM, you can use our example repo.
$ git clone https://github.com/iterative/example-mlem-get-started
$ cd example-mlem-get-started
$ git checkout 1-dvc-mlem-init
Next let's create an isolated virtual environment to cleanly install all the requirements (including MLEM) there:
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
Storing binary files in Git is often a bad idea, especially when they are large. To solve this, MLEM can use DVC to connect external cloud storage for model and dataset versioning.
We will reorganize our example repo to use DVC.
First, let’s initialize DVC and add a remote (we will use a local one for easier testing, but you can use whatever is available to you):
$ dvc init
$ dvc remote add myremote -d /tmp/dvcstore/
$ git add .dvc/config
Now, we also need to set up MLEM so it knows to use DVC.
$ mlem config set core.storage.type dvc
✅ Set `storage.type` to `dvc` in repo .
Also, let’s add .mlem files to .dvcignore so that metafiles are ignored by DVC.
$ echo "/**/?*.mlem" > .dvcignore
$ git add .dvcignore
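To see which paths the pattern is meant to catch, here is a minimal Python sketch. It is only an approximation: DVC applies gitignore-style matching to the whole path, while this sketch tests just the last path component against the final glob of the pattern. The helper name is hypothetical and not part of MLEM or DVC. The point it illustrates is that `?*` requires at least one character before `.mlem`, so metafiles such as rf.mlem match while the .mlem directory itself does not.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Final component of the .dvcignore pattern "/**/?*.mlem".
METAFILE_GLOB = "?*.mlem"


def is_ignored_metafile(path: str) -> bool:
    """Approximate the pattern by testing only the last path component.

    Hypothetical helper for illustration; DVC itself uses gitignore-style
    matching on the full path.
    """
    return fnmatchcase(PurePosixPath(path).name, METAFILE_GLOB)


# Metafile, model binary, and the .mlem directory itself.
for candidate in [".mlem/model/rf.mlem", ".mlem/model/rf", ".mlem"]:
    print(candidate, is_ignored_metafile(candidate))
```

Only the first path is reported as ignored, which is what lets DVC keep tracking the binaries under .mlem/ while leaving the metafiles to Git.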
Next, let’s remove artifacts from Git and re-save them, so MLEM can use the new storage for them. You don't need to change a single line of code.
$ git rm -r --cached .mlem/
$ python train.py
Finally, let’s add and commit new metafiles to Git and artifacts to DVC, respectively:
$ dvc add .mlem/model/rf
$ git add .mlem
$ git commit -m "Switch to dvc storage"
...
$ dvc push -r myremote
$ git push
...
Now, you can load MLEM objects from your repo even though there are no actual binaries stored in Git. MLEM will know to use DVC to load them.
DVC pipelines are a useful DVC mechanism for building data pipelines in which you process your data and train your model. You may already be training your ML models in them and want to start using MLEM to save those models.
MLEM can be easily plugged into existing DVC pipelines. You'll need to mark .mlem files as cache: false outputs of a pipeline stage (see https://dvc.org/doc/user-guide/project-structure/pipelines-files#output-subfields).
Let's continue using the example from above. First, let's stop tracking the artifact .mlem/model/rf in DVC.
$ dvc remove .mlem/model/rf.dvc
Now let's create a simple pipeline to train your model:
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - .mlem/model/rf
      - .mlem/model/rf.mlem:
          cache: false
The binary was already in, so there's no need to add it again. For the metafile, we've added two lines and specified cache: false so that DVC tracks it while it stays stored in Git.
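If a stage produces several MLEM models, the outs list follows the same pattern for each of them: the binary as a plain entry, and the metafile as an entry with cache: false. The sketch below builds that structure as plain Python data for the train stage described here. The helper name is hypothetical and is not part of MLEM or DVC; writing the result out as dvc.yaml would additionally require a YAML library, which is left out here.

```python
def mlem_stage_outs(model_paths):
    """Build the 'outs' entries of a dvc.yaml stage for MLEM models.

    Hypothetical helper for illustration only.
    """
    outs = []
    for path in model_paths:
        # Binary: a plain entry, so DVC caches it (the default behaviour).
        outs.append(path)
        # Metafile: tracked by DVC but not cached, so it stays in Git.
        outs.append({f"{path}.mlem": {"cache": False}})
    return outs


# Same structure as the dvc.yaml file for the single model in this tutorial.
dvc_yaml = {
    "stages": {
        "train": {
            "cmd": "python train.py",
            "deps": ["train.py"],
            "outs": mlem_stage_outs([".mlem/model/rf"]),
        }
    }
}
print(dvc_yaml["stages"]["train"]["outs"])
```

Keeping the pairing in one place makes it harder to add a binary to the pipeline and forget its metafile.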
You can verify everything is working by running the pipeline:
$ dvc repro
Running stage 'train':
> python train.py
Use `dvc push` to send your updates to remote storage.
Now DVC will take care of storing binaries, so you'll only need to commit the model metafile (.mlem/model/rf.mlem) and dvc.lock. Learn more about DVC and how it can be useful for training your ML models.