Storing binary files in Git is often a bad idea, especially when they are large. To solve this, MLEM can use DVC to connect external cloud storage for model and dataset versioning.
We will reorganize our example repo to use DVC.
First, let’s initialize DVC and add a remote (we will use a local one for easier testing, but you can use whatever is available to you):
$ dvc init
$ dvc remote add myremote -d /tmp/dvcstore/
$ git add .dvc/config
Now, we also need to set up MLEM so it knows to use DVC:
$ mlem config set core.storage.type dvc
✅ Set `storage.type` to `dvc` in repo .
Also, let’s add .mlem files to .dvcignore so that metafiles are ignored by DVC:
$ echo "/**/?*.mlem" > .dvcignore
$ git add .dvcignore
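To see why the pattern starts with `?*` rather than a bare `*`, here is a rough Python sketch. It only approximates gitignore-style matching: it applies the last component of the pattern (`?*.mlem`) to file names using the standard-library `fnmatch` module. The helper name `is_mlem_metafile` is hypothetical, and DVC's own matcher may behave differently in edge cases.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Last component of the "/**/?*.mlem" pattern from .dvcignore.
# "?" requires at least one character before ".mlem".
NAME_PATTERN = "?*.mlem"


def is_mlem_metafile(path: str) -> bool:
    """Approximate the .dvcignore rule by matching only the file name.

    Hypothetical helper for illustration; it does not call DVC.
    """
    return fnmatchcase(PurePosixPath(path).name, NAME_PATTERN)


if __name__ == "__main__":
    for candidate in (".mlem/model/rf.mlem", ".mlem/model/rf", ".mlem"):
        print(candidate, is_mlem_metafile(candidate))
```

In this approximation the metafile `rf.mlem` matches, while the binary `rf` and the `.mlem` directory itself do not, which is what lets DVC keep tracking the binaries stored inside that directory.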
Next, let’s stop Git from tracking the already indexed binaries and re-save the artifacts, so MLEM can use the new storage for them. You don't need to change a single line of code:
$ git rm -r --cached .mlem/
$ python train.py
Finally, let’s add and commit new metafiles to Git and artifacts to DVC, respectively:
$ dvc add .mlem/model/rf
$ git add .mlem
$ git commit -m "Switch to dvc storage"
...
$ dvc push -r myremote
$ git push
...
Now, you can load MLEM objects from your repo even though there are no actual binaries stored in Git. MLEM will know to use DVC to load them.
DVC pipelines are a useful DVC mechanism for building data pipelines in which you process your data and train your model. You may already be training your ML models in them and want to start using MLEM to save those models.
MLEM can be easily plugged into existing DVC pipelines. You'll need to mark .mlem files as cache: false outputs of a pipeline stage.
Let's continue using the example from above. First, let's stop tracking .mlem/model/rf in DVC and stop ignoring MLEM files in .dvcignore:
$ dvc remove .mlem/model/rf.dvc
# we can delete the file since there are no other records
# besides the one we added above:
$ git rm .dvcignore
Now let's create a simple pipeline to train your model:
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - .mlem/model/rf
      - .mlem/model/rf.mlem:
          cache: false
The binary was already in, so there's no need to add it again. For the metafile, we've added two rows and specified cache: false to track it with DVC while storing it in Git.
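The effect of cache: false on each output can be sketched in Python. The snippet below mirrors the outs list of the train stage as a plain Python structure and splits it by where the content ends up. The function name `split_outs` is hypothetical, and the sketch assumes DVC's default of caching an output unless cache: false is set.

```python
# The "outs" section of the train stage, written as the Python structure
# a YAML parser would produce: a plain string, or a one-key mapping
# from the path to its options.
OUTS = [
    ".mlem/model/rf",
    {".mlem/model/rf.mlem": {"cache": False}},
]


def split_outs(outs):
    """Return (cached_in_dvc, kept_in_git) lists of output paths.

    Hypothetical helper: an output counts as cached unless it sets
    cache to False, mirroring DVC's default.
    """
    cached, in_git = [], []
    for entry in outs:
        if isinstance(entry, str):
            path, options = entry, {}
        else:
            # A mapping entry has exactly one key: the output path.
            (path, options), = entry.items()
        target = cached if (options or {}).get("cache", True) else in_git
        target.append(path)
    return cached, in_git


if __name__ == "__main__":
    cached, in_git = split_outs(OUTS)
    print("stored by DVC:", cached)
    print("stored in Git:", in_git)
```

Under these assumptions the model binary goes to DVC storage, while the metafile is still tracked as a stage output but its content stays in Git.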
You can verify everything is working by running the pipeline:
$ dvc repro
Running stage 'train':
> python train.py
Use `dvc push` to send your updates to remote storage.
Now DVC will take care of storing binaries, so you'll need to commit only the model metafile and dvc.lock. Learn more about DVC and how it can be useful for training your ML models.