To use MLEM with Git and enable GitOps, we need to commit MLEM models to a Git
repository. While committing .mlem metafiles is easy, model binaries and
datasets are too heavy to store in Git. To fix that, we suggest using DVC. DVC
stores objects in remote storage, allowing us to commit just pointers to them.
This page explains how to use DVC with an existing MLEM project. We will reorganize our example repo to showcase that.
If you want to follow along with this tutorial, you can use our example repo.
$ git clone https://github.com/iterative/example-mlem-get-started
$ cd example-mlem-get-started
Next, let's create a Python virtual environment to cleanly install all the
requirements (including DVC and MLEM) with pip.
$ python3 -m venv .venv
$ source .venv/bin/activate
$ pip install -r requirements.txt
First, let’s initialize DVC and add a DVC remote (we will use a local one for easier testing, but you can use whatever is available to you):
$ dvc init
$ dvc remote add myremote -d /tmp/dvcstore/
$ git add .dvc/config
Now, we also need to set up MLEM so it knows to use DVC.
$ mlem config set core.storage.type dvc
✅ Set `storage.type` to `dvc` in repo .
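After the initial configuration is done, we need to decide how we're going to use MLEM with DVC: either track only the artifacts with DVC while committing metafiles to Git, or produce the model in a DVC pipeline. The first approach is described next, followed by pipelines.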
Let’s add .mlem files to .dvcignore so that metafiles are ignored by DVC.
$ echo "/**/?*.mlem" > .dvcignore
$ git add .dvcignore
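To see what the pattern above covers, here is a rough Python sketch. It assumes .dvcignore follows gitignore-style matching, where a leading / anchors the pattern to the directory containing .dvcignore, **/ spans any number of directories, and ? stands for exactly one non-slash character. The regular expression and the is_ignored_metafile helper are illustrative approximations for file paths only, not DVC's actual matcher.

```python
import re

# Hand-written approximation of the .dvcignore pattern "/**/?*.mlem":
#   (?:[^/]+/)*  -> "**/": zero or more parent directories
#   [^/]         -> "?":   exactly one character that is not a slash
#   [^/]*        -> "*":   any further characters within the file name
#   \.mlem       -> the literal suffix
MLEM_METAFILE = re.compile(r"(?:[^/]+/)*[^/][^/]*\.mlem")


def is_ignored_metafile(path: str) -> bool:
    """Return True if a repo-relative file path looks like a .mlem metafile."""
    return MLEM_METAFILE.fullmatch(path) is not None
```

Under this approximation, models/rf.mlem is matched and models/rf is not. A path whose last component is exactly .mlem is also left alone, because ? requires at least one character before the suffix.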
We may need to stop Git from keeping already indexed binaries. For our example repo, that would be:
$ git rm -r --cached models data
Now we need to regenerate them:
$ python train.py
Finally, let’s add and commit new metafiles to Git and artifacts to DVC, respectively:
$ dvc add models/rf
$ git add models
$ git commit -m "Switch to dvc storage"
...
$ dvc push -r myremote
$ git push
...
Now, you can load MLEM objects from your repo even though there are no actual binaries stored in Git. MLEM will know to use DVC to load them.
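As a minimal sketch of what that looks like from Python, assuming the example repo layout used here (a model saved at models/rf) and that MLEM is installed in the active environment. The load_model wrapper is a hypothetical helper introduced for illustration; mlem.api.load is the MLEM call it relies on.

```python
def load_model(path: str = "models/rf"):
    """Load a MLEM object by path; intended to be run from the repository root.

    The default path is an assumption taken from the example repo.
    """
    # Imported lazily so this module can be imported without MLEM installed.
    from mlem.api import load

    # MLEM reads the Git-tracked metafile; with DVC storage configured,
    # the binaries are expected to be resolved through DVC.
    return load(path)
```

Calling load_model() from the repository root should return the trained model object, provided the binaries are available in the local DVC cache or the DVC remote is reachable.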
DVC pipelines are a mechanism to build data pipelines, in which you can process your data and train your model. You may already be training your ML models in them and want to start using MLEM to save those models.
MLEM can be easily plugged into existing DVC pipelines. You'll need to mark
.mlem files as cache: false outputs of a pipeline stage.
Let's create a simple pipeline to train your model:
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    outs:
      - models/rf
      - models/rf.mlem:
          cache: false
We mark the metafile with cache: false so that the DVC pipeline is aware of it
while it is still committed to Git.
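A quick way to catch a metafile that was accidentally left cached is to inspect the parsed stage definition. The sketch below assumes dvc.yaml has already been parsed into a Python dict (for example with a YAML library) and relies on DVC caching outputs by default unless cache: false is given. cached_mlem_outs is a hypothetical helper, not part of DVC or MLEM.

```python
from typing import Any, Dict, List


def cached_mlem_outs(stage: Dict[str, Any]) -> List[str]:
    """Return .mlem outputs of a stage that are NOT marked cache: false."""
    problems = []
    for out in stage.get("outs", []):
        if isinstance(out, str):
            # Plain entry such as "models/rf": no options, so DVC caches it.
            path, options = out, {}
        else:
            # Entry with options: a single-key mapping {path: {option: value}}.
            (path, options), = out.items()
            options = options or {}
        if path.endswith(".mlem") and options.get("cache", True):
            problems.append(path)
    return problems


# The train stage from the dvc.yaml above, as a parsed dict.
train_stage = {
    "cmd": "python train.py",
    "deps": ["train.py"],
    "outs": ["models/rf", {"models/rf.mlem": {"cache": False}}],
}
print(cached_mlem_outs(train_stage))  # → []
```

An empty list means every .mlem output in the stage is excluded from the DVC cache, so it can be committed to Git.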
You can verify everything is working by running the pipeline:
$ dvc repro
Running stage 'train':
> python train.py
Use `dvc push` to send your updates to remote storage.
Now DVC will take care of storing binaries, so you'll only need to commit the
model metafile (models/rf.mlem) and dvc.lock to Git.
Learn more about DVC and how it can be useful for training your ML models.
If you commit model metafiles to a private repo and use DVC to store binaries,
you'll need to authenticate both via SSH and via HTTPS. SSH authentication is
required for DVC, since DVC shallow clones the repo underneath via SSH. MLEM
instead uses fsspec's GitHubFileSystem to access the repo, which uses HTTPS for
authentication.
SSH authentication is usually achieved by running git push against an SSH
remote, or can be done using gh auth login (if you use GitHub).
HTTPS authentication is done by setting the GITHUB_USERNAME and GITHUB_TOKEN
environment variables. You can generate a token in your GitHub account settings
or get one from the command line with gh auth token.
It's important to first authenticate with SSH, and only then with HTTPS.
Otherwise, running gh auth login will complain that GITHUB_USERNAME and
GITHUB_TOKEN were already set (it assumes there should be a single
authentication method in place, while we need both).
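Before loading from a private repo, it can help to confirm that the HTTPS credentials are actually visible to the process. The sketch below only checks that the two environment variables named above are set and non-empty. missing_https_credentials is a hypothetical helper, and it does not validate the token against GitHub.

```python
import os
from typing import List, Mapping, Optional

# Variable names taken from the text above.
REQUIRED_VARS = ("GITHUB_USERNAME", "GITHUB_TOKEN")


def missing_https_credentials(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    # Fall back to the real process environment when no mapping is passed in.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means both variables are present. Any names returned still need to be exported, after SSH authentication has been completed.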