GTO helps you build an Artifact Registry out of your Git repository. It creates annotated Git tags with a special format and manages an artifacts.yaml file.
Storing large files in Git repos is not a good practice. Avoid committing your ML artifacts to Git. You can use DVC, Git LFS, or any other method to commit pointers to the data, models, etc. instead.
Using Git tags to register artifact versions and assign stages is handy, but the Git tag itself doesn't contain a path to the artifact files, their type (model or dataset), or any other useful information about them. For simple projects (e.g. a single artifact) we can assume the details whenever we consume the artifacts (e.g. in CI/CD). But for more advanced cases, we should codify them in the registry itself.
To keep this metadata, GTO uses a human-readable artifacts.yaml file. The gto describe, gto annotate, and gto remove commands are used to display and manage its contents.
An example artifacts.yaml can be found in the example-gto repo.
You may need to deliver a specific artifact version to a certain environment, most likely the latest one or the one currently assigned to a stage. GTO doesn't provide a way to deliver the artifacts, but you can use DVC or employ MLEM for that.
Use gto show to find the Git reference (tag) you need (note that CI platforms may expose it for you, e.g. the GITHUB_REF env var in GitHub Actions):
$ gto show [email protected] --ref
[email protected]
$ gto show churn#prod --ref # by assigned stage
[email protected]
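When the CI platform already exposes the reference, the tag name can be read from it directly. The following is a minimal sketch, assuming GITHUB_REF holds a full Git ref of the form refs/tags/<tag> for a tag-triggered run; the helper name tag_from_ref and the sample tag in the usage comment are hypothetical and not part of GTO.

```python
import os

TAG_PREFIX = "refs/tags/"


def tag_from_ref(ref):
    """Return the tag name for a full Git ref, or None if the ref is not a tag."""
    if ref.startswith(TAG_PREFIX):
        return ref[len(TAG_PREFIX):]
    return None


# In a GitHub Actions job, GITHUB_REF is provided by the platform.
# Outside of CI the variable is usually unset, so fall back to an empty string.
current_tag = tag_from_ref(os.environ.get("GITHUB_REF", ""))
```

For a branch push the helper returns None, which gives a workflow a simple way to skip registry-related steps.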
You may need the artifact's file path. If annotated, it can be discovered with gto describe:
$ gto describe [email protected] --path
models/churn.pkl
A popular deployment option is to use CI/CD (triggered when Git tags are pushed). For general details, check out GitHub Actions, GitLab CI/CD, or CircleCI.
The other option is to configure webhooks that will send HTTP requests to your server upon pushing Git tags to the remote.
Finally, you can configure your server to query your Git provider, for example via its REST API, to check whether changes have happened. As an example, check out the GitHub REST API.
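The polling option can be sketched as a loop that remembers which tags it has already seen. This is only an illustration under stated assumptions: fetch_tags is a hypothetical callable that you supply to return the current tag names from your Git provider (the actual API call is omitted), and on_new_tag is a hypothetical callback that triggers your deployment logic.

```python
import time


def poll_for_new_tags(fetch_tags, on_new_tag, interval_seconds=60.0,
                      max_polls=None, sleep=time.sleep):
    """Call on_new_tag once for every tag that appears after startup.

    fetch_tags: callable returning an iterable of tag names (user supplied).
    max_polls:  stop after this many polls; None means run forever.
    sleep:      injectable for testing; defaults to time.sleep.
    """
    # Tags that already exist at startup form the baseline and are not reported.
    seen = set(fetch_tags())
    polls = 0
    while max_polls is None or polls < max_polls:
        sleep(interval_seconds)
        for tag in fetch_tags():
            if tag not in seen:
                seen.add(tag)
                on_new_tag(tag)
        polls += 1
```

Compared to webhooks, polling needs no publicly reachable endpoint, at the cost of a delay of up to one interval before a new tag is noticed.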
To act upon registrations and assignments (Git tags), you can create a simple CI workflow. To see an example, check out the workflow in the example-gto repo.
The workflow uses the GTO GH Action that fetches all Git tags (to correctly interpret the Registry), finds out the version of the artifact that was registered, the stage that was assigned, and annotation details such as path, type, description, etc., so you can use them in the next steps of the CI.
If you would like to set up CI/CD, but don't want to use the GTO GH Action, check out the gto show, gto check-ref, and gto describe commands.
To configure GTO, use a .gto file in the root of your repo:
# .gto config file
types: [model, dataset] # list of allowed Types
stages: [dev, stage, prod] # list of allowed Stages
When allowed Stages or Types are specified, GTO will check the commands you run and error out if you provide a value that doesn't exist in the config. Note that GTO applies the config from the workspace, so if you want to apply the config from the main branch, you need to check it out first with git checkout main.
Alternatively, you can use environment variables (note the GTO_ prefix):
$ GTO_EMOJIS=false gto show
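The effect of the allowed Stages and Types lists described above can be illustrated with a small check. This is a hypothetical sketch of the behavior, not GTO's actual implementation: the config is shown as a plain dictionary mirroring the .gto example, and the function name check_allowed is invented for illustration.

```python
# Mirrors the example .gto file; in practice GTO reads this from the workspace.
CONFIG = {
    "types": ["model", "dataset"],
    "stages": ["dev", "stage", "prod"],
}


def check_allowed(kind, value, config):
    """Raise ValueError if `value` is not in the allowed list for `kind`.

    A missing or empty list is treated as "no restriction", matching the
    statement that checks apply only when allowed values are specified.
    """
    allowed = config.get(kind)
    if allowed and value not in allowed:
        raise ValueError(
            f"{value!r} is not an allowed value for {kind}: {allowed}"
        )
    return value
```

With the config above, check_allowed("stages", "prod", CONFIG) passes, while an unknown stage such as "qa" raises an error.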
You can work with GTO without knowing these conventions, since gto commands take care of everything for you.
All events use standard Git tag formats:
{artifact_name}@{version_number}#{e} for version registration.
{artifact_name}@{version_number}!#{e} for version deregistration.
{artifact_name}#{stage}#{e} for stage assignment.
{artifact_name}#{stage}!#{e} for stage unassignment.
{artifact_name}@deprecated#{e} for artifact deprecation.
All of them share two parts:
{artifact_name} prefix part.
#{e} counter at the end that can be omitted (in "simple" Git tag format).
Generally, the #{e} counter is used because Git doesn't allow two tags with the same name. If you want two Git tags that assign the dev stage to the model artifact without the counter (model#dev), you must delete the old Git tag first. Consequently, the history of events is not preserved.
By default, #{e} is omitted for some events and included for others. The defaults omit #{e} where it is rarely necessary, e.g. for version registrations and artifact deprecations.
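The tag formats listed above can be recognized with a few regular expressions. The sketch below is an illustration only, not GTO's own parser. It assumes that artifact names, versions, and stages never contain the characters @, # or !, and that the counter is a plain integer; the function name parse_tag and the returned dictionary layout are hypothetical.

```python
import re

_NAME = r"(?P<name>[^@#!]+)"
_COUNTER = r"(?:#(?P<counter>\d+))?"  # the optional #{e} suffix

_DEPRECATION = re.compile(rf"^{_NAME}@deprecated{_COUNTER}$")
_VERSION = re.compile(rf"^{_NAME}@(?P<version>[^@#!]+)(?P<negated>!)?{_COUNTER}$")
_STAGE = re.compile(rf"^{_NAME}#(?P<stage>[^@#!]+)(?P<negated>!)?{_COUNTER}$")


def _event(kind, match, value):
    counter = match["counter"]
    return {
        "event": kind,
        "name": match["name"],
        "value": value,  # version or stage; None for deprecation
        "counter": int(counter) if counter is not None else None,
    }


def parse_tag(tag):
    """Classify a Git tag name as a registry event, or return None."""
    # Deprecation is checked first because it also fits the version pattern.
    match = _DEPRECATION.match(tag)
    if match:
        return _event("deprecation", match, None)
    match = _VERSION.match(tag)
    if match:
        kind = "deregistration" if match["negated"] else "registration"
        return _event(kind, match, match["version"])
    match = _STAGE.match(tag)
    if match:
        kind = "unassignment" if match["negated"] else "assignment"
        return _event(kind, match, match["stage"])
    return None
```

Tags that follow none of the formats, such as an ordinary release tag without @ or #, yield None and can be ignored by a consumer.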