ML Version Control Fundamentals

Overview

ML Version Control is the practice of managing changes to the three pillars of a Machine Learning system — Code, Data, and Environment/Config — that together produce a Model. Unlike traditional software engineering, where versioning code is sufficient, ML sets a stricter requirement for reproducibility: you must be able to restore the exact state of the data and the training environment to reproduce a specific model version.

Key Ideas / Intuition

The Trinity of Artifacts

To reproduce a bug or a model in ML, you need a snapshot of three things simultaneously. If any one of these changes, the output (Model) changes.

```mermaid
graph TD
    Data[Data] --> Training
    Code[Code / Algo] --> Training
    Config[Hyperparams/Env] --> Training
    Training --> Model[Model Artifact]
```

The “Time Machine” Problem

Imagine you trained a model 3 months ago that had 90% accuracy. Today’s model has 85%. You check out the old code (git commit), run it, and get… 82%. Why?

  • The Data in the database changed.
  • The Dependencies (library versions) updated.
  • You didn’t version the Data snapshot that was used 3 months ago.

Data vs. Code (The Storage Problem)

  • Git is designed for text (line-by-line diffs). It chokes on large binary files (datasets, model weights).
  • Solution: Store Data in object storage (S3, GCS) and store a pointer (small text file with hash) in Git.

Mathematical Foundation

Conceptually, a trained model M is a function of the training dataset D, the algorithm code C, and the hyperparameters/environment H:

M = f(D, C, H)

For strict reproducibility:

M_v = f(D_v, C_v, H_v)

where the subscript v denotes the version snapshot. Pin all three inputs and you can regenerate the same model; let any one drift and you cannot.
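The equation also implies determinism: with data, code, and hyperparameters (including the random seed, which belongs to H) pinned, two training runs must yield the same artifact. A toy stand-in for a training loop makes this concrete — `train` here is a hypothetical illustration, not a real trainer:

```python
import random

def train(data, seed):
    """Stand-in for a training loop: fully determined by (data, seed)."""
    rng = random.Random(seed)
    # "weights" = data plus seeded noise; same inputs -> same outputs
    return [x + rng.gauss(0, 0.01) for x in data]

data_v1 = [0.1, 0.2, 0.3]        # pinned dataset snapshot (D_v)
run_a = train(data_v1, seed=42)  # same C, same H
run_b = train(data_v1, seed=42)
assert run_a == run_b            # same D, C, H -> same M
```

In real pipelines, bitwise reproducibility additionally requires pinning library versions and enabling framework determinism flags — which is exactly why the environment counts as part of H.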

Practical Application

The Tool Stack

  • Git: Handles Code (C) and Config (H).
  • DVC (Data Version Control): Handles Data (D) and Model Artifacts (M). It acts as a layer on top of Git.
  • MLflow / Weights & Biases: Logs metrics and hyperparameters (The “Captain’s Log” of experiments).

Typical Workflow (DVC + Git)

  1. Change Data: You add new images to data/raw/.
  2. Version Data: Run dvc add data/raw.
    • This creates data/raw.dvc (the pointer).
    • It moves actual data to .dvc/cache (and later pushes to S3).
  3. Version Pointer: Run git add data/raw.dvc.
  4. Commit: git commit -m "Updated dataset with Q4 images".
    • Now, that Git commit is linked — via the content hash stored in the pointer file — to that exact version of the dataset.
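Under the hood, the cache is content-addressed: the file's MD5 is split into a two-character directory prefix plus the remaining hex digits as the filename. A minimal sketch of this classic layout (recent DVC versions nest it one level deeper, under a files/md5/ prefix):

```python
import hashlib
from pathlib import Path

def cache_path(cache_root: Path, content: bytes) -> Path:
    """Content-addressed location in the style of DVC's classic cache:
    first two hex chars of the MD5 become a directory, the rest the filename."""
    digest = hashlib.md5(content).hexdigest()
    return cache_root / digest[:2] / digest[2:]
```

Because the address is derived from the bytes themselves, identical files dedupe for free, and the hash in `data/raw.dvc` is all Git needs to locate the data later.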

When to Use

  • ALWAYS for production ML pipelines.
  • ALWAYS when individual files exceed 100MB (GitHub’s hard per-file push limit; Git itself also degrades badly on large binaries).
  • ALWAYS when collaboration requires sharing datasets.

Comparisons

| Feature | Git | Git LFS (Large File Storage) | DVC |
| --- | --- | --- | --- |
| Primary Target | Text / Code | Large Binaries | Data & Models |
| Storage Backend | Internal / Git Host | Git Host Server | Agnostic (S3, GCS, Azure, SSH, Local) |
| Deduplication | Line-based | File-based | File-based (Content Addressable) |
| ML Awareness | None | None | High (Pipelines, Metrics) |
| Coupling | Tightly coupled | Bound to Git Repo | Loosely coupled metadata |

NOTE

Why not just Git LFS? LFS stores files on the Git server (e.g., GitHub). Storage there is expensive, and you often want your training data in the same cloud bucket as your compute (e.g., S3 + EC2) for speed. DVC allows using your own bucket.
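Pointing DVC at your own bucket is a one-time remote configuration (created with `dvc remote add -d myremote s3://…`). The resulting `.dvc/config` looks roughly like this — the bucket name here is a made-up example:

```ini
[core]
    remote = myremote
['remote "myremote"']
    url = s3://my-training-data/dvc-store
```

Because this file is plain text, it lives in Git alongside the pointers, so collaborators who clone the repo automatically know where `dvc pull` should fetch the data from.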

Resources


Back to: 03 - MLOps & Infrastructure Index