I have started using git-lfs more. It's been really helpful for keeping a bunch of BLOBs along with related code. Compared to a URL-based approach, this provides a partial Content Addressable Memory[1], which is nicer in the reproducibility sense.
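As a rough sketch of what that looks like in practice (the file patterns and names here are just illustrative):

```
# one-time setup of the lfs hooks in this clone
git lfs install

# tell lfs to manage these patterns; this writes .gitattributes
git lfs track "*.h5" "*.bin"
git add .gitattributes

# from here on, matching files are committed as small lfs pointers,
# with the actual content pushed to the lfs storage
git add model.h5
git commit -m "Add model weights via git-lfs"
git push
```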
Naturally I had to check out DVC after someone pointed me to this blog post. I haven't used DVC extensively (I have tried the tutorial and gone through little bits of its internals), but I don't really see it as the major philosophical upgrade it is popularly claimed to be. That, of course, is not a strike against an otherwise great project.
There are two major axes of difference (and similarity) with respect to currently used solutions:
- Pipeline and workflow specification.
- Support for large data science files across multiple hosts.
In an attempt to merge these two, DVC enforces a certain workflow for pipelines, which makes me a little uncomfortable. If I want minimal invasion, I will still keep other tools like snakemake and a bunch of not-so-reproducible scripts where I don't really want hash verification of files and the like. Even though I can mix this in wherever needed, I am not sure the experience will be pleasant unless I do everything via DVC.
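For context, the pipeline workflow I'm referring to looks roughly like this in recent DVC versions (the stage name, dependencies, and script are hypothetical):

```
# declare a stage with its dependencies and outputs;
# DVC records this in dvc.yaml and pins output hashes in dvc.lock
dvc stage add -n train \
    -d train.py -d data/raw \
    -o models/model.pkl \
    python train.py

# re-run only the stages whose dependencies changed
dvc repro
```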
The second point actually makes it a better git-lfs than git-lfs, so I might try using it more, starting from the large-file side and then seeing if I need the pipeline piece of it.
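Using it purely for large files, with no pipeline definitions at all, would look something like this (the remote URL and file names are placeholders):

```
dvc init

# start tracking a large file; DVC writes a small .dvc pointer file
# and adds the data itself to a .gitignore next to it
dvc add data/features.parquet
git add data/features.parquet.dvc data/.gitignore
git commit -m "Track features with DVC"

# point at a storage backend and sync the data there
dvc remote add -d storage s3://my-bucket/dvc-cache
dvc push

# collaborators fetch the data matching the committed hashes
dvc pull
```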
Footnotes:
[1] Well, it is exactly that, if you make the storage invisible.