July 19, 2020 03:49 pm GMT

How to version large files with Git

It's no secret that Git doesn't handle large files well:

Indeed. The git architecture simply sucks for big objects. It was discussed somewhat during the early stages, but a lot of it really is pretty fundamental. (Linus Torvalds)

In this short post I'd like to:

  • See what tools are available to handle large files with Git
  • Try one of those - DVC

Have you ever committed a file of a few hundred MB, only to realize that it is now part of the repo and that carving it out and fixing the repo would take quite an effort?

Large file in Git

git clone takes hours and regular operations might take minutes instead of seconds - not the best idea indeed. Still, there are a lot of cases where we want a large file versioned in our repo, from game development to data science, where we need to handle large datasets, videos, etc.
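To see why such a file is hard to get rid of, here is a minimal sketch that uses only plain Git in a throwaway directory. The file name big.bin, the 5 MB size, and the demo identity are assumptions made for the example. It commits a file, deletes it in a second commit, and then checks that the data is still stored and still reachable from history.

```shell
#!/bin/sh
set -eu

# Throwaway repository created by mktemp; nothing here touches your real repos.
repo=$(mktemp -d)
cd "$repo"
git init -q

# Hypothetical identity so the commits work on a machine without Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Commit a 5 MB file of random bytes (random data does not compress well).
head -c 5000000 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big file"

# Remember the blob's object id, then delete the file in a new commit.
blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
git commit -q -m "remove big file"

# The file is gone from the working tree, but the blob is still in the object database.
blob_size=$(git cat-file -s "$blob")
echo "blob $blob still stored, $blob_size bytes"

# It is also still reachable from history, so a clone has to fetch it.
reachable=$(git rev-list --objects --all | grep -c "^$blob" || true)
echo "reachable from history: $reachable"
```

Deleting the file in a later commit does not shrink the repo, which is why removing it means rewriting history.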

So, let's see what open-source and Git-compatible options we have to deal with this:

  • Git-LFS - GitHub and GitLab both support it and can store large files on their servers for you, with some limits

  • Git-annex - a powerful and sophisticated tool, which, to my mind, also makes it hard to learn and manage

  • DVC - Git for Data or Data Version Control - a tool made for ML and data projects, but at its fundamental level it helps to version large files

You can read a (somewhat outdated) overview of the LFS and annex tools here, but this time I want to show you what the workflow looks like with DVC (yes! I'm one of the maintainers).

After DVC is installed, all we need to do is run dvc add and configure the storage where the large files will be kept.

Let's try it right here. First, we need a dummy repo:

$ mkdir example
$ cd example
$ git init
$ dvc init
$ git commit -m "initialize"

Second, generate a large file:

$ head -c1000000 /dev/urandom > large-file # Windows: fsutil file createnew large-file 1048576

The workflow is similar to Git, but instead of git add and git push we run dvc add and dvc push when we want to save a large file:

$ dvc add large-file

Now, let's save it somewhere (we use Google Drive here, but it can be AWS S3, Google Cloud, a local directory, or one of many other storage options):

$ dvc remote add -d mystorage gdrive://root/Storage
$ dvc push

You need to create the Storage directory in the Google Drive UI first, and dvc push will ask you to give it access to your storage. This is safe: the credentials are saved on your local machine in .dvc/tmp/gdrive-user-credentials.json, and no access is given to anyone else.
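If you'd rather try the workflow without Google Drive, a local directory can play the role of the remote. The sketch below assumes DVC is installed and on PATH; the remote name localstorage and the temporary directories are made up for illustration. When the dvc command is missing, it skips the DVC steps instead of failing.

```shell
#!/bin/sh
set -eu

# Hypothetical locations: a scratch repo and a directory acting as remote storage.
repo=$(mktemp -d)
storage=$(mktemp -d)
cd "$repo"

if command -v dvc >/dev/null 2>&1; then
    git init -q
    dvc init

    # Same dummy file as in the post.
    head -c 1000000 /dev/urandom > large-file
    dvc add large-file

    # Register the local directory as the default remote and upload the data.
    dvc remote add -d localstorage "$storage"
    dvc push

    # Count what ended up in the storage directory.
    pushed=$(find "$storage" -type f | wc -l | tr -d ' ')
    echo "files in local remote: $pushed"
else
    echo "dvc not found, skipping the DVC steps"
fi
```

A local remote is handy for experiments, but it only protects the data if the directory lives somewhere other than the repo's own disk.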

Now we can run git commit to save the DVC files instead of the large file itself (you can run git status to see that large-file is no longer tracked by or visible to Git):

$ git add .
$ git status
On branch master
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    modified:   .dvc/config
    new file:   .gitignore
    new file:   large-file.dvc
$ git commit -a -m "add large file"
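The git status output hints at the general idea: Git tracks a small large-file.dvc file and a .gitignore entry, while the data itself stays out of the repo. Below is a rough imitation of that idea in plain shell and Git. It is not DVC's implementation: the .cache directory, the large-file.ptr pointer format, and the use of git hash-object as the checksum are all assumptions made for the example.

```shell
#!/bin/sh
set -eu

# Throwaway repository for the experiment.
repo=$(mktemp -d)
cd "$repo"
git init -q

head -c 1000000 /dev/urandom > large-file

# 1. Content checksum. git hash-object is used only because it ships with Git;
#    a real tool may use a different hash.
sum=$(git hash-object large-file)

# 2. Keep a copy of the data in a cache directory, named after its checksum.
mkdir -p .cache
cp large-file ".cache/$sum"

# 3. Write a small pointer file (made-up format) and hide the data and the cache from Git.
printf 'checksum: %s\npath: large-file\n' "$sum" > large-file.ptr
printf '/large-file\n/.cache\n' > .gitignore

# 4. Git now sees only the pointer file and .gitignore.
git add .
git diff --cached --name-only

# Restoring the data later: read the checksum from the pointer, copy from the cache.
rm large-file
want=$(sed -n 's/^checksum: //p' large-file.ptr)
cp ".cache/$want" large-file
```

Because the pointer file is tiny and changes whenever the checksum changes, Git can version it cheaply while the heavy data lives elsewhere.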

That's it for today. Next time we'll see how it works, what large-file.dvc means, why DVC creates .gitignore, and how we can get our file back!


Original Link: https://dev.to/shcheklein/how-to-version-large-files-with-git-2ij1
