May 23, 2022 11:46 am GMT

Squash commits considered harmful

A recurring conversation in developer circles is whether you should use git merge --squash when merging or create explicit merge commits. The short answer: you shouldn't squash.

People have strong opinions about this. The thing is that my opinion is the correct one. Squashing commits has no purpose other than losing information. It doesn't make for a cleaner history. At most it helps subpar git clients show a cleaner commit graph, and saves a bit of space by not storing intermediate file states.

Let me show you why.

Git tracks contents, not diffs

In many ways you can just see git as a filesystem.
Linus (in 'Re: more git updates..' - MARC)

Git is in many ways a very dumb graph database. When you check in code, it actually stores the content of all the tracked files in your repository.

The content of each file is stored as a "blob" node in the database. The filenames are stored separately in a "tree" node: If you rename a file, no new content node will be created. Only a new tree node will be created.
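To make the content addressing concrete, here is a minimal sketch that computes a blob ID without calling git at all. It relies on the fact that, in a repository using the default SHA-1 object format, a blob's ID is the SHA-1 of the header `blob <size>`, a NUL byte, and the file content. The helper name `blob_id` and the temporary file names are made up for illustration, and `sha1sum` is assumed to be installed (on macOS, `shasum` is the usual substitute).

```shell
#!/bin/sh
# Hypothetical helper: compute the ID git would assign to a file's content.
# A blob ID is SHA-1("blob <size>\0<content>"); the filename is not part of it.
blob_id() {
    size=$(( $(wc -c < "$1") ))   # arithmetic expansion strips wc's padding
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/foo.txt"
printf 'hello\n' > "$tmpdir/renamed.txt"   # same content, different name

id_foo=$(blob_id "$tmpdir/foo.txt")
id_renamed=$(blob_id "$tmpdir/renamed.txt")

echo "$id_foo"
echo "$id_renamed"
rm -rf "$tmpdir"
```

Both lines print the same ID, which is why a rename only needs a new tree node: the name lives in the tree, and the blob is found purely by its content.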

Commits are stored as "commit" nodes. A commit object points to a tree, and adds metadata: author, committer, message and parent commits. A merge commit has multiple parents.

Here is a visualization from Scott Chacon's Git Internals:

[Image: visualization of git's object graph, from Scott Chacon's Git Internals]

Looking at a real git repository

Enough theory, we have work to get done. Let's create a simple git repository:

    > mkdir squash-merges-considered-harmful
    > cd squash-merges-considered-harmful
    > git init
    > echo hello > foo.txt
    > git add foo.txt
    > git commit -m "Initial commit"
    [main (root-commit) 02a154b] Initial commit
     1 file changed, 1 insertion(+)
     create mode 100644 foo.txt
    > echo more >> foo.txt
    > git add foo.txt
    > git commit -m "Add more"
    [main 16660f8] Add more
     1 file changed, 1 insertion(+)

We can now look at the contents of the objects we created:

    # initial commit
    > git cat-file -p 02a154b
    tree f269b7cd59094d5365ef6b5618098cbcbeee0c43
    author Manuel Odendahl <[email protected]> 1653303427 -0400
    committer Manuel Odendahl <[email protected]> 1653303427 -0400

    Initial commit

    # initial tree
    > git cat-file -p f269b7cd59094d5365ef6b5618098cbcbeee0c43
    100644 blob ce013625030ba8dba906f756967f9e9ca394464a    foo.txt

    # initial foo.txt
    > git cat-file -p ce013625030ba8dba906f756967f9e9ca394464a
    hello

    # second commit
    > git cat-file -p 16660f8
    tree 5a0c4a660a13c0ada7611651399abb362756f83e
    parent 02a154bc4f0fa9bca567676d45d136619c076a95
    author Manuel Odendahl <[email protected]> 1653303485 -0400
    committer Manuel Odendahl <[email protected]> 1653303485 -0400

    Add more

    # second tree
    > git cat-file -p 5a0c4a660a13c0ada7611651399abb362756f83e
    100644 blob 2227cddb7f6318ea735a1c4adb52f5cd36c5783c    foo.txt

    > git cat-file -p 2227cddb7f6318ea735a1c4adb52f5cd36c5783c
    hello
    more

Branches and tags (including those on remote repositories) are just pointers to commit nodes.

    > cat .git/refs/heads/main
    16660f8b1d1538ed1b55d8533b3ee7feb68e474c

But we still use diffs and merges

But Manuel, you ask, how do git diff and git merge and all that funky stuff work?

When you run git diff, git actually runs one of several diff algorithms to compare the state of two trees, every time. No diff is stored in the repository.

When you do a rebase, git computes the diff for each commit of the branch before rebase, and then applies those diffs to the destination, thus "moving" the branch over to the destination, with fresh tree and commit nodes.

When you do a merge, git first searches for the common parent of both branches to be merged (this can be a bit more involved depending on your graph). It computes the diff of each branch to that original commit, and then merges both diffs in what is called a three-way merge.

The resulting commit has multiple parent fields. The parent fields don't really mean anything except for informational purposes; the tree the merge commit points to is what actually counts. Once a three-way merge has been computed and applied, git doesn't really care how the resulting tree was computed.
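The inputs of a three-way merge can be inspected directly. The following is a minimal sketch that builds a throwaway repository, asks git for the common ancestor with `git merge-base`, and shows the two diffs that get combined. The branch name `feature`, the file names, and the commit identity are placeholders chosen for illustration, not anything from the repository used in this article.

```shell
#!/bin/sh
# Sketch: inspect the inputs of a three-way merge in a throwaway repository.
# Branch names, file names, and the identity below are illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # force a known branch name
git config user.name "Example"
git config user.email "example@example.invalid"

echo base > a.txt
git add a.txt && git commit -q -m "base"
base_commit=$(git rev-parse HEAD)

git checkout -q -b feature
echo feature > b.txt
git add b.txt && git commit -q -m "feature work"

git checkout -q main
echo main > c.txt
git add c.txt && git commit -q -m "main work"

# The common ancestor git starts from when merging the two branches.
merge_base=$(git merge-base main feature)
echo "merge base: $merge_base"

# The two sides of the three-way merge: what each branch changed since then.
git diff --stat "$merge_base" main
git diff --stat "$merge_base" feature

git merge -q --no-ff -m "Merge feature" feature
# The merge commit records both parents, but its tree is an ordinary snapshot.
parent_count=$(git rev-list --parents -n 1 HEAD | wc -w)
parent_count=$((parent_count - 1))           # first word is the commit itself
echo "parents: $parent_count"
git ls-tree --name-only HEAD
```

The final `git ls-tree` lists all three files: the merged tree is a plain snapshot, and the two parent entries are metadata sitting next to it.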

This is literally all there is to git, and the mental model that I use every day, even as I'm doing the most advanced git surgery.

What is a squash merge?

So what is a squash merge? A squash merge is the same as a normal merge, except that it records only one parent commit: the tip of the branch being merged is left out. That slices off a whole part of the git graph, which will later be garbage collected if nothing references it anymore. You're losing information for no reason.

Let's look at this in practice. Let's create a few commits on top of the ones we have, and then do both a squash merge and a non-squash merge, and look at the results.

    > git checkout -B work-branch
    Switched to a new branch 'work-branch'
    > echo "Add more" >> foo.txt
    > git add foo.txt && git commit -m "Add more"
    [main 4b84cfe] Add more
     1 file changed, 1 insertion(+)
    > echo "Add more" >> foo.txt
    > git add foo.txt && git commit -m "And more"
    [main 1836f1c] And more
     1 file changed, 1 insertion(+)
    > git checkout -B no-squash-merge main
    Switched to a new branch 'no-squash-merge'
    > git merge --no-squash --no-ff work-branch
    Merge made by the 'ort' strategy.
     foo.txt | 2 ++
     1 file changed, 2 insertions(+)
    > git checkout -B squash-merge main
    Switched to a new branch 'squash-merge'
    > git merge --squash --ff work-branch
    Updating 16660f8..1836f1c
    Fast-forward
    Squash commit -- not updating HEAD
     foo.txt | 2 ++
     1 file changed, 2 insertions(+)
    > git commit
    [squash-merge 150c57d] Squashed commit of the following:
     1 file changed, 2 insertions(+)

Let's look at the resulting graph and commits.

    > git log --graph --pretty=oneline --abbrev-commit --all
    * 150c57d (HEAD -> squash-merge) Squashed commit of the following:
    | * 535b740 (no-squash-merge) Merge branch 'work-branch' into no-squash-merge
    |/|
    | * 1836f1c (work-branch) And more
    | * 4b84cfe Add more
    |/
    * 16660f8 (main) Add more
    * 02a154b Initial commit

    > git cat-file -p no-squash-merge
    tree 58c1fb22faa444b264e98a5ae4c4ddb07be09697
    parent 16660f8b1d1538ed1b55d8533b3ee7feb68e474c
    parent 1836f1c53221ae701a038bf5ae380770ea911665
    author Manuel Odendahl <[email protected]> 1653304391 -0400
    committer Manuel Odendahl <[email protected]> 1653304391 -0400

    Merge branch 'work-branch' into no-squash-merge

    * work-branch:
      And more
      Add more

    > git cat-file -p squash-merge
    tree 58c1fb22faa444b264e98a5ae4c4ddb07be09697
    parent 16660f8b1d1538ed1b55d8533b3ee7feb68e474c
    author Manuel Odendahl <[email protected]> 1653304543 -0400
    committer Manuel Odendahl <[email protected]> 1653304543 -0400

    Squashed commit of the following:

    commit 1836f1c53221ae701a038bf5ae380770ea911665
    Author: Manuel Odendahl <[email protected]>
    Date:   Mon May 23 07:11:08 2022 -0400

        And more

    commit 4b84cfe11aa51da994448e602e1bc4cc6083d691
    Author: Manuel Odendahl <[email protected]>
    Date:   Mon May 23 07:11:03 2022 -0400

        Add more

You can see that both `squash-merge` and `no-squash-merge` point to the exact same tree. The only differences are the commit message and the missing parent in the squash merge.

To read more about the underpinnings of git, I can recommend just experimenting with the git command line, and the following resources:

- [Building Git by James Coglan](https://shop.jcoglan.com/building-git/)
- [Git Internals by Scott Chacon](https://github.com/pluralsight/git-internals-pdf)

But the history!

But Manuel, you say, the history is so much cleaner! To which I counter that it is actually not.

If you want to hide the link to the right parent of the non-squash merge (as it is called, the left parent being `main`), all you need to do is to hide it. If you use the command line or a proper tool, use the option to only show first parents. If you only look at the first parent, and configure your git tool to fill in a full log history of the branch into the merge commit message (I personally use the GitHub CLI `gh` or some git commit hooks to do it), the squash merge commit looks identical to the non-squash merge commit.

A favorite `git log` command of mine to quickly look at the history of the main branch, and create a changelog:

    > git log --pretty=format:'# %ad %H %s' --date=short --first-parent --reverse
    # 2022-05-23 02a154bc4f0fa9bca567676d45d136619c076a95 Initial commit
    # 2022-05-23 16660f8b1d1538ed1b55d8533b3ee7feb68e474c Add more
    # 2022-05-23 535b740f42e331175f3766c1374116e329a78f7e Merge branch 'work-branch' into no-squash-merge

When using GitHub and pull requests, this will show author, branch name (which would contain ticket name and short description in my case) and date on a single line. Here's a slightly more complex real-world example (anonymized):

    # 2021-12-15 123 Merge pull request #5937 from garbo/TK-234/feature-1
    # 2021-12-16 234 Merge pull request #5938 from bongo/TK-235/feature-2
    # 2021-12-16 456 Merge pull request #5939 from gingo/TK-236/feature-3
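The claim that a first-parent view of a real merge looks just like a squashed history can be checked mechanically. Here is a minimal sketch that rebuilds a small repository along the lines of the one in this article (the hashes will differ, and the commit identity and messages are placeholders), then compares trees and commit counts with `git rev-parse` and `git rev-list --count`.

```shell
#!/bin/sh
# Sketch: compare the first-parent view of a real merge with a squash merge.
# The identity and commit messages below are illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # force a known branch name
git config user.name "Example"
git config user.email "example@example.invalid"

echo hello > foo.txt
git add foo.txt && git commit -q -m "Initial commit"

git checkout -q -b work-branch
echo one >> foo.txt && git commit -q -am "Add more"
echo two >> foo.txt && git commit -q -am "And more"

# Variant 1: explicit merge commit.
git checkout -q -b no-squash-merge main
git merge -q --no-ff -m "Merge branch 'work-branch'" work-branch

# Variant 2: squash merge.
git checkout -q -b squash-merge main
git merge -q --squash work-branch
git commit -q -m "Squashed work-branch"

# Same snapshot either way.
tree_merge=$(git rev-parse 'no-squash-merge^{tree}')
tree_squash=$(git rev-parse 'squash-merge^{tree}')

# Following first parents only, both histories have the same length...
first_parent_merge=$(git rev-list --count --first-parent no-squash-merge)
first_parent_squash=$(git rev-list --count --first-parent squash-merge)
# ...but only the real merge can still reach the side-branch commits.
all_merge=$(git rev-list --count no-squash-merge)
all_squash=$(git rev-list --count squash-merge)

echo "trees: $tree_merge / $tree_squash"
echo "first-parent commits: $first_parent_merge vs $first_parent_squash"
echo "reachable commits:    $all_merge vs $all_squash"
git log --first-parent --pretty=format:'%h %s' no-squash-merge
```

The first-parent counts match, so the "clean" view is available either way; the difference is that the merge variant keeps the two side-branch commits reachable for whoever needs them later.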

But why?

But Manuel, why keep all those commits lying around when we have all we need in the commit message?

One reason comes down to just preference. I like to see the actual log of what a person did on their branch. Did they do many small commits? On which days (this might make looking up documents or Slack conversations related to the work easier)? Did they merge other branches into their work (useful when resolving merge conflicts and other boo-boos)?

I have done a lot of git cleanup work, and while they are not supposed to exist, big merges with thousands of lines happen, and having a single monolithic commit that contains 80 different changes is a nightmare.

The other reason is where the side history becomes extremely useful. When hunting down a bug, I often use git bisect. I first use git bisect start --first-parent to jump from main commit to main commit. But once I have found which pull request led to the bug, I bisect on the original branch. Instead of having to figure out which line in the pull-request merge might cause the bug, I have a much more granular path. Often, it surfaces a single-line commit, and leads to a painless and immediate bugfix.

As you can drive your bisect with your unit tests, you often have no work to do at all, given sufficiently atomic and small commits on side branches. Losing that capability would seriously impact my sanity when I have to fix bugs.
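As a rough illustration of that two-stage workflow, here is a sketch on a throwaway repository. It assumes git 2.29 or newer (where `git bisect start` accepts `--first-parent`), and it uses a trivial `grep` as a stand-in for a real unit test; the branch name `feature`, the file names, and the commit identity are invented for the example.

```shell
#!/bin/sh
# Sketch: two-stage bisect. Assumes git 2.29+ for `bisect start --first-parent`.
# Branch names, file names, and the identity below are illustrative only.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main        # force a known branch name
git config user.name "Example"
git config user.email "example@example.invalid"

echo ok > status.txt
git add status.txt && git commit -q -m "Initial commit"
good=$(git rev-parse HEAD)

# Side branch with three small commits; the second one introduces the "bug".
git checkout -q -b feature
echo a > a.txt && git add a.txt && git commit -q -m "Add a"
echo bug > status.txt && git commit -q -am "Break status"
culprit=$(git rev-parse HEAD)
echo b > b.txt && git add b.txt && git commit -q -m "Add b"

git checkout -q main
git merge -q --no-ff -m "Merge feature" feature
merge_commit=$(git rev-parse HEAD)
echo later > later.txt && git add later.txt && git commit -q -m "Later work"

# Stand-in for a unit test: exits 0 only when status.txt holds the line "ok".
check='grep -qx ok status.txt'

# Stage 1: jump from main commit to main commit.
git bisect start --first-parent HEAD "$good" >/dev/null
git bisect run sh -c "$check" >/dev/null
stage1=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

# Stage 2: bisect inside the merged branch (second parent vs first parent).
git bisect start "${merge_commit}^2" "${merge_commit}^1" >/dev/null
git bisect run sh -c "$check" >/dev/null
stage2=$(git rev-parse refs/bisect/bad)
git bisect reset >/dev/null 2>&1

echo "stage 1 blames: $(git log -1 --pretty=%s "$stage1")"
echo "stage 2 blames: $(git log -1 --pretty=%s "$stage2")"
```

Stage 1 lands on the merge commit, and stage 2 narrows it down to the one small commit that changed `status.txt`. With a squash merge, the second stage has nothing left to walk through.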

Conclusion

And that is why squashing history is harmful. It's literally just deleting information from the git graph by dropping a single parent entry from the merge commit.


Original Link: https://dev.to/wesen/squash-commits-considered-harmful-ob1
