Re: Git Rename Detection Bug
To
Jeremy Pridmore
Cc
git@vger.kernel.org
Paul Baumgartner
From
Elijah Newren
Date
2023-11-07 08:05:54 UTC
Hi,

On Mon, Nov 6, 2023 at 4:01 AM Jeremy Pridmore <jpridmore@rdt.co.uk> wrote:
>
> Thank you for filling out a Git bug report!
> Please answer the following questions to help us understand your issue.

I think it might be worthwhile to point out a few facts about rename
handling in Git, as background information that might clarify a few
things about how Git's mental model seems to differ from yours:

  * In git, renames are not tracked; they are detected (based on file
similarity of the commits being compared).
  * So, when you run "git mv A B", there is no rename recorded.  It's
basically the same as "git rm A", followed by creating B with the same
contents, followed by "git add B".
  * The detection happens whenever an operation (diff, log -p, merge,
status, etc.) needs or wants to know about renames.
  * In git, directory renames are detected _after_ normal renames, and
via amalgamation of the individual renames.
  * As a corollary of the last item, the only way individual renames
can be affected by directory renames, is if the individual rename on
one side of history was into a directory that the other side of
history renamed away; in such a case, we apply an _extra_ rename to
move it into the new directory.  But we don't "undo" individual
renames to make them fit the majority-determined directory rename or
anything like that.

(Also, if it matters, all of this is true of both `recursive` and
`ort` merge strategies, i.e. the old default merge backend and the new
one.)

> What did you do before the bug happened? (Steps to reproduce your issue)
> I have two GIT repositories (A and B). Both migrated from the same TFS server using git-tfs tool. I migrated code into A and made lots of changes, including moving 50,000+ files from folder "/Landscape" to "/Landscape/src".  B contains the same code but with various other changes made since my original migration from TFS to A.  All the files in B are still in the "/Landscape" folder.  I recently needed to merge my changes from A to B, so I added A as a remote to B and then performed a number of cherry-picks from A to B, but got stuck when trying to cherry-pick the commit containing the results of moving all files into "/Landscape/src".

In case anyone else wants to dig into this, note that this problem
setup precludes directory rename detection being involved.  Directory
rename detection has a rule where if the source directory wasn't
entirely removed on one side, then that directory was not renamed on
that side.  Seems obvious, but the upshot of that rule is that a
directory cannot be renamed into a subdirectory of itself, because by
virtue of being a subdirectory that means its parent directory still
exists.

So, this is a problem where only regular rename detection (i.e. rename
detection of individual files) is going to be at play.

> What did you expect to happen? (Expected behavior)
> I expected the git rename detection to match all files in A "/Landscape" to files in B "/Landscape/src".

Are all files under "/Landscape" from the merge base commit
individually more similar to the counterpart under "/Landscape/src"
than to files under any other directory?  If not, the expectation goes
against how rename detection has worked in git from the beginning.

> What happened instead? (Actual behavior)
> Although many files were matched successfully, git mismatched over two dozen similarly named files, e.g.
>
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IAccountsIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IAccountsIntegration.vb
> Incorrect path match: Landscape/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb -> Landscape/src/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1
> Incorrect path match: Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 -> Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1
> Incorrect name match: Landscape/Documentation/Rdt.Documentation.UI/Properties/licenses.licx -> Landscape/src/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1
> Incorrect path match: Landscape/Documentation/uiDocumentation/licenses.licx -> Landscape/src/Documentation/Rdt.Documentation.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Import/uiImport/My Project/licenses.licx -> Landscape/src/Documentation/uiDocumentation/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/My Project/licenses.licx -> Landscape/src/Import/uiImport/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain/My Project/licenses.licx -> Landscape/src/Main/uiMain.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/LandscapeApiService.Setup/Setup/UIContent/RDT_Logo.ico -> Landscape/src/Main/uiMain.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx -> Landscape/src/Main/uiMain/My Project/licenses.licx
> Incorrect path match: Landscape/Main/uiMain.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Main/uiMain/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Policy/Rdt.Policy.UI/Properties/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rates/uiRates/My Project/licenses.licx -> Landscape/src/Policy/Rdt.Policy.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Claim.UI/Properties/licenses.licx -> Landscape/src/Rates/uiRates/My Project/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Claim.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Templates/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI.Workflow/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Templates/Properties/licenses.licx
> Incorrect path match: Landscape/Rdt.Landscape.UI/Properties/licenses.licx -> Landscape/src/Rdt.Landscape.UI.Workflow/Properties/licenses.licx
> Incorrect path match: Landscape/StandardLetters/uiStandardLetters/My Project/licenses.licx -> Landscape/src/Rdt.Landscape.UI/Properties/licenses.licx
> Incorrect path match: Landscape/Complaints/Rdt.Complaints.UI/Interfaces/IDocumentIntegration.vb -> Landscape/src/Services/uiServices/Complaints/Interfaces/IDocumentIntegration.vb
> Incorrect path match: Landscape/SystemEvents/uiSystemEvents/My Project/licenses.licx -> Landscape/src/StandardLetters/uiStandardLetters/My Project/licenses.licx
> Incorrect path match: Landscape/Services/busServices/RDT_Logo.ico -> Landscape/src/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup.Workflow/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup/Resources/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/RDT_Logo.ico -> Landscape/src/Startup/uiStartup32/RDT_Logo.ico
> Incorrect path match: Landscape/Startup/uiStartup/Resources/newrdlogogradiant48shad.ico -> Landscape/src/Startup/uiStartup32/newrdlogogradiant48shad.ico
> Incorrect path match: Landscape/Templates/uiTemplates.Workflow/My Project/licenses.licx -> Landscape/src/SystemEvents/uiSystemEvents/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/Rdt.Utils.UI/Properties/licenses.licx -> Landscape/src/Templates/uiTemplates.Workflow/My Project/licenses.licx
> Incorrect path match: Landscape/Utils/uiUtils/My Project/licenses.licx -> Landscape/src/Utils/Rdt.Utils.UI/Properties/licenses.licx
> Incorrect name match: Landscape/WebServices/ServiceFabric/Policy/Rdt.Policy.Repository.Service.Fabric.Host/PackageRoot/Data/Swagger/Examples/POST_UKSTasks_Response.json -> Landscape/src/Utils/uiUtils/My Project/licenses.licx
>
>
> What's different between what you expected and what actually happened?
>
> As you can see, although the filenames (and content) are the same,

The content is the same as well?  So, these renames that you label as
incorrect are actually _exact_ renames -- and further, in most cases
they also have an identical basename for the file as well.

Exact renames are detected first, before any other method of rename
detection is employed, and other than giving a preference to files
with the same basename, if there are multiple choices it just picks
one (what I'd call at random, though technically based on what the
internal processing order happens to be; see the "Too many identical
alternatives? Pick one" code comment).

And this, too, is true of both the `recursve` and `ort` backends; no
change has been made to how exact renames are handled.

>  In some cases, it seems that the catalyst has been git thinking that a file from B has been deleted from A, when in fact it has not actually been deleted at all.
>
> For example, the file Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1 has not been deleted in A or B, therefore git should not have attempted to rename Landscape/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 to Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1, especially as it then attempts to rename Landscape/Deployment/PowershellScripts/pre-req/Workflow/pre-req.ps1 to Landscape/src/Deployment/PowershellScripts/pre-req/Rdt.BatchProcessingService.Setup/pre-req.ps1 and so on.

Renamed files, from Git's perspective, always involve files that have
been deleted.

> Git status contains, for example:
>         deleted by them: Landscape/Deployment/PowershellScripts/pre-req/Landscape.Net/pre-req.ps1

This means that it wasn't sufficiently similar to any of the new
files...or that _other_ deleted files were more similar to the new
files and thus that they were paired up instead of this file, leaving
this file to simply be marked as deleted.  (Or that other deleted
files were just as similar; tie-breakers are kinda random in such a
case.)

[...]

> Anything else you want to add:
> I can't help but think that this is related to changes made by Palantir:
> https://blog.palantir.com/optimizing-gits-merge-machinery-1-127ceb0ef2a1

Curious.  What makes you think it's related?

If there is some reason you think it's related, there's an easy way to
check -- just repeat the cherry-pick with the "-s recursive" flag to
use the old merge backend and compare the results.

I'll be somewhat surprised if it's related, though.

> I have tried to unstage these renames using "git restore --staged <file_name>" so I can then apply the correct "git mv" commands

Why?  Just modify all the files to have the correct end results and then commit.

>, but bizzarely, this then results in "git status" reporting a different, smaller set of mismatched names:

As mentioned earlier, git does _not_ record renames.  So, running the
correct "git mv" command doesn't really mean much.  If you use
completely "incorrect" git-mv commands, but then manually tweak files
until they have the correct results, then what's recorded is exactly
the same as if you had used the "correct" git-mv commands.

Further, when you run "git status", it can't access any renames you
did because that information isn't recorded anywhere.  It instead
recomputes renames on the fly.  And it does so each and every time you
run "git status", even if you make no changes between two invocations.
In fact, from this you can probably also deduce that there are other
ways to affect what will be shown as renames, when you have multiple
files similar to any given source file.  In particular, you can cause
a different pairing modifying one of the similar files enough that it
becomes the most similar to the source file, or so that it becomes no
longer the most similar to the source file.  However, what "git
status" reports for renames is irrelevant, since that info won't be
recorded in the commit.  Renames are never recorded.  Anywhere.

In fact, you can even run "git status --no-renames" to just see the
old filenames that were removed and the new ones that were added
without having all the files be paired up as renames.



Hope that helps,
Elijah