Announcement

**chocolate** · 08 December 2019, 10:09 AM

With GCC having such a long history, it's bound to have stressed every part of SVN. Some parts are not properly documented and even the SVN developers have admitted to undefined behavior. ESR is respecting GCC contributors past and present by honoring their hard work without leaving anything to fate. This is called proper software engineering. He has reached out to SVN developers to ask for clarification on various aspects.
What is going on in these forums is just the usual mockery by those whose disbelief and pessimism are surpassed only by their ignorance. I could understand if you had something to say about his Python code (haven't looked at the Go conversion yet) being somewhat sloppy, but no, it's always the same ad hominem attacks. It would be nice of you to at least thank him for his scrupulous investigations so far.

**microcode** · 08 December 2019, 10:30 AM

Originally posted by stevecrox

Anyone else wonder why this has taken so long? Let's say the current tools aren't any good

SVN stores data in either Berkeley DB format or via flat files (FSFS).

Berkeley DB has an SQL interface, so querying it for a sequential order isn't too hard and you'd be able to pull out each "commit" fairly quickly and in sequence. Berkeley DB is a pain, though: reading up on the specs, you'd need to replicate their hosting environment reasonably closely.

The FSFS format is a lot of small files, but those files have a commit record and a delta for each file. This is quicker and nicer because you can read the files line by line, and the only RAM requirement would be streaming the binary for new objects/binary. The difficulty here is making sure you get the sequence correct. I'd be tempted to have a database with file location/time/etc. so I could quickly move through the files. Otherwise you would potentially have to iterate over every file each time to find the next.

SVN is primarily trunk based development but does support branches. So you'd probably want to create a test setup to make sure you recognise when a branch exists.

Parsing multi-TiB datasets like this is standard data engineering and you never throw a giant machine at it. Better to have a dozen small machines than one Epyc.

Git is a Merkle DAG, so you can't distribute or parallelize this work end-to-end. They might be able to if they split the task into preparing diffs/emails and actually committing.
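To make the dependency concrete, here is a minimal Python sketch of why each commit has to wait for its parent. It hashes objects the way Git does (SHA-1 over a `<kind> <size>\0` header plus the body), but the commit body is deliberately simplified: real commits also carry author and committer lines, so the commit IDs below are illustrative only and the helper names are made up for this example.

```python
import hashlib
from typing import Sequence


def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object the way Git does: SHA-1 over '<kind> <size>' + NUL + body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree: str, parents: Sequence[str], message: str) -> str:
    """Hash a simplified commit body (no author/committer lines, by assumption)."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return git_object_id("commit", body.encode())


# Sanity check of the hashing helper against two well-known Git IDs.
EMPTY_BLOB = git_object_id("blob", b"")
EMPTY_TREE = git_object_id("tree", b"")

# Every commit below points at the same (empty) tree, so only the parent varies.
first = commit_id(EMPTY_TREE, [], "r1")
second = commit_id(EMPTY_TREE, [first], "r2")

# Same tree and message as `second`, but the parent was rewritten.
rewritten_first = commit_id(EMPTY_TREE, [], "r1 (fixed)")
second_after_rewrite = commit_id(EMPTY_TREE, [rewritten_first], "r2")

print(second != second_after_rewrite)  # True
```

Because the parent ID is part of the hashed body, the ID of r2 cannot be computed before r1 is final. Blobs and trees have no such link to history, which is why the "prepare first, commit later" split is the part that could plausibly run in parallel.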

Of course, they probably could accomplish this with a large mmapped file and some patience.

**dreamer_** · 08 December 2019, 11:29 AM

Anyone claiming "he should just do this" or "just do that" has probably never had the "pleasure" of converting a huge SVN repo to Git. git-svn is easy and fast and good... if all you have to move is a clean, nice, trunk-development-only, standard-layout SVN repo without crazy merge-info commits, without binary files, without SVN variables, without importing the history of ignore information, without user errors that were later "corrected" by direct database edits, without commits made to SVN tag paths, and with a frozen number of committers... ideally the SVN repo should also be frozen from receiving new commits during the migration. Also, your migration needs to be repeatable, because SVN output is sometimes non-deterministic and you want to catch the situation when your dump to Git is broken because some bit got incorrectly synchronized between two SVN slave repositories 10 years ago and a commit from that time is gibberish, breaking authorship information in the file... And hopefully, no one committed SHA-1-colliding files, breaking only specific paths in the repo.
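One way to check the repeatability point, as a rough Python sketch: run the conversion twice, record which Git commit ID each SVN revision ended up as, and look for the first revision where the two runs disagree. The revision-to-commit maps and the function name here are hypothetical; how you extract such a map depends on the conversion tool.

```python
from typing import Dict, Optional


def first_divergence(run_a: Dict[int, str], run_b: Dict[int, str]) -> Optional[int]:
    """Return the lowest SVN revision whose commit ID differs between two runs.

    Each dict maps an SVN revision number to the commit ID one conversion run
    produced for it. A revision missing from one run counts as a difference.
    Returns None when the runs agree on every revision.
    """
    for rev in sorted(set(run_a) | set(run_b)):
        if run_a.get(rev) != run_b.get(rev):
            return rev
    return None


# Made-up data: the runs agree on r1 and r2 and disagree from r3 onwards.
run_1 = {1: "a1", 2: "b2", 3: "c3", 4: "d4"}
run_2 = {1: "a1", 2: "b2", 3: "ff", 4: "ee"}

print(first_divergence(run_1, run_2))  # 3
```

Since every commit ID depends on its parents, everything after the first divergent revision will differ too, so the first mismatch is the one worth investigating.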

Having worked on such large migrations in the past... SVN is fucking terrible. Beyond the facade of a few basic commands lies an ugly, broken-by-design system. git-svn is great, as it can be easily scripted to deal with many, many problems - but when the codebase is huge (GCC most likely is; another example is Firefox), some things are not easy to solve and require additional migration steps, more tools (yay, BFG!), converters, etc.

**CommunityMember** · 08 December 2019, 11:40 AM

Originally posted by OneTimeShot

Is SVN to GIT really that hard?

The issue is that ESR fell victim to one of the classic blunders! The most famous is never get involved in a land war in Asia, but only slightly less well known is the perfect is the enemy of the good enough.

ESR wanted the history to be absolutely identical, and due to the historical way GCC evolved (some zigs, some zags, some dead ends) the history is extremely convoluted and does not map perfectly onto the git directed acyclic graph. Rather than do what most projects do (damn the torpedoes, full speed ahead), which is a good-enough conversion that lets developers move forward while referring anyone who needs something obscure from the past to the legacy SVN archives, ESR has continued to try to achieve perfection. And since it has never been on any critical development path, most people on the team don't have the time (or willingness) to get into a long discussion that will achieve no consensus about good enough being good enough.

**kreijack** · 08 December 2019, 11:42 AM

Originally posted by stevecrox

Anyone else wonder why this has taken so long? Let's say the current tools aren't any good

[...]

Parsing multi-TiB datasets like this is standard data engineering and you never throw a giant machine at it. Better to have a dozen small machines than one Epyc.

I suspect that the conversion is not as simple as you describe.
From a technical point of view, git tracks content, whereas svn is able to track filesystem changes (i.e. it explicitly handles operations like renames and moves). So I suspect that this is not an O(n) operation but an O(n^2) one.
Moreover, if it had been so simple, someone in the past two years would have done it :-)
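A small Python sketch of the difference, under simplifying assumptions: a snapshot is modelled as a plain dict of path to file content, and a "rename" is a path that disappeared while another path appeared with byte-identical content. Git stores only snapshots and infers renames afterwards by comparing content, whereas SVN records a copy or move explicitly in the revision. The function name and sample paths are made up for illustration.

```python
import hashlib
from typing import Dict, List, Tuple


def exact_renames(before: Dict[str, bytes], after: Dict[str, bytes]) -> List[Tuple[str, str]]:
    """Pair paths that vanished with paths that appeared with identical content."""
    removed = {p: c for p, c in before.items() if p not in after}
    added = {p: c for p, c in after.items() if p not in before}

    # Index removed files by content digest so each added file is one lookup.
    by_digest: Dict[str, List[str]] = {}
    for path in sorted(removed):
        digest = hashlib.sha1(removed[path]).hexdigest()
        by_digest.setdefault(digest, []).append(path)

    renames = []
    for path in sorted(added):
        candidates = by_digest.get(hashlib.sha1(added[path]).hexdigest())
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames


before = {"gcc/old.c": b"int main(void) { return 0; }\n", "README": b"hello\n"}
after = {"gcc/new.c": b"int main(void) { return 0; }\n", "README": b"hello\n"}

print(exact_renames(before, after))  # [('gcc/old.c', 'gcc/new.c')]
```

With a digest index the exact-match case stays roughly linear in the number of changed files. Files that were renamed and edited in the same revision have no matching digest, so they need pairwise similarity comparison between removed and added files, which is where a quadratic cost can creep in.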

**stormcrow** · 08 December 2019, 11:56 AM

Originally posted by kreijack

I suspect that the conversion is not as simple as you describe.
From a technical point of view, git tracks content, whereas svn is able to track filesystem changes (i.e. it explicitly handles operations like renames and moves). So I suspect that this is not an O(n) operation but an O(n^2) one.
Moreover, if it had been so simple, someone in the past two years would have done it :-)

I suspect it's a combination of it being a big task that not many people are interested in as a technical exercise, people not wanting to deal with ESR chiming in every time someone brings it up with his NIH (not invented here) ranting, and probably quite a few people who either don't care about the VCS the project is using or just plain don't like git (which is also understandable).

And on top of that, it really is a big job and it will tie up a computer for however long it takes. Not everyone has multiple high(er)-end systems they can dedicate to a single task, and/or they have a metered/data-capped connection. But it seems to me that at least one person is fed up with it and is putting their know-how and resources where their frustration is, while ESR is scrambling to keep it under his belt and ego.

**reavertm** · 08 December 2019, 02:46 PM

Does anyone know whether the GCC developer community actually cares what ESR thinks at this point?

**discordian** · 08 December 2019, 03:03 PM

Conversion is one thing, but I don't get why they did not convert to git and then do the cleanup there. Finding identical (sub)directories in git is rather easy with the hashes being used, and access is generally faster than with svn.
I wrote some scripts doing a similar thing: first just dump everything mostly 1:1 to git, then do some smarter conversions on top. Potentially redo conversions as problems arise.
You could even start working primarily on git earlier (i.e. on the few "live" branches), as long as people are willing to rebase after some new cleanups.
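To illustrate the "identical (sub)directories" point, here is a rough Python sketch. It does not use Git's real tree-object encoding; it hashes a nested dict in the same spirit (a directory's hash is derived from its entries' names and hashes), so two directories with the same content get the same digest wherever they sit. In a real repository Git already stores an ID per tree, so nothing would need rehashing. The repo layout and function names below are invented for the example.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Union

# A directory is a dict of name -> file content (bytes) or sub-directory (dict).
Tree = Dict[str, Union[bytes, "Tree"]]


def hash_tree(tree: Tree, path: str, seen: Dict[str, List[str]]) -> str:
    """Hash a directory from its entries' names and hashes, recording its path."""
    h = hashlib.sha1()
    for name in sorted(tree):
        entry = tree[name]
        if isinstance(entry, dict):
            child = hash_tree(entry, f"{path}/{name}", seen)
            h.update(f"dir {name} {child}\n".encode())
        else:
            h.update(f"file {name} {hashlib.sha1(entry).hexdigest()}\n".encode())
    digest = h.hexdigest()
    seen[digest].append(path)
    return digest


def identical_dirs(root: Tree) -> List[List[str]]:
    """Return groups of directory paths whose contents hash to the same digest."""
    seen: Dict[str, List[str]] = defaultdict(list)
    hash_tree(root, "", seen)
    return sorted(sorted(paths) for paths in seen.values() if len(paths) > 1)


# Made-up layout: a tag that is an unmodified copy of trunk.
repo = {
    "trunk": {"gcc": {"a.c": b"x\n"}, "README": b"r\n"},
    "tags": {"v1": {"gcc": {"a.c": b"x\n"}, "README": b"r\n"}},
}

print(identical_dirs(repo))
# [['/tags/v1', '/trunk'], ['/tags/v1/gcc', '/trunk/gcc']]
```

A tag directory that hashes the same as a trunk state is a plain copy, which is the kind of thing a cleanup pass after a 1:1 dump could then turn into a proper Git tag.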

**CommunityMember** · 08 December 2019, 03:31 PM

Originally posted by stormcrow

and probably quite a few people that either don't care about the VCS the project is using or just plain don't like git (which is also understandable).

And the gcc developers who prefer git are probably using it, with the git-svn integration, to let the upstream project continue to look like svn while having the ability to use tools that are a bit more modern and familiar and integrated. I have gone the git-svn route for a few past projects until the rest of the community agreed to do a (good enough) conversion. That said, those projects were not the size, complexity, or broken history of gcc, so I think we can understand that large projects without corporate commitment, resources, and decision making (such as what happened with Microsoft and their conversion to git for the Windows team) are not going to happen quickly.

**stevecrox** · 08 December 2019, 03:56 PM

Originally posted by dreamer_

Anyone claiming "he should just do this" or "just do that" has probably never had the "pleasure" of converting a huge SVN repo to Git. git-svn is easy and fast and good... if all you have to move is a clean, nice, trunk-development-only, standard-layout SVN repo without crazy merge-info commits, without binary files, without SVN variables, without importing the history of ignore information, without user errors that were later "corrected" by direct database edits, without commits made to SVN tag paths, and with a frozen number of committers... ideally the SVN repo should also be frozen from receiving new commits during the migration. Also, your migration needs to be repeatable, because SVN output is sometimes non-deterministic and you want to catch the situation when your dump to Git is broken because some bit got incorrectly synchronized between two SVN slave repositories 10 years ago and a commit from that time is gibberish, breaking authorship information in the file... And hopefully, no one committed SHA-1-colliding files, breaking only specific paths in the repo.

Having worked on such large migrations in the past... SVN is fucking terrible. Beyond the facade of a few basic commands lies an ugly, broken-by-design system. git-svn is great, as it can be easily scripted to deal with many, many problems - but when the codebase is huge (GCC most likely is; another example is Firefox), some things are not easy to solve and require additional migration steps, more tools (yay, BFG!), converters, etc.

As someone who has converted multi-GB SVN, Perforce, and other databases to Git (some ten-plus years old with hundreds of developers on them), I'm happy to criticize. I think I did the SVN-to-Git transfers with Apache Ant/ant-contrib and left them to run over weekends.

Heck, to be generous I framed it as a data engineering problem and assumed a week or two to write something to pull that data out of Berkeley DB or FSFS.

This fundamentally isn't a hard problem; the 'big data' section of the industry regularly deals with larger, poor-quality data sets.

I'm not salty; it just reminds me of when a graduate/junior joins the team. You give them a project overview and assign them a non-urgent but really useful task with detailed instructions, and then leave them for a day (to see how good they are at asking for help). When you check in, they've either invented requirements, inverted some, attempted something crazy (writing their own STL) or immediately hit a wall and not done anything. You've now got enough for a technical assessment and can see the big weaknesses to help them with.

It feels like GNU started that and forgot the check-in part, and now it's been two years.

The GCC Git Conversion Heats Up With Hopes Of Converting Over The Holidays
