GCC's Conversion To Git Is Still Turning Out To Be A Massive Headache


  • #21
    Originally posted by perpetually high View Post

    I understand but don't you think he's confided and spoken to domain experts on this before typing out each one of his emails? Come on, don't be naive. I'm sure he's not looking to be difficult here, he does these conversions often; this isn't new to him.
    No, I don't think he's consulted anyone at all. It's not really his style in the first place, and I think he considers himself a domain expert. He's using a tool he wrote (reposurgeon) because he thought the other tools weren't good enough. Whether or not that's true isn't the question; it provides some value. But he's also the guy who wrote it in such a way that it can fail on a repository even on a machine with 64GB of RAM. That's an easy mistake to make when you develop a tool around small use cases, but it's also an easy mistake to recognize when it fails miserably on large SCM repositories. Which is why everyone's been rolling their eyes at him ignoring the elephant in the room.

    Originally posted by perpetually high View Post
    And if it's so easy, you do it.
    Immediately converting 30 years of commit history in perfect form is not particularly valuable in my eyes from a contribution standpoint. It's not an interesting problem to me, and having a massive 30-year git repository could even obstruct some people (in broadband-poor areas) from contributing. It's OK if someone has to go look up some old commits in SVN as long as the newish stuff is on Git, which is part of why I'm disdainful of him overselling the problem. Nor is your argument that I should do it myself particularly compelling, since he gets paid contracts to do this sort of conversion; talking up the difficulty is advertising for him. Besides, he's got his Patreon where he says he needs money to "concentrate on fixing the internet." He's collecting money to tilt at his windmills, so let him do it. But no one says we have to respect him for doing it wrong.

    Originally posted by ESR's Patreon
    Though I'm a techie, I'm in a situation similar to a fine artist because the market has not figured out how to value and reward the work I feel called to do. Unlike most artists, it wouldn't be difficult for me to get a well-paid job - but then I'd have to work on what an employer wants, rather than what the world actually needs.

    Pledge to me so I can keep delivering what the world actually needs.
    I'd rather work on widgets for the next five years than work on the same project as a person who regularly talks like this about his own contributions. Which is how ESR always talks.

    Besides, if it's so hard, why don't you get some actual experience in the problem domain before talking about it with people who do? There are only three ways he could blow through 64GB of RAM while altering sequential commits on a timeline: loading everything at once, wasteful memory usage, and memory leaks. All of those have had solutions for over a decade. It's not our fault ESR's been playing Rip Van Winkle.
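    To make the first of those concrete, here's a minimal, deliberately simplified sketch (not reposurgeon's actual design) of streaming a fast-export-style history one commit at a time, so memory use is bounded by a single commit instead of thirty years of them:
    Code:
    import sys

    def commits(stream):
        # Yield one commit's worth of lines at a time instead of holding
        # the whole history in memory (the "loading everything at once" trap).
        commit = []
        for line in stream:                        # iterates lazily, line by line
            if line.startswith("commit ") and commit:
                yield commit                       # hand off the finished commit
                commit = []
            commit.append(line)
        if commit:
            yield commit                           # the last commit in the stream

    if __name__ == "__main__":
        # e.g.  git fast-export --all | python stream_commits.py
        for c in commits(sys.stdin):
            pass  # rewrite or inspect one commit, then let it be freed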



    • #22
      maybe reach out to folks at Intel and Cloudflare - they have smart folks with a track record of improving the performance of open source tools with their forked versions



      • #23
        Originally posted by eva2000 View Post
        maybe reach out to folks at Intel and Cloudflare - they have smart folks with a track record of improving the performance of open source tools with their forked versions
        I hear Microsoft are fairly good at managing huge repository migrations to git? I'm sure they'd love to help.
        (They might actually say yes and be useful if asked ... but I can't imagine that happening.)



        • #24
          Originally posted by perpetually high View Post

          This. I was really shocked to read the last thread on this and see all the armchair experts telling the gurus how it's done. Found that quite amusing.

          Having said that, I enjoyed reading the thread because as a non-guru (hehe), you learn a lot by people's suggestions. So I think tone matters, but still a good discussion.
           Well, is it the guru way to use a low-performance, low-concurrency (GIL), high-memory-footprint language for a job that needs over 64 or 128 gigabytes of RAM, while restricting processing to a single core? That was the first oddity in this story.

           The second was his miraculous super desktop computer. Keep in mind that many commenters here actually work on computational problems and deal with bigger machines than a single low-end desktop. Even my personal computers at home have 32 and 64 gigabytes of RAM, 8 cores, and a GeForce GTX 1080.

           Third, the algorithm design. People have also asked why he didn't do it incrementally. If you use a repository in the normal way, every VCS has an immutable history and a mutable 'head', and supports random access to the history. So it should be entirely possible to provide concurrent read-only access to the history while reserving exclusive read-write access to the 'head' for the main repository process(es).
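           A minimal sketch of that incremental idea (convert_revision and the state file are hypothetical placeholders, not any existing tool's API): remember the last revision already converted, and on each run only walk the new part of the otherwise immutable history.
           Code:
           import json, os

           STATE = "conversion-state.json"   # hypothetical resume marker

           def load_last_rev():
               if os.path.exists(STATE):
                   with open(STATE) as f:
                       return json.load(f)["last_rev"]
               return 0

           def save_last_rev(rev):
               with open(STATE, "w") as f:
                   json.dump({"last_rev": rev}, f)

           def convert_revision(rev):
               # hypothetical placeholder: translate one SVN revision into a git commit
               print("converting r%d" % rev)

           def incremental_convert(head_rev):
               last = load_last_rev()
               for rev in range(last + 1, head_rev + 1):   # only the new revisions
                   convert_revision(rev)
                   save_last_rev(rev)                      # safe to resume after a crash

           if __name__ == "__main__":
               incremental_convert(head_rev=100)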

           I'd really prefer it if people discussed this at the level of technical detail, rather than comparing self-proclaimed titles and the heights of the ivory towers they're standing on.



          • #25
             Perhaps, if they have a lot of outstanding branches based on versions of files older or newer than the mainline, and they don't want to fold those into the mainline yet, simply dumping the latest GCC tar.gz into git is not the best option for them. You could start a fresh mainline branch in git and make everyone with an SVN branch migrate it over, but that still means moving many, many branches that are older than or conflict with the mainline, since branches are often made from branches which are made from branches, and people may branch from branches older than the current mainline. So it is most practical just to convert the entire history from svn to git.



            • #26
              Originally posted by torsionbar28 View Post
              I still don't understand why this is a performance problem. Is he trying to do some kind of live conversion without taking the repository offline? I.e. still accepting new commits while the conversion is in process? We can brute force RC5-64 searching 15,769,938,165,961,326,592 keys for a solution, but we don't have the computational power to migrate some code from one server to another?
               For RAM it comes down to how much state you have to hold in memory at once. You can put state on disk, but that slows things down. If you are doing something where the program needs access to huge amounts of state at once, it's not possible to just process the data in small chunks.

               A big repo can have lots and lots of branches, many of them conflicting with the mainline because they branched off older versions, but which people may still want to merge into the mainline eventually, just not now. Since they are moving to git, they want to be able to use git tools to do that, so they have to bring over a huge number of branches, and one branch can be off another branch, off another branch, and so on. Hence the need to move the history.



              • #27
                 I have yet to see any kind of actual answer to the question of *why* this can't be done incrementally. A VCS seems like an ideal environment for parallelization and incremental changes.

                Why can't they just clone the tree, convert what they have offline, then take the recent commits and convert those over afterwards?

                 If it's a matter of matching emails and names to modernize the attributions as they convert, why isn't that cached to a database on the filesystem as it works, so that the existing data can be combined with the newest commits afterwards?
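                 On the author-mapping point specifically, a minimal sketch of the on-disk cache idea (the table layout and file name are made up for illustration): persist each resolved svn username so a later incremental pass can reuse it instead of redoing the matching.
                 Code:
                 import sqlite3

                 def open_author_cache(path="authors.sqlite"):
                     db = sqlite3.connect(path)
                     db.execute("""CREATE TABLE IF NOT EXISTS authors
                                   (svn_user TEXT PRIMARY KEY, name TEXT, email TEXT)""")
                     return db

                 def lookup(db, svn_user):
                     row = db.execute("SELECT name, email FROM authors WHERE svn_user = ?",
                                      (svn_user,)).fetchone()
                     return row  # None means this committer still has to be resolved by hand

                 def remember(db, svn_user, name, email):
                     db.execute("INSERT OR REPLACE INTO authors VALUES (?, ?, ?)",
                                (svn_user, name, email))
                     db.commit()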

                 This is not me telling anyone how to do their job; these are serious questions. I don't proclaim to be an armchair expert or anything of the kind, just a lowly dev, but even to me this seems ludicrous. I get that Python is not the most efficient of languages, but I can't comprehend how this problem can possibly be that intractable.



                • #28
                  Originally posted by Sniperfox47 View Post
                   I have yet to see any kind of actual answer to the question of *why* this can't be done incrementally. A VCS seems like an ideal environment for parallelization and incremental changes.

                   Why can't they just clone the tree, convert what they have offline, then take the recent commits and convert those over afterwards?

                   If it's a matter of matching emails and names to modernize the attributions as they convert, why isn't that cached to a database on the filesystem as it works, so that the existing data can be combined with the newest commits afterwards?

                   This is not me telling anyone how to do their job; these are serious questions. I don't proclaim to be an armchair expert or anything of the kind, just a lowly dev, but even to me this seems ludicrous. I get that Python is not the most efficient of languages, but I can't comprehend how this problem can possibly be that intractable.
                  I have done a few svn -> git migrations myself in my company.

                   Usually it works like you say (a rough sketch follows the list):
                   1) dump the svn tree,
                   2) convert it offline (it may take a few hours),
                   3) pull in the recent commits right before the switch.
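                   For what it's worth, a rough sketch of steps 2 and 3 using stock git svn (the URL, authors file and directory name are placeholders, and this is the plain-vanilla route, not what reposurgeon does):
                   Code:
                   import subprocess

                   SVN_URL = "https://svn.example.org/repos/project"   # placeholder

                   # Step 2: the long offline conversion (can take hours on a big history).
                   subprocess.run(["git", "svn", "clone",
                                   "--stdlayout",                   # trunk/branches/tags layout
                                   "--authors-file=authors.txt",    # svn user -> "Name <email>" map
                                   SVN_URL, "project-git"], check=True)

                   # Step 3: right before the switch, pull in only the commits made since.
                   subprocess.run(["git", "svn", "fetch"], cwd="project-git", check=True)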

                   Matching emails is a bit of a manual process, but it's just done once before the conversion, so I don't think it's much of a problem.

                   The main issue is that the subversion-to-git mapping is not clear cut in many cases. There are things the tool will have to guess, because svn does not record enough information.

                   Subversion does not really have branches. If you want to make a branch, you just copy your project directory into another directory inside the branches folder and work on the copy. It is just a convention: if it is copied inside the branches folder, it is considered a branch. This is fine if everyone follows the convention; during the migration, everything inside the branches folder can be treated as a branch. If someone copies stuff around in unconventional ways, the migration tool has to guess whether it is a branch, a tag or just a copy.

                   Subversion does not have tags either; a tag is just another copy, this time inside the tags folder. Again, it is just convention. It is perfectly valid to make a tag in subversion and then add a couple of commits to the tag to fix a couple of things. This makes no sense at all in git: you cannot add commits to a tag, so either the tag has to be mapped as a branch (and the tag moved), or the extra commits need to be ignored. Those commits come _after_ the tag is created, so you can't even tell whether a tag will be modified until you have looked through the entire history.
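                   The kind of guess involved boils down to something like this hypothetical helper (classify_tag and its inputs are made up purely for illustration): if anything was committed under a tag directory after the copy that created it, the converter can't keep it as a plain git tag.
                   Code:
                   def classify_tag(tag_path, history):
                       # history: ordered list of (revision, changed_paths) tuples
                       seen_copy = False
                       for rev, paths in history:
                           touched = any(p.startswith(tag_path) for p in paths)
                           if touched and not seen_copy:
                               seen_copy = True        # the copy that created the tag
                           elif touched and seen_copy:
                               return "branch"         # later commits landed on the "tag"
                       return "tag"

                   if __name__ == "__main__":
                       hist = [(100, ["tags/v1.0/"]),
                               (101, ["trunk/foo.c"]),
                               (102, ["tags/v1.0/bar.c"])]
                       print(classify_tag("tags/v1.0", hist))   # -> branch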

                   Another issue is merging branches. Old subversion did not track merges at all. When you do a merge, subversion creates a patch containing all the changes in the other branch and applies it in the current branch. Old subversion did not keep track of which branch the merge came from; all you saw was that a change was applied. Subversion merges (especially old subversion merges, and merges migrated from CVS) cannot be properly mapped to git merges.


                   Usually the projects I have worked on don't care that much: as long as the HEAD is consistent, the history can be a little messed up. Bisection is particularly important for a tool like gcc, though, since it is very useful for finding regressions. Having a good commit history matters, but I think they will find reasonable compromises.

                   I have no clue what kind of issues ESR is running into. However, I can assure you that svn -> git migration is not always as simple as it sounds, and I can imagine a 30-year-old svn history migrated from CVS being particularly challenging.
                  Last edited by paulpach; 31 July 2018, 05:02 PM.



                  • #29
                     One thing I cannot deny is that having a Python app eat up 64GB of RAM means you're doing it wrong. He could throw entire SSDs' worth of swap at it if he wanted, but it seems like he's failing at some core memory-management logic, and saying this sort of thing can't be parallelized is a joke... I could easily see this taking a week, or with a pinch of salt two to three, but more than that? This is not a geometric or exponential-difficulty sort of problem...

                    I think I want to take a look at this surgeon tool he's made...

                    Edit;
                    Looking at https://gitlab.com/esr/reposurgeon

                    He's forcing pypy... oh boy, down the rabbit hole we go...

                    First, pypy has some massive-slow-down issues around specific things... accumulation_tree, .split, .extend, .mean, datetime, and more...
                    https://bitbucket.org/pypy/pypy/issu...us=open&q=slow

                    I can't get a blame going in gitlab to figure out which lines are newest easily... so I'll check out the repo and do it locally.

                    I see no attempt at multiprocessing or multithreading... repotool literally issues;
                    cd %(project)s-git; time git -c pack.threads=1 repack -AdF --window=1250 --depth=250

                     Reposurgeon is a 14K-line Python file. Seriously difficult to navigate; no wonder gitlab can't run blame on it.

                     I also see no significant care being taken over memory usage; in fact, I see the opposite...

                    main on line 13979 starts as such;
                    Code:
                    def main():
                        # Increase max stack size from 8MB to 512MB
                        # Needed to handle really large repositories.
                        try:
                            sys.setrecursionlimit(10**6)
                            import resource
                            resource.setrlimit(resource.RLIMIT_STACK, (2**29,-1))
                        except ImportError:
                            # Don't fail to start if 'resource' isn't available
                            pass
                        except ValueError:
                            # May not be allowed on some systems.  Whether or not we can do it
                            # isn't interesting, it only matters whether the limit is actually
                            # blown.
                            pass
                    Alright, just gonna come out and say it, I'm of the professional opinion at this time that this guy's a nutter.

                     Needing a huge recursion limit and a giant stack means you're not managing memory correctly, and of course it means you've got too much recursion. So, you know... stop making so many recursive calls; turn the recursive calls into loops.
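                     For the curious, the standard trick looks something like this (a generic sketch, not a patch to reposurgeon): give the traversal its own explicit stack so the Python call stack, setrecursionlimit and RLIMIT_STACK never enter into it.
                     Code:
                     def walk_iterative(root, children):
                         # Depth-first walk with an explicit stack instead of recursion.
                         # `children(node)` is a caller-supplied callback returning child nodes.
                         stack = [root]
                         while stack:
                             node = stack.pop()
                             yield node
                             stack.extend(children(node))

                     if __name__ == "__main__":
                         tree = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
                         print(list(walk_iterative("a", lambda n: tree.get(n, []))))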

                    It'd be nice if this was split out into class files or such.

                     If I were going to have to use this code as a basis for such a large task, I'd start by running it through Nuitka (https://gitlab.com/kayhayen/Nuitka) to convert it to C, maybe convert some of it to C++, fix some squiggly bits, part it out and organize it into a bunch of separate files, then go inline-marking, tail-recursion hunting, and IPO/LTO on it. Might even try Intel's auto-parallelizer.

                    Not gonna waste more time unless someone wants me to. Going to get some coffee in me now.

                    Edit; Came back the next day and cleaned it up a tiny bit.
                    Last edited by TylerY86; 02 August 2018, 10:15 AM.



                    • #30
                      Originally posted by TylerY86 View Post
                       Reposurgeon is a 14K-line Python file.
                       WTF????? Maybe he should have stuck with C... I have never heard of someone screwing up a Python project so badly.

