
GCC's Conversion To Git Is Still Turning Out To Be A Massive Headache


  • BeardedGNUFreak
    replied
    When you absolutely, positively want to fuck your project's infrastructure...Python.

    What is amazing about the clusterfuck that is Python is that years ago, when the same type of dimwits were using Perl, the problems weren't even close to being this bad.

    One has to wonder if the idiotic 'hey, let's make whitespace syntax!' decision wasn't just mind-boggling incompetence but some sort of masterful misdirection to distract from what a complete piece of garbage the actual language and ecosystem are.



  • paulpach
    replied
    Originally posted by TylerY86 View Post
    Reposurgeon is a 14K line python file. Seriously difficult to navigate, no wonder gitlab can't run blame on it.
    Holy crap! You are right. It is massive and impossible to navigate.
    I am reading the code and there is no modularization whatsoever. There is a RepoSurgeon class with about 5000 LOC.
    I found 11 levels of nested ifs/fors. WTF?
    There are methods with about 200 LOC.
    There is a lot of NIH code such as a Date class.

    In my professional opinion, you are right, this guy is not good at coding.

    Go is not going to miraculously fix this messy code. It needs some serious refactoring.



  • fuzz
    replied
    Originally posted by TylerY86 View Post
    Reposurgeon is a 14K line python file.
    WTF????? Maybe he should have stuck with C... I have never heard of someone screwing up a python project so badly.



  • TylerY86
    replied
    One thing I cannot deny is that having a Python app eat up 64GB of RAM means you're doing it wrong. He could throw entire SSDs' worth of swap at it if he wanted, but it seems like he's failing at some core memory-management logic, and saying this sort of thing can't be parallelized is a joke... I could easily see this taking a week, or with a bit of salt two to three, but more than that? This is not a geometric or exponential difficulty sort of problem...

    I think I want to take a look at this surgeon tool he's made...

    Edit:
    Looking at https://gitlab.com/esr/reposurgeon

    He's forcing pypy... oh boy, down the rabbit hole we go...

    First, pypy has some massive slow-down issues around specific things... accumulation_tree, .split, .extend, .mean, datetime, and more...
    https://bitbucket.org/pypy/pypy/issu...us=open&q=slow

    I can't get a blame going in gitlab to figure out which lines are newest easily... so I'll check out the repo and do it locally.

    I see no attempt at multiprocessing or multithreading... repotool literally issues:
    Code:
    cd %(project)s-git; time git -c pack.threads=1 repack -AdF --window=1250 --depth=250

    Reposurgeon is a 14K line python file. Seriously difficult to navigate, no wonder gitlab can't run blame on it.

    I also see no significant care for memory usage being taken, in fact I see the opposite...

    main on line 13979 starts as such:
    Code:
    def main():
        # Increase max stack size from 8MB to 512MB
        # Needed to handle really large repositories.
        try:
            sys.setrecursionlimit(10**6)
            import resource
            resource.setrlimit(resource.RLIMIT_STACK, (2**29,-1))
        except ImportError:
            # Don't fail to start if 'resource' isn't available
            pass
        except ValueError:
            # May not be allowed on some systems.  Whether or not we can do it
            # isn't interesting, it only matters whether the limit is actually
            # blown.
            pass
    Alright, just gonna come out and say it, I'm of the professional opinion at this time that this guy's a nutter.

    Needing a huge recursion limit and a giant stack means you're not managing memory correctly, and of course you've got too much recursion. So, you know... stop making so many recursive calls; change the recursive calls into loops.
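
    As a rough illustration of what I mean (hypothetical code, nothing taken from reposurgeon), here's the usual trick: a deep walk over nested data rewritten with an explicit stack, so the depth no longer depends on sys.setrecursionlimit:
    Code:
    # Hypothetical example, NOT reposurgeon code: walk a deeply nested
    # structure without recursion by keeping an explicit stack.
    def walk_recursive(node, visit):
        visit(node)
        for child in node["children"]:
            walk_recursive(child, visit)    # blows the stack on deep trees

    def walk_iterative(root, visit):
        stack = [root]
        while stack:
            node = stack.pop()
            visit(node)
            # reversed() keeps the original left-to-right visit order
            stack.extend(reversed(node["children"]))

    # A chain 100,000 levels deep: the recursive version would need
    # sys.setrecursionlimit() cranked way up, the iterative one just works.
    deep = {"name": 0, "children": []}
    node = deep
    for i in range(1, 100000):
        child = {"name": i, "children": []}
        node["children"].append(child)
        node = child

    visited = []
    walk_iterative(deep, visited.append)
    print(len(visited))    # 100000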

    It'd be nice if this was split out into class files or such.

    If I were going to have to use this code as a basis for such a large task, I'd start by running it through Nuitka (https://gitlab.com/kayhayen/Nuitka) to convert it to C, maybe convert some of it to C++, fix some squiggly bits, part it out and organize it into a bunch of separate files, then go inline-marking, tail-recursion hunting and IPO/LTO on it. Might even try Intel's auto-parallelizer.

    Not gonna waste more time unless someone wants me to. Going to get some coffee in me now.

    Edit: Came back the next day and cleaned it up a tiny bit.
    Last edited by TylerY86; 02 August 2018, 10:15 AM.



  • paulpach
    replied
    Originally posted by Sniperfox47 View Post
    I have yet to see any kind of actual answer to this question but *why* can't this be done incrementally? A VCS seems to be an ideal environment for parallelization and incremental changes.

    Why can't they just clone the tree, convert what they have offline, then take the recent commits and convert those over afterwards?

    If it's a matter of matching emails and names to modernize them as they convert them for attributions, why isn't that cached to a database on filesystem as it works so that the existing data can be combined with the newest commits afterwards?

    This is not me telling anyone how to do their job, these are serious questions. I don't proclaim to be an armchair expert or anything of the kind, just a lowly dev, but even to me this seems ludicrous. I get that Python is not the most efficient of languages, but I can't comprehend how this problem can possibly be that intractable.
    I have done a few svn -> git migrations myself in my company.

    Usually it works like you say:
    1) dump the svn tree,
    2) convert it offline (it may take a few hours),
    3) pull in the recent commits right before the switch (a rough sketch of that incremental bookkeeping is below).
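
    For what it's worth, a minimal sketch of step 3's bookkeeping, assuming a hypothetical convert_revision() helper and a small state file that records the last svn revision already converted, so a later run only touches the new commits:
    Code:
    # Hypothetical sketch, not how any real conversion tool works:
    # persist the last converted svn revision so a follow-up run is
    # incremental instead of starting over.
    import json
    import os

    STATE_FILE = "conversion-state.json"

    def load_last_converted():
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)["last_revision"]
        return 0

    def save_last_converted(rev):
        with open(STATE_FILE, "w") as f:
            json.dump({"last_revision": rev}, f)

    def convert_revision(rev):
        # Placeholder for the real work (read the svn revision, write
        # the matching git commit).
        print("converting r%d" % rev)

    def incremental_convert(head_revision):
        for rev in range(load_last_converted() + 1, head_revision + 1):
            convert_revision(rev)
            save_last_converted(rev)

    incremental_convert(1000)   # first, long offline pass
    incremental_convert(1010)   # right before the switch: only r1001-r1010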

    Matching emails is a bit of a manual process, but it is just done once before the conversion, so I don't think this is much of a problem.
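
    The mapping itself is typically just a flat text file in the style of git-svn's authors file, one "svnuser = Name <email>" line per committer. A minimal sketch of loading and applying one:
    Code:
    # Minimal sketch: load a git-svn style authors file and map svn
    # usernames to git author strings.
    def load_authors(path):
        authors = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                svn_user, git_author = line.split("=", 1)
                authors[svn_user.strip()] = git_author.strip()
        return authors

    def git_author_for(svn_user, authors):
        # Unknown committers get an obvious placeholder instead of
        # silently losing attribution.
        return authors.get(svn_user, "%s <%s@unknown.invalid>" % (svn_user, svn_user))

    # Example authors file contents:
    #   jdoe   = Jane Doe <jdoe@example.com>
    #   asmith = Alan Smith <asmith@example.com>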

    The main issue is that the subversion-to-git mapping is not clear-cut in many cases. There are things that the tool will have to guess, because svn does not record enough information.

    Subversion does not really have branches. If you want to make a branch, you just copy your project directory into another directory inside the branches folder and work on the copy. It is just a convention: if it is copied inside the branches folder it is considered a branch. This is fine if everyone follows the convention. During the migration everything inside the branches folder can be considered a branch. If someone copies stuff around in unconventional ways, the migration tool will have to guess if it is a branch, a tag or just a copy.

    Subversion does not have tags either; a tag is just another copy, inside the tags folder. Again, it is just convention. It is perfectly valid to make a tag in Subversion and then add a couple of commits to the tag to fix a couple of things. This makes no sense at all in git; you cannot add commits to a tag, so either the tag has to be mapped as a branch and the tag moved, or the extra commits need to be ignored. Those commits come _after_ the tag is created, so you can't even determine whether a tag will be modified until you have looked through the entire history.
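
    To make the guesswork concrete, here is a toy version of the path-convention heuristic a conversion tool ends up applying (purely illustrative, not reposurgeon's logic):
    Code:
    # Toy heuristic, purely illustrative: classify an svn copy by the
    # conventional trunk/branches/tags layout of its destination path.
    def classify_copy(dest_path):
        parts = dest_path.strip("/").split("/")
        if "branches" in parts:
            return "branch"
        if "tags" in parts:
            return "tag"            # ...unless commits land on it later
        if parts and parts[-1] == "trunk":
            return "trunk"
        return "plain copy"         # unconventional: the tool has to guess

    def reclassify_tag(tag_path, later_commit_paths):
        # A "tag" that receives commits afterwards cannot stay a git tag;
        # it has to become a branch (or the extra commits get dropped).
        prefix = tag_path.rstrip("/") + "/"
        touched = any(p == tag_path or p.startswith(prefix)
                      for p in later_commit_paths)
        return "branch" if touched else "tag"

    print(classify_copy("/project/branches/feature-x"))         # branch
    print(classify_copy("/project/tags/release-1.0"))           # tag
    print(classify_copy("/project/sandbox/copy-of-trunk"))      # plain copy
    print(reclassify_tag("/project/tags/release-1.0",
                         ["/project/tags/release-1.0/fix.c"]))  # branch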

    Another issue is merging branches. Old Subversion did not track merges at all. When you do a merge, Subversion creates a patch containing all the changes from the other branch and applies it on the current branch. Old Subversion did not keep track of which branch the merge came from; all you saw was that a change was applied. Subversion merges (especially old Subversion merges, and CVS-migrated merges) cannot be properly mapped to git merges.


    Usually the projects I have worked on don't care that much: as long as the HEAD is consistent, the history being a little messed up is acceptable. Bisection is particularly important for tools like gcc; it is very useful for finding regressions. Having a good commit history matters, but I think they will find reasonable compromises.

    I have no clue what kind of issues ESR is running into. However, I can assure you an svn -> git migration is not always as simple as it sounds. I can imagine a 30-year-old svn history, itself migrated from CVS, being particularly challenging.
    Last edited by paulpach; 31 July 2018, 05:02 PM.



  • Sniperfox47
    replied
    I have yet to see any kind of actual answer to this question but *why* can't this be done incrementally? A VCS seems to be an ideal environment for parallelization and incremental changes.

    Why can't they just clone the tree, convert what they have offline, then take the recent commits and convert those over afterwards?

    If it's a matter of matching emails and names to modernize them as they convert them for attributions, why isn't that cached to a database on filesystem as it works so that the existing data can be combined with the newest commits afterwards?

    This is not me telling anyone how to do their job, these are serious questions. I don't proclaim to be an armchair expert or anything of the kind, just a lowly dev, but even to me this seems ludicrous. I get that Python is not the most efficient of languages, but I can't comprehend how this problem can possibly be that intractable.



  • jpg44
    replied
    Originally posted by torsionbar28 View Post
    I still don't understand why this is a performance problem. Is he trying to do some kind of live conversion without taking the repository offline? I.e. still accepting new commits while the conversion is in process? We can brute force RC5-64 searching 15,769,938,165,961,326,592 keys for a solution, but we don't have the computational power to migrate some code from one server to another?
    For RAM, it comes down to how much state you have to have in memory at once. You can put state on disk, but this slows things down. If you are doing something where the program needs access to huge amounts of state at once, it's not possible to just process the data in small chunks.

    A big repo can have lots and lots of branches, many of them based on older versions than mainline and conflicting with it, but which they may want to merge into mainline eventually, just not now. Since they are going to git, they want to be able to use git tools to do that, so they have to bring over a huge number of branches; one branch can be off another branch, which is off another branch, and so on. Hence the need to move the history.



  • jpg44
    replied
    Perhaps, if they have a lot of outstanding branches based on versions of files older or newer than the mainline, and they do not want to fold these into the mainline yet, simply dumping the latest GCC tar.gz into git may not be the best option for them. You could start a new mainline branch in git and have people with branches in SVN migrate them over themselves, but that can involve moving many, many branches anyway, ones that are older than or conflicting with mainline, since branches are often made from branches, which are made from branches, and so on. They may branch from branches older than the current mainline. So it is most ideal just to convert all of the history from svn to git.



  • caligula
    replied
    Originally posted by perpetually high View Post

    This. I was really shocked to read the last thread on this and see all the armchair experts telling the gurus how it's done. Found that quite amusing.

    Having said that, I enjoyed reading the thread because as a non-guru (hehe), you learn a lot by people's suggestions. So I think tone matters, but still a good discussion.
    Well, is it the guru way to use a low-performance, low-concurrency (GIL), high-memory-footprint language for a job that requires over 64 or 128 gigabytes of RAM, while restricting processing to a single core? This was the first oddity in this story.

    The second was his miraculous super desktop computer. Keep in mind that many commenters here actually work with computational problems and deal with bigger computers than a single low-end desktop. Even my personal computers at home have 32 and 64 gigabytes of RAM, 8 cores, and a GeForce GTX 1080.

    Third, the algorithm design. People have also asked why he didn't do it incrementally. If you use the repositories in a normal way, every VCS has an immutable history and a mutable 'head', and supports accessing the history in a random-access fashion. So it should be totally possible to provide concurrent read-only access to the history while giving the main repository process(es) exclusive read-write access to the 'head'.
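
    A toy sketch of that split, with hypothetical names and a trivial workload that has nothing to do with the real conversion: many read-only workers over the frozen history, one process owning the mutable head:
    Code:
    # Toy illustration only: the frozen history is read by a pool of
    # workers in parallel, while a single process keeps exclusive
    # write access to the mutable head.
    from multiprocessing import Pool

    HISTORY = {rev: "commit %d" % rev for rev in range(1, 10001)}  # immutable

    def analyze(rev):
        # Read-only work against the immutable history.
        return rev, len(HISTORY[rev])

    def update_head(head, new_commit):
        # Only the owning process ever calls this.
        head.append(new_commit)
        return head

    if __name__ == "__main__":
        head = ["commit 10000"]                   # mutable tip, single writer
        with Pool(processes=4) as pool:
            results = pool.map(analyze, sorted(HISTORY))
        head = update_head(head, "commit 10001")
        print(len(results), head[-1])             # 10000 commit 10001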

    I'd really prefer if people discussed on the level of technical details, not comparing the self-proclaimed titles and the height of the ivory towers they're standing on.



  • dnebdal
    replied
    Originally posted by eva2000 View Post
    maybe reach out to folks at Intel and Cloudflare - they have smart people with a track record of improving the performance of open source tools with their forked versions
    I hear Microsoft are fairly good at managing huge repository migrations to git? I'm sure they'd love to help.
    (They might actually say yes and be useful if asked ... but I can't imagine that happening.)

