Tesseract 5.0 Released For This Leading Open-Source OCR Engine


    Phoronix: Tesseract 5.0 Released For This Leading Open-Source OCR Engine

    The long-awaited Tesseract 5.0 is now available as a major update to this leading open-source optical character recognition (OCR) engine, which uses neural networks to deliver high accuracy and supports more than 100 languages for turning images of text into actual text...


  • #2
    But in 2018 they stopped contributing as much to the effort, which seems to be partly why Tesseract 5.0 took so long to materialize. Much of Tesseract's recent activity has been by Stefan Weil of UB Mannheim.
    It's turning into another undermanned project.



    • #3
      I've used older versions of Tesseract, and while it wasn't perfect, it was about the only open-source solution that could process the several thousand docs I needed OCR'd in an automated manner. Even most consumer-level commercial solutions wouldn't have gotten the job done because they aren't set up for batch processing.

      Now that they are using an LSTM, I'd be interested to see bfloat16 support that slots into AVX-512 or AMX instructions to really boost the performance too.
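
      For anyone curious what such an automated batch run can look like, here is a minimal sketch assuming Tesseract 5.x plus the pytesseract wrapper and Pillow are installed, and that the scanned pages sit in a hypothetical scans/ directory:

      # Minimal batch-OCR sketch: run Tesseract's LSTM engine over a folder of
      # scanned pages and write one plain-text file per page.
      from pathlib import Path

      import pytesseract          # thin Python wrapper around the tesseract CLI
      from PIL import Image       # Pillow, used to load the scanned images

      SCAN_DIR = Path("scans")    # hypothetical input directory of page images
      OUT_DIR = Path("ocr_text")  # hypothetical output directory
      OUT_DIR.mkdir(exist_ok=True)

      for scan in sorted(SCAN_DIR.glob("*.png")):
          # --oem 1 selects the LSTM recognizer; lang picks the language model.
          text = pytesseract.image_to_string(
              Image.open(scan), lang="eng", config="--oem 1"
          )
          (OUT_DIR / f"{scan.stem}.txt").write_text(text, encoding="utf-8")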



      • #4
        Okay, it has become faster by using float instead of double. To me this sounds more error-prone. Though I of course appreciate any performance gain, in this case I just care about the quality of the results of the recognition process.

        A lot of use cases aren't even real-time critical. I archive a lot of paper, and I don't care at all how much processing time the recognition takes. If performance actually became a real PITA, I would simply provide more computing power. In most cases that is MUCH cheaper than dealing with recognition errors in the long term.
        Last edited by Joe2021; 01 December 2021, 09:56 AM.



        • #5
          Originally posted by Joe2021 View Post
          Okay, it has become faster by using float instead of double. To me this sounds more error-prone. Though I of course appreciate any performance gain, in this case I just care about the quality of the results of the recognition process.
          LOL. You don't need 64-bit precision for OCR. Depending on how the engine works, you could even get away with less than fp32, which would be the idea behind BFloat16 or int8.

          Remember, your input data is almost certainly going to be just uint8. So, at least the initial stages of recognition don't necessarily lose anything by reducing precision.
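
          To put rough numbers on that, here is a toy sketch (made-up layer sizes, not Tesseract code) that evaluates the same dense layer on uint8-derived inputs in float64 and in float32 and compares the outputs:

          # Toy precision check, not Tesseract code: run one dense layer on
          # uint8-derived inputs in float64 and float32 and compare the results.
          import numpy as np

          rng = np.random.default_rng(0)
          pixels = rng.integers(0, 256, size=256, dtype=np.uint8)  # fake 8-bit scan data
          weights = rng.standard_normal((128, 256))                # fake layer weights

          x64 = pixels.astype(np.float64) / 255.0
          x32 = pixels.astype(np.float32) / 255.0

          y64 = weights @ x64
          y32 = weights.astype(np.float32) @ x32

          # The float32 outputs agree with float64 to far better than one part in
          # 10^4, while 8-bit input pixels only resolve about one part in 256.
          print("max |y64 - y32|:", np.max(np.abs(y64 - y32)))
          print("max |y64|      :", np.max(np.abs(y64)))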

          Originally posted by Joe2021 View Post
          I don't care how much processing time the recognition process takes at all.
          You're sharing code with people who do care about performance and/or energy-efficiency. And they're making tradeoffs between those objectives vs. accuracy.

          Originally posted by Joe2021 View Post
          This is in most cases MUCH cheaper than dealing with recognition errors in the long term.
          No doubt, but there's a point where even you wouldn't be willing to spend more time/money/energy on OCR to achieve the next small improvement in accuracy.



          • #6
            Originally posted by Joe2021 View Post
            Okay, it has become faster by using float instead of double. To me this sounds more error-prone. Though I of course appreciate any performance gain, in this case I just care about the quality of the results of the recognition process.

            A lot of use cases aren't even real-time critical. I archive a lot of paper, and I don't care at all how much processing time the recognition takes. If performance actually became a real PITA, I would simply provide more computing power. In most cases that is MUCH cheaper than dealing with recognition errors in the long term.
            The difference in accuracy is probably too small to measure even at massive scale. The performance, on the other hand, is not only measurable but pretty significant. I don't know how I would get a reasonable estimate of who does care about resources, but I'm sure there are a lot of applications that do something immediate and interactive with the input. When I worked in document management (I left that field in early 2015), there were plenty of workloads that, like yours, were not very time-sensitive, but there are many other uses of OCR such as live translation, sorting documents into immediate routing logic, and so on. However, the "more resources are cheap" argument probably doesn't hold even for some offline scanning, because a lot of OCR applications end up with massive throughput, so the impact is considerable, especially when you will likely not see a single additional error as a result of this change.

            If you have an activation function that is so close to the edge that the precision causes it to go the wrong way, then the direction it goes is probably not terribly important. This is one neuron of one layer in one iteration of a process that either initializes with random weights (in training) or has the benefit of a matrix already computed from millions of calculations (in classification). The individual calculations don't matter much. Part of the reason a neural net is used in OCR is that you get a lot of bad data, and a net can often handle large amounts of bad data as it converges on a good answer.

            If micro-precision of a single calculation in a neural net mattered much to the end state of a classifier, you would probably also not be able to use a net on highly variable data like you get in OCR.
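
            As a rough illustration of that point, here is a toy sketch (made-up sizes, not Tesseract's actual network) that evaluates the same linear classifier head in float64 and float32 and counts how often the predicted class actually flips:

            # Toy sketch, not Tesseract code: compute the same classifier logits in
            # float64 and float32 and count how often the argmax (the decision) differs.
            import numpy as np

            rng = np.random.default_rng(1)
            n_trials, n_classes, n_features = 10_000, 100, 256

            weights = rng.standard_normal((n_classes, n_features))
            inputs = rng.standard_normal((n_trials, n_features))

            logits64 = inputs @ weights.T
            logits32 = inputs.astype(np.float32) @ weights.T.astype(np.float32)

            flips = int(np.sum(logits64.argmax(axis=1) != logits32.argmax(axis=1)))
            # Flips can only happen when two classes are nearly tied, i.e. inputs the
            # model was unsure about anyway; typically this prints 0.
            print(f"{flips} of {n_trials} predictions changed class")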



            • #7
              Originally posted by Palu Macil View Post
              The difference in accuracy is probably too small to measure even on a massive scale. The performance, on the other hand, is not only measurable but pretty significant.
              [...]
              If micro-precision of a single calculation in a neural net mattered much to the end state of a classifier, you would probably also not be able to use a net on highly variable data like you get in OCR.
              Fine, that is good to hear, and I do agree that everything is a tradeoff; I was definitely not talking about ALL use cases. I just have a perspective like "processing I do only once, but the result persists for the rest of my life". Hence I care more about the results than about the processing burden. That this has limits, and is therefore a tradeoff, well, I agreed on that.





              • #8
                Originally posted by Joe2021 View Post
                Fine, that is good to hear, and I do agree that everything is a tradeoff; I was definitely not talking about ALL use cases. I just have a perspective like "processing I do only once, but the result persists for the rest of my life". Hence I care more about the results than about the processing burden.
                I hope you keep the raw scans, as well. If there's ever any question, then you can at least go back and use your eyeballs to OCR the input.



                • #9
                  Forgive me, but logically one must prioritize the primary function of a tool, in this case recognition. If a tool does not properly perform its primary function, it is a defective tool. That makes it hard to accept a "trade-off" on the very thing that achieves the primary goal in order to favor objectives that are certainly very important, but nevertheless secondary to the primary function of any tool. So, in my opinion, given the primary purpose of this tool: precision above all, and if possible speed. Speed is fun too, I must admit!



                  • #10
                    For OCR, I actually use Omnipage 10 inside of Windows 95 in DOSBox-X. I like how it fuses hobbies like retrocomputing and media archiving, and it can help me gain another hobby: reading novels. I like the idea of being into novels, but I have school trauma from being forced to read, so I figure a good way to get into reading is to mix hobbies. I follow people who read novels to create audiobook covers as a hobby.

                    What I like about Omnipage 10 is that it can export to a Word for Windows 2 document, and you can print to PDF in Windows 3.11, so I could have an eBook collection that can be displayed on a 30-year-old OS. I'm starting with the Star Wars Legends continuity. I took a hair dryer to my 1976 Star Wars novelization, took out every leaflet, scanned every side of each one, and I can also glue the pages back together.


                    As somebody who has dabbled in book scanning, I recommend having a Windows machine or VM available, modern enough to install Fujitsu's ScanSnap drivers; that's a really fast document scanner if you don't mind removing pages for a fast scan. The easiest (and most destructive) way is just to cut the book apart, scan the leaflets, and throw them away, or you can do what I do: preserve the pages and eventually glue the book back together.


                    I only recommend destructive and semi-destructive methods for cheap books you bought in acceptable condition, because if you mess it up, you either destroy a cheap book and still have the scans, or it goes from acceptable condition back to acceptable condition once you glue it back together. If you rebind a near-mint book, it gets downgraded to "acceptable".


                    I actually recommend Windows 10, at least in a VM with USB passthrough, just for the fast Fujitsu ScanSnap scanning and to give newer versions of Omnipage a try. I have a strong feeling Office 2021 will be the last perpetual-license version of Office, and support for Windows 10 ends in a few years, so at that point Windows 10 will be a retrocomputing OS. A rule about retrocomputing OSes is that they shouldn't touch the internet unless you need it for updates; unplug the connection after the final Patch Tuesday update is installed. You could also use Word 2021 as a Rosetta Stone for backwards compatibility back to Word 97, and Word 97 is in turn backwards compatible with WordStar 4.0, a version of WordStar that you can run on a '70s CP/M machine. Also, Word 2021 can export to ODF 1.3.

                    So if you hate Microsoft, I'd recommend that setup for the last version of Windows and Office you'll ever need.

