Tesseract 5.0 Released For This Leading Open-Source OCR Engine

Written by Michael Larabel in Free Software on 1 December 2021 at 05:35 AM EST. 16 Comments

The long-awaited Tesseract 5.0 is now available as a big update to this leading open-source optical character recognition (OCR) engine, which uses neural networks to offer great accuracy and supports more than 100 languages for turning images of text into actual text.

Tesseract 5.0 had been available in alpha form since the end of 2020, and the Tesseract beta was released in August. On Tuesday, Tesseract 5.0.0 was officially released. Tesseract 5.0 delivers faster performance via "fast floats": it now uses floats instead of doubles for its LSTM model training and text recognition. This should lead to much faster training and OCR performance while using less system memory.
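
To give a rough feel for why switching from doubles to floats saves memory, here is a minimal sketch using only Python's standard library. It is not Tesseract code: the weight count is a made-up number and the helper name is hypothetical. It just compares the storage needed for the same number of values as C floats versus C doubles, and shows the small precision loss that comes with the narrower type.

```python
from array import array


def weight_buffer_bytes(n_weights: int, typecode: str) -> int:
    """Return the bytes needed to store n_weights values of the given typecode."""
    buf = array(typecode, [0.0]) * n_weights
    return buf.itemsize * len(buf)


# Hypothetical weight count for illustration only, not a Tesseract figure.
n = 1_000_000

single = weight_buffer_bytes(n, "f")  # "f" is a C float
double = weight_buffer_bytes(n, "d")  # "d" is a C double

print(f"float buffer:  {single} bytes")
print(f"double buffer: {double} bytes")

# The trade-off: a float cannot hold 0.1 as closely as a double can.
x = 0.1
as_float = array("f", [x])[0]
print(f"rounding error when storing 0.1 as a float: {abs(as_float - x):.3e}")
```

On typical platforms the float buffer is half the size of the double one. Smaller values also mean more of them fit in cache and in each SIMD register, which is the usual reason this kind of change helps speed, though the actual gains in Tesseract depend on the hardware and the model.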

Tesseract 5.0 also has native support for Apple Silicon, build system enhancements, API improvements for its library, better ARM support, and more. There are also other code improvements besides fast floats that should further help Tesseract's OCR performance.

Tesseract development originated at HP decades ago before being open-sourced in 2005. Google took over development of this OCR engine after it was open-sourced, but in 2018 it stopped contributing as much to the effort, which seems to be partly why Tesseract 5.0 took so long to materialize. Much of Tesseract's recent activity has come from Stefan Weil of UB Mannheim.

Tesseract 5.0 downloads and more details on this big open-source OCR update are available via GitHub.
