I have never been more encouraged by and thankful to Free and Open Source communities. Three months ago I posted a request for help with OCR’ing and processing 19th Century Newspapers and we got soooo many offers to help. Thank you, that was heartwarming and helpful: already, based on these suggestions, we are changing over our OCR and PDF software completely to FOSS, making big improvements, and building partnerships with FOSS developers in companies, universities, and as individuals that will propel the Internet Archive to have much better digitized texts. I am so grateful, thank you. So encouraging.

I posted a plea for help on the Internet Archive blog, "Can You Help us Make the 19th Century Searchable?", and we got many social media offers and over 50 comments on the post, maybe a record response rate.

We are already changing over our OCR to Tesseract/OCRopus and leveraging many PDF libraries to create compressed, accessible, and archival PDFs.

Several people suggested the German government-led initiative called OCR-D, which has made production-level tools for OCR'ing and segmenting complex and old materials, such as newspapers in the old German script Fraktur, or blackletter.

The Internet Archive had never been able to process these, and now we are doing it at scale.

We are also able to OCR more Indian languages, which is fantastic. This government project is FOSS, and has money for outreach to make sure others use the tools; this is a step beyond most research grants.

Tesseract has made a major step forward in the last few years. When we last evaluated the accuracy it was not as good as the proprietary OCR, but that has changed: we have done evaluations and it is just as good, and can get better for our application because of its new architecture.
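
For readers who want to run a comparison like this on their own scans, here is a minimal sketch of one common way to score OCR output: character error rate, the edit distance between the OCR text and a hand-corrected transcription, divided by the length of the transcription. This is only an illustration, not the evaluation procedure used at the Internet Archive; the function names and the sample strings are made up for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Smallest number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Hypothetical data: a corrected transcription and two engines' output.
ground_truth = "The Daily Gazette"
engine_a_text = "The Daily Gazctte"
engine_b_text = "Tbe Dai1y Gazette"

print("engine A CER:", character_error_rate(ground_truth, engine_a_text))
print("engine B CER:", character_error_rate(ground_truth, engine_b_text))
```

A lower number is better. Scoring two engines against the same corrected pages gives a like-for-like comparison, though a real evaluation would need many pages and some care about whitespace and punctuation.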

Underlying the new Tesseract is an LSTM engine similar to the one developed for Ocropus2/ocropy, which was a project led by Tom Breuel (funded by Google, his former German university, and probably others; thank you!). He has continued working on this project even though he left academia. A machine learning based program is introducing us to GPU-based processing, which is an extra win. It can also be trained on corrected texts so it can get better.
