Help wanted from fellow sleuth-archaeologists:

Recent reports suggest that a Russian OCR tool called Cuneiform has been released as Free and Open Source Software (FOSS). The unfortunate part, for me, is that all the news seems to sit on the Russian side of the web, and I don’t speak Russian.

The matter becomes extra confusing when you notice that there is an American company, Cognitive Enterprises, whose site presents it as the manufacturer of Cuneiform OCR, still sells the package (albeit a much earlier version), and keeps remarkably mum about the open-sourcing of its flagship product. Does anyone know what’s going on here? Is this open source release legit?

Easily beats two other FOSS OCR offerings

Why is this at all important? Well, I took a gamble and downloaded the software, and my test results with Cuneiform are so far easily superior to those of the other two FOSS OCR offerings, Tesseract and GOCR/JOCR. Without my telling it that it had to recognize Dutch (remember, I don’t know how to tell it that, as I don’t speak Russian), it managed to OCR several pages almost perfectly, leaving only 3 or 4 errors per page. The other two averaged more than one error per line, admittedly mostly because of their inability to recognize where a line started and ended. (Language recognition software, be it speech recognition or OCR, tends to pass the annoyance test if it leaves fewer than one error per sentence.) Good OCR software is hard to produce, and is therefore invariably expensive. A cheap (read: FOSS) version of a quality OCR tool has the potential to emancipate the long tail of printed text.
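For anyone who wants to repeat the comparison with numbers rather than by eye, here is a minimal sketch of how "errors per line" could be counted. It is not how I produced the figures above; it assumes you have a hand-corrected reference transcription of the page, and the function names `edit_distance` and `errors_per_line` are made up for illustration. The whole page is compared at once, so text that lands on the wrong line still counts against the tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def errors_per_line(reference: str, ocr_output: str) -> float:
    """Character errors in the OCR output, averaged over the
    non-empty lines of the reference transcription (an assumed metric)."""
    ref_lines = [line for line in reference.splitlines() if line.strip()]
    if not ref_lines:
        raise ValueError("reference text has no non-empty lines")
    return edit_distance(reference, ocr_output) / len(ref_lines)


if __name__ == "__main__":
    reference = "de kat\nzit hier"
    ocr_output = "de kat\nzit hler"
    print(errors_per_line(reference, ocr_output))
```

By the annoyance test mentioned above, a tool would want to stay below roughly one error per sentence on this kind of count.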

6 COMMENTS

  1. I’m a bit suspicious about this:
    – the .com website doesn’t mention anything
    – the .ru website (containing the “suspicious” software) is the only one talking about open-sourcing the software.

    ==> what tells us it isn’t a trojaned .exe sitting on the .ru site? Has anyone looked thoroughly into the legitimacy of this? If it is indeed a valid package, I’m really interested. If it’s a good OCR but a trojaned version, I’m not. Please enlighten me ^^

    Edh.

  2. I’ve tried to contact the owners of the .com site through several channels, but got nothing but silence in response. I don’t think the fact that the .com is staying silent means a lot.

    As for Trojans, that might indeed be a risk. There are people working with the source code, but those are Linux users, and their aim seems to be to get the package to run under GNU/Linux first. The fact that on their forum several hundred postings are advertising porn and only a few are discussing OCR doesn’t exactly fill me with hope.

    It would be a pity if this package were to die; the few test results I had were far superior to any OCR I’ve seen so far, including the output of FineReader 7, which I got via a magazine coverdisk a couple of months ago.
