Hindi OCR

Howdy
Does anybody know of any OCR package that can handle Hindi, or generally recognize Sanskrit? I have read that such a beast was being worked on, but haven’t been able to find a finished product.
A colleague in India has a large (150+ pages) mixed English/Hindi document she wants to create an electronic copy of. It was originally printed and proofed by a publishing company and they won’t release it.
I have a feeling it will be easier to start over, but promised I’d give it a shot. Any advice would be appreciated. It is a complex document (a structured diagnostic interview for psychiatric disorders), so the software would need to be fairly sophisticated. Thanks

Sanskrit is a language separate from Hindi, and much older. No one speaks Sanskrit (aside from the commonest catchphrases) any more besides the clergy and the occasional academic or hobbyist. Yet many hundreds of millions of people speak Hindi today.
Their “alphabets” are generally the same, though there are a handful of critical differences. Translating a psychiatric dissertation into Sanskrit would be a novelty but generally useless.
That having been said, my favorite Indo-linguistic-centric page has always been “Indology” at http://www.ucl.ac.uk/~ucgadkw/indology.html . Those folks are forever talking about the newest transcription software for the languages of South Asia.

This company might have what you need: www.scicomsoftware.com/prouducts.htm

RTA:
I quote:
“Like most of the languages of northern India, Hindi is descended from Sanskrit. Hindi and Urdu, the official language of Pakistan, are virtually the same language, though the former is written in the Sanskrit characters and the latter in the Perso-Arabic script. Pure Hindi derives most of its vocabulary from Sanskrit, while Urdu contains many words from Persian and Arabic. The basis of both languages is actually Hindustani, the colloquial form of speech that served as the lingua franca of much of India for more than four centuries.”

To sum up, I separated Hindi (specific) from Sanskrit (general) in my post for precisely this reason. Since Hindi is based on Sanskrit and uses Sanskrit characters, I am interested in OCR for Sanskrit characters in general, and the Hindi language in particular.

It is a structured diagnostic interview for a broad spectrum of psychiatric disorders, designed to be used in genetics studies (it is called the DIGS, for Diagnostic Interview for Genetic Studies). Translating it into Hindi (using Sanskrit characters) is a moderately novel enterprise, which is what makes it so valuable, and about the last thing it is is useless, according to a large number of Indian psychiatrists, not to mention NIMH (who funded the project) and many members of the international science community. Translating it into the Sanskrit language would be fairly silly, but since I specifically stated that it was translated into Hindi, I don’t understand why you felt the need to mention that.

Ahem. Sorry, you got my goat a little with that post.

That having been said, thanks for the link, I’ll check it out.

And thank you Gilligan. Glad to know they have Internet access on the Island.

I’m not even interested in where you got that quote, because it is inaccurate.

Sanskrit is written in what’s called the ‘devanagari’ script. Even the most formal Hindi today uses a form of ‘nagari’ which is (comparatively) all shot through with slang and shortcuts. Even the quickest familiarization with both will reveal a few critical differences.

Around 400 BC, as Sanskrit trickled down from the Brahmins to be used among the lower castes, it morphed into a lingo called Prakrit (Samskrta = “the right way to talk”, Prakrta = “the way the commoners talk”). Prakrit is much simpler and ‘street’, with fewer vowels and squashed-together consonants.

Variants of Prakrit were used widely by folk throughout India until around 1000 AD, when it collapsed in favor of regional dialects of Prakrit which developed on their own … Bengali, Punjabi, “old” Hindi etc. It was actually “old” Hindi that was “the lingua franca of much of India” your quote refers to. (By this time Sanskrit was a LONG dead language, used only by formal poets and clergymen etc.; not a language you’d hear or read on the street, not even a ‘mother tongue’ really - more of a language of exclusivity and high culture, a la Latin in medieval Europe.)

Hindustani is actually the name generally given to the particular (northwestern) dialect of “old” Hindi which became the common root of both Urdu and “modern standard” Hindi (itself notably different from “old” Hindi). (Today these languages are hardly “virtually the same”.)

So what you need to do is find something that can handle “the nagari script of modern standard Hindi” and forget about that S-word altogether.

Glad I could help.
I’m a loner, Dottie … a rebel.