Any ideas on how to automate this process with a script?

I have a highly repetitive task that I think could be automated but would require some fairly specialized OCR capabilities. Let me know what you guys think.

Basically, we have these jpeg files, they are always 640 x 480, and at a given location within these image files is a set of numbers, printed like this: 2200 / 2400

What I do is go through and rename these images from a big complicated name that is auto-generated when the pictures are captured to this: 22002400_dep.jpg

Now, the #### / #### always appears in the same place in each image and uses a very standard, easy-to-read font. However, the numbers sit on top of the rest of the image, so there's no solid background behind them. There is also a lot of other text within the image that I just ignore, so if I captured all the text in the image, I'd have to sort through it to find what I needed specifically.

I’m not sure if there is any OCR software that can be called up and used in a script, or what that would really entail, but specifically I would need the OCR to only read the numbers I care about (in my images this happens to be defined by the following rectangle: upper left vertex [283,401], lower right vertex [398, 414]).
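For reference, that rectangle translates directly into a crop. A minimal sketch (pure arithmetic; the only assumption is the ImageMagick-style `WxH+X+Y` geometry convention):

```python
def crop_geometry(upper_left, lower_right):
    """Convert two corner vertices into an ImageMagick-style
    WxH+X+Y geometry string for use with -crop."""
    (x1, y1), (x2, y2) = upper_left, lower_right
    width = x2 - x1
    height = y2 - y1
    return f"{width}x{height}+{x1}+{y1}"

# The rectangle from the post:
print(crop_geometry((283, 401), (398, 414)))  # 115x13+283+401
```

So with ImageMagick, something like `convert in.jpg -crop 115x13+283+401 out.jpg` would pull out just that region before OCR.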

Perhaps it would be easier to just cut and copy that image into a new jpg, OCR that image, and then apply the found text to the image I want to rename?

Not sure if it’s possible but here’s the basic algorithm I’d need:

  1. Open jpeg
  2. Scan jpeg using OCR, but only the relevant part
  3. Rename jpeg using scanned text
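Step 3 is easy to script once OCR hands back the text. A sketch of just that step (the `_dep` suffix and the `#### / ####` format are from the posts above; the regex is an assumption about how clean the OCR output is):

```python
import re

def ocr_text_to_name(ocr_text):
    """Pull the '#### / ####' pair out of raw OCR output and
    build the target filename, e.g. '2200 / 2400' -> '22002400_dep.jpg'.
    Returns None if no such pair is found."""
    match = re.search(r"(\d{4})\s*/\s*(\d{4})", ocr_text)
    if match is None:
        return None
    return f"{match.group(1)}{match.group(2)}_dep.jpg"

print(ocr_text_to_name("2200 / 2400"))  # 22002400_dep.jpg
```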

I also need to do this recursively and do some moving of images from one directory to another but I think I can figure that much out. The hard (impossible?) part involves the OCR bit.
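The recursive walk-and-rename scaffolding might look something like this; `ocr_region` is a placeholder for whatever OCR call ends up working, so the sketch takes it as a function argument:

```python
import os

def rename_images(root, ocr_region):
    """Walk `root` recursively; for each .jpg, ask `ocr_region`
    (a function taking a path and returning e.g. '2200 / 2400')
    for the numbers and rename the file in place."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.lower().endswith(".jpg"):
                continue
            path = os.path.join(dirpath, name)
            text = ocr_region(path)
            if not text:
                continue  # leave files the OCR can't read alone
            digits = text.replace(" ", "").replace("/", "")
            new_name = f"{digits}_dep.jpg"
            os.rename(path, os.path.join(dirpath, new_name))
```

Passing the OCR step in as a function also makes it easy to test the rename logic before the OCR part is solved.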

Let me know what you guys think if you can come up with any way to automate this, it would be great! Either Windows or Linux solutions are valuable.

Well, I just found and tried Tesseract OCR with absolutely no success. I even tried creating a new image that contained only the part I was trying to convert to text, and it couldn't do it. It wasn't even close (it converted, for example, 2200 / 2240 to anal mu{ ). I guess the problem might be that OCR has a difficult time when text is superimposed on another image rather than on a blank background, as you would normally have it.

Is the image in colour or grayscale? Does the background look like a bunch of geometric shapes or more like a fuzzy photograph? In other words, is there a way to distinguish between the background and the digits based on colour, texture, etc.?

The images are in color. The text is in white. I tried doing some manipulation (turning it grayscale, increasing the contrast, and inverting the colors), so that it looked as much like black text on a white background as possible. And my test image still failed with tesseract. I think it’s just too “fuzzy” or something.

The numbers are always in the same portion of the image? Have a script make a temporary version of the image, taking only that area as a selection. This will cut the scope of the work. Then write something that detects the noisiest areas of the image, and mask those areas out. You should be left with a selection shaped like the numbers.

Use your mask and a flood fill to recreate the letters in pure white on a black background. OCR that. Use the resultant data to rename the real file.
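The masking step can be sketched without any imaging library if you treat the image as a grid of (r, g, b) tuples; the near-white threshold of 200 is a guess you'd have to tune against real images:

```python
def white_text_mask(pixels, threshold=200):
    """Given an image as rows of (r, g, b) tuples, keep only the
    near-white pixels (the digits) as pure white on black.
    Returns rows of 0 (black) / 255 (white)."""
    out = []
    for row in pixels:
        out.append([
            255 if min(r, g, b) >= threshold else 0
            for (r, g, b) in row
        ])
    return out
```

You'd still need something like Pillow or ImageMagick to get pixels in and a clean image back out for the OCR pass.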

Are the numbers in pure white? It sounds as if you should be able to remap the colour space to wipe out the background. If you want a scriptable flow, you could start with the ImageMagick suite.

Here is an example of the relevant portion of each image that I need to scan and convert into text. It seems like the JPEGs are just too low quality, and the text too small, for OCR to read reliably.

If anyone here has any success converting that example into an image that an OCR can correctly read, let me know… but I’m not sure if it can be done.

Are you certain there's no EXIF data embedded in the JPEGs that might be usable?

Here's an online EXIF viewer you can test one on (just the first one that popped up on a search). If they were all taken or processed in order, maybe the date and time of creation or modification would let you calculate the numbers.

If you know exactly where each digit will be, it seems like you could construct a mask image for each possible digit, in effect, positioned right over the image.

Then have a program that loops through each “on” pixel in the mask and extracts the corresponding pixel from the original image. Average the brightness of all the pixels that are supposed to be white. Then use whichever digit has the highest score. Repeat for each position.

This might work better than OCR, since it takes advantage of your knowledge of exactly where the characters must be and the limited number of character choices. All you need is a scripting language that lets you extract individual pixels from an image.
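This matched-filter idea is only a few lines once you have pixel access. A sketch, assuming grayscale pixel rows (0-255) for one digit cell and hand-built per-digit masks, i.e. sets of (x, y) positions where that digit's strokes should be white:

```python
def recognize_digit(pixels, masks):
    """`pixels`: grayscale rows (0-255) for one digit cell.
    `masks`: dict mapping digit -> set of (x, y) positions that
    should be bright for that digit. Returns the digit whose
    mask pixels are brightest on average."""
    def score(mask):
        return sum(pixels[y][x] for (x, y) in mask) / len(mask)
    return max(masks, key=lambda d: score(masks[d]))
```

Run it once per digit position and concatenate the results; with only ten candidate shapes per slot, this is far more constrained than general-purpose OCR.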

You’re all making this too hard.

Simply extract the relevant subimages and use them as Captcha tests on a porn site of your own making. It sounds like the digits will be easy for humans to recognize, so you’ll only need 2 or 3 website hits per image to recognize all the digits. Heck, you can probably get 2 or 3 images OCR’ed per customer – anyone desperate enough for porn to solve a Captcha will probably be happy to solve more than 1, especially if you reward him with a glimpse of your prettiest model after the first.

After working the image over with the GIMP, I managed to get Tesseract to spit out this:
2440- I 2200

That looks pretty good. I’d try using ImageMagick to clip just the areas with the numbers (the area for 2440 and a second one for the 2200) and OCR each one individually.
I did this:
  1. Scale up by factor 10 - your image is 93x17 pixels, I scaled it up to 930x170
  2. Apply an Unsharp Mask with radius 5.5 and amount 10
  3. Apply a value correction - leave all values at default except for the middle source value; this is normally 1.0, set it to 0.51
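Two of those steps are plain pixel math if you'd rather script them than drive GIMP by hand. A sketch, assuming the middle-slider value behaves like a standard gamma correction (the unsharp mask is omitted since it really wants a proper imaging library):

```python
def scale_up(pixels, factor=10):
    """Nearest-neighbour upscale of a grid of pixel values."""
    out = []
    for row in pixels:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

def gamma_correct(value, gamma=0.51):
    """Levels-style midtone correction on a 0-255 value;
    gamma < 1 darkens the midtones (assumed GIMP semantics)."""
    return round(255 * (value / 255) ** (1 / gamma))
```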

That works pretty well if you split the image and run OCR on the two parts separately.

It CAN be done, but the images are really nasty and have to be cleaned up a lot. JPEG is not the best format for this kind of stuff.

I used to do something like this (image masks to block out stuff I didn't want) on a Mac in the late 90s (as a hobby thing, faking pictures, not professionally, and not on numbers), and it worked pretty well. Since it was on Mac OS 8 and the image masks were always hand-constructed one-offs, I can't give any useful tips on how to automate it today, but this gets my vote.

Waitaminit! No. This is the best solution:

:smiley:

Forgot to add, charge them for ‘teh privatelodge’. But not much. If they are willing to pay 10 cents a day for their porn fix, you can make lots of money, per perv/per day, before they figure out that you’re just making them pay for getting their own chain yanked… Which they could have gotten for free, the morons…

Yank the idiot’s chain! Ring up some MONEY! That’s my motto.

“Never give a sucker an even break.”

  • ‘some bloodthirsty bastard named Barnum I’m quoting’ -

Or a perv. Oh well…

You can automate OCR tasks like the one you describe using ABBYY FineReader. It's built for precisely this sort of thing.

Thanks for the tips all. It just seems like the images aren’t all that conducive to OCR. I tried following furdmort’s steps but my results weren’t as perfect as his. Compound this with the fact that the background isn’t always the same in every image, and it just seems like there’s no way to guarantee that the OCR results are going to be good every time. I guess I’ll just keep doing this by hand.

I still learned stuff though so thanks for all your input.

The captcha advice was particularly hilarious :smiley: