Document scanning software that gives high quality but small files?

I don’t know whether the software comes bundled with the scanner and is fixed, but if we can choose, I’d like to ask which one is best for my needs.

So, I have some documents that I want to scan. I have noticed that there are two types of PDF files that both appear to be just pages of words but are fundamentally different. One is pure text and is very small. The other is actually images of pages: we can read it all right, but we can’t use the Select tool to highlight any words, and the file is large.

Now, the place where I’m going to upload my scanned files has a (pretty stupid) small limit on file size. I’ve tried a few scanning apps and they produce bloated files. Attempts to reduce the file size using online tools resulted in broken pixels: still legible, but very ugly and annoying for readers.

So I guess what I’m looking for is scanning software that can produce Word-like files, or a smart conversion app that can recognize characters in a picture and turn them into a text document. FYI, a few of my docs have non-English words, but being able to get good files for the majority of them would still be a great boon.

Thanks!

Just for fun:

Do you have the .pdf available in any other format than paper?

What you need is a scanner, or post-processing app, that includes OCR (optical character recognition). There are a lot of them around, but I rarely need one and have just gone for the first free one that worked, and I think I just used an online one last time.

As long as all the letters scan cleanly it shouldn’t matter what the language is.
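For anyone who prefers a scriptable route, here is a minimal sketch using OCRmyPDF, a free command-line tool that nobody in this thread has named, so treat it as just one possible option. It keeps the scanned page image (stamps and signatures included) and adds a hidden, selectable text layer. The file names and the `eng`/`fra` language pair are hypothetical placeholders; the language codes are Tesseract's, and the matching language packs are assumed to be installed.

```python
import os
import shutil
import subprocess

def ocr_command(src, dst, languages=("eng",), optimize=1):
    """Build an ocrmypdf command line: -l takes '+'-joined language codes,
    --optimize takes a level from 0 (off) to 3 (most aggressive)."""
    return ["ocrmypdf", "-l", "+".join(languages),
            "--optimize", str(optimize), src, dst]

# Hypothetical file names and languages, for illustration only.
cmd = ocr_command("scan.pdf", "scan_ocr.pdf", languages=("eng", "fra"))
print(" ".join(cmd))

# Only run the tool if it is installed and the input file actually exists.
if shutil.which("ocrmypdf") and os.path.exists("scan.pdf"):
    subprocess.run(cmd, check=True)
```

The OCR quality still depends on how cleanly the page was scanned, so the output is worth a quick check with the Select tool.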

Good question! I assume the OP is talking about paper documents, but if they’re word processor documents (in electronic form), all he has to do is save them as PDFs from his word processor.

DjVu is worth a try. Scanned black-and-white text will be decently compressed to ~50 kB/page or thereabouts, while the selectable text is still there in an OCR layer.

If you must produce a PDF, you can get similar results in PDF format; just make sure you use the right kind of compression (JBIG2) to keep the file size small.
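To see why scanning in pure black and white (1 bit per pixel) matters so much before any compression is applied, here is a rough back-of-the-envelope sketch of uncompressed page sizes. The A4 paper size and the 300 dpi resolution are assumptions, not figures from this thread; the 50 kB target is the approximate per-page figure quoted above.

```python
# Rough uncompressed size of one scanned page at different bit depths.
# Assumptions (not from the thread): A4 paper, scanned at 300 dpi.

A4_INCHES = (8.27, 11.69)  # width, height
DPI = 300

def raw_page_bytes(bits_per_pixel, dpi=DPI, size_in=A4_INCHES):
    """Uncompressed bytes for one page at the given bit depth."""
    width_px = round(size_in[0] * dpi)
    height_px = round(size_in[1] * dpi)
    return width_px * height_px * bits_per_pixel // 8

bilevel = raw_page_bytes(1)   # black-and-white, 1 bit per pixel
gray = raw_page_bytes(8)      # 256 shades of gray
color = raw_page_bytes(24)    # full color

for name, size in [("bilevel", bilevel), ("gray", gray), ("color", color)]:
    print(f"{name:8s} {size / 1024:10.0f} KiB uncompressed")

# Compression ratio needed to get a bilevel page down to about 50 kB.
print(f"bilevel needs roughly {bilevel / 50_000:.0f}:1 compression")
```

A color scan starts out 24 times larger than a bilevel one, so the scan mode usually matters more than the choice of compression tool afterwards.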

I use Abbyy.

https://pdf.abbyy.com/

(No matter what OCR you use, plan to spend anywhere from a little to a lot of time editing. Especially if you want the copy to be a visual match for the original.)

Ah, this helps reduce my hassle a lot. Knowledge (of the keyword) is power!

I typed those docs in Word, yes, but for them to “count”, they need a real seal from an ‘authority’. So the current approach is to print the docs out, get the seals (and signatures), and scan them back in.

Unless it’s all over the place, I’m fine with legible presentation and a correct reproduction of the signs & seals.

DjVu seems not to be a standalone app, but a… method or something? Cause I can’t pinpoint one with only the name DjVu (and) OCR. Abbyy is not free, so currently I’m looking at GImageReader and FreeOCR. What do you guys think of them?

If it is something that has a signature and a stamp that need to be there, then OCR is useless to you: you are stuck with bigger files, and the answer to your problem is to find a better place to host them.

And DjVu (wow, I haven’t thought of that one in years) is a file format (like jpg or pdf or doc), not a program.

Just scan the seal and add it digitally, then print to PDF.

Small files, high resolution.

If a scan of the printed documents will do, the “realness” isn’t really important.

This is what I do with my signature. I have a scanned-in version that I use for documents. It was great for faxing, but it even works for situations involving email. I even made some for my parents so I could fill out their stuff.

Even today, there are medical and government forms that are available online but not as a proper form document (with fields you can type in), and which want an old-fashioned signature.

I use Adobe Scan on my phone regularly. It seems to produce a good result. I just checked one document’s size and it was 805 KB, so that might be too large for you, but you can open the file in Adobe Acrobat and compress it there. I assume you can use the desktop variety of Acrobat to compress an existing file.

Free software from the original creator of the PDF format.

Ahh… if I’m reading it correctly, then you were suggesting scanning the seal as a separate jpg/pdf file, then manually adding that image to the text doc file. (Kinda) best of both worlds.
Just curious: could a reader, if they wanted to, spot something off and tell that the document was manipulated?

By its definition, doesn’t this format seem ideal for my type of document? It can store both the raw text as characters and the stamp & signature as a picture, right? How it recognizes which is which is, I guess, based on AI/machine learning?

Was yours 805 KB for 1 page? My limit is 2 MB, and I have a few docs that span 10+ pages, so I guess <200 KB/page is the target. Others are short, though. And did you notice any dip in quality after compressing with Acrobat?
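The per-page arithmetic can be checked with a quick sketch. The 2 MB limit and the 805 KB figure come from the posts above; treating 1 MB as 1024 × 1024 bytes is an assumption, and whether the 805 KB file was a single page is still an open question.

```python
# Per-page size budget for an upload limit.
# Assumption: "2 MB" means 2 * 1024 * 1024 bytes.

LIMIT_BYTES = 2 * 1024 * 1024

def per_page_budget_kib(pages, limit_bytes=LIMIT_BYTES):
    """Average KiB available per page if the whole file must fit the limit."""
    return limit_bytes / pages / 1024

def max_pages(page_kib, limit_bytes=LIMIT_BYTES):
    """How many pages of a given average size fit under the limit."""
    return int(limit_bytes // (page_kib * 1024))

for pages in (1, 5, 10, 15):
    print(f"{pages:2d} pages -> {per_page_budget_kib(pages):7.1f} KiB per page")

# If every page really came out at 805 KiB, this is how many would fit.
print("pages that fit at 805 KiB each:", max_pages(805))
```

Anything past 10 pages pushes the per-page budget below 200 KB, so the longer documents are the ones that decide which scan settings are needed.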

Only if the other side can accept that format. Most places I’ve seen only allow PDFs for documents. (Though they may also allow PNG, JPG, and TIFF for images).

Yes, if the place where the OP plans to upload allows JPG, that should produce a smaller file size than PDF. Simply adjust the JPG compression level until you reach the file size you want while keeping the page sufficiently legible.
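Adjusting the compression level by hand works, but the trial and error can also be automated with a binary search. This is a minimal sketch: `encode` is a hypothetical callable that takes a quality setting and returns the encoded bytes, and the search assumes the output never gets smaller as quality goes up. The stand-in encoder below is fake, so the example runs without any image library.

```python
# Binary search for the highest quality setting whose output fits a byte budget.
# `encode` is a hypothetical callable: quality (int) -> encoded bytes.
# Assumption: output size does not shrink as quality increases.

def best_quality(encode, max_bytes, lo=1, hi=95):
    """Return the highest quality in [lo, hi] that fits, or None if none does."""
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        if len(encode(mid)) <= max_bytes:
            best = mid       # fits: remember it and try a higher quality
            lo = mid + 1
        else:
            hi = mid - 1     # too big: try a lower quality
    return best

# Fake encoder for illustration: pretends each quality step costs 4000 bytes.
def fake_encode(quality):
    return b"\0" * (quality * 4000)

print(best_quality(fake_encode, 200 * 1024))
```

With a real image library the callable would wrap the actual save step. With Pillow, for instance (assuming it is installed), that means saving into an `io.BytesIO` buffer with `format="JPEG"` and `quality=q`, then returning the buffer contents.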

A format that was never widely used or widely supported in its heyday (which was years ago) and has been abandoned by most of the supporters that it once had? No, that is not ideal.

They can see it is a scan.

But the document will be much more useful: search works, select, copy etc.

The school of “thought” that thinks faxes are a secure means of communication (not so for any sane value of “secure”) will probably say it isn’t real. Happily, they don’t have the skill to prove that, and those who could see it was never on paper won’t care.

It’s pretty well established in US law that a fax or photocopy is as good as an original. And there’s nothing that says a “signature” has to be ink.

I routinely complete & sign forms that start as PDFs or as scans of paper by using the form-filling & signing features of the free Adobe Acrobat Reader DC. Long ago I imported a JPG scan of my pen & ink sig, which I enlarged, cleaned up, and shrank back down in MSPaint. So it’s a beauty.

So I take the raw “form” (words and blanks on a page), type my info legibly into the empty spaces, apply the pic of my sig, and save that as a new PDF. Send it off via email, fax, or snail mail depending on how backwards and ignorant the other party’s procedures are.

Nobody, whether contractor, lawyer, judge, or insurance company, has ever had the slightest concern about legitimacy. And they like the fact that the form-filling is legible and the sig is dark, bold, and black.

I do not think there is anything wrong with it as a format (and most multi-format readers support it), but, as I originally said, you can get the same features (JBIG2 compression + text layer) in a PDF. If the recipient expects a PDF, by all means create a PDF.

Currently recommended formats, at least for books and textual materials, are:

  1. ISO 14289-1-compliant PDF/UA
  2. ISO 19005-compliant PDF/A
  3. Other PDF (non-crappy, with searchable text, embedded fonts, high-resolution images, device-independent color, etc.)

I haven’t used the brand-name DjVu software for a long time so I don’t know what it does now, but, yes, it did separate the text from the other layers automatically. Look for similar features in your PDF creator. Also, in all cases the OCR output needs proofreading, unless you have reason to be confident it is flawless.

ETA: if the document was created by you digitally, then there is no reason to scan anything except imported graphics, because the original text shows up via embedded fonts.

He types up the document in Word, prints a copy, has that printed copy signed and stamped, then scans that signed copy back into a digital file. He wants to make that scanned signed printed typed document available to unnamed audience x, but unnamed file host y has a low limit for document size. (I stand by “find a better file host”.)