Wikipedia says PostScript is readable; I converted PDF to PS and it looks totally Greek; why?

The Wikipedia article says that PostScript is a programming language, with operator names like moveto and structure that is not pretty but is still meaningful to a human. Well, so I used an online service to convert a PDF file to PostScript. In the resulting gibberish there are precisely 2 occurrences of “moveto”. But there is a lot of stuff like this:

437 900 25 28 /6D
$C
-Pd7SFQf&P61iu^3eX:YY#0N"[K@s+LT?&qcdcE[1"Hl5Q~>
,
462 901 15 35 /1O
$X
!.FtJ!WE3#"5j9s"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2H#X+9!A]"2G#S"2G#S
"2G#S"2FoP!'g~>
,
476 901 10 36 /6H
$X
"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#S"2G#Szz!!!,:"2G#S
"2G#S~>
,
487 900 26 28 /1S
$C
,RtB)K;V+bo^nWNoXr`os8W,-8:'.H=DMBJ6VtTa~>
,
526 900 36 37 /6L
$C
2$jBq&9R,omXbnp6+oF":J1BI
,;BTs8CjG?FHI&V7!PUYejEj)MSqC:]~>
,
562 900 1G ,
590 901 5Z ,
631 891 28 37 /1W
$C
3<0$Yq6Jr]l/M7\4<WLQ=5Wjms8W,udf.j@['gc%EEk2l_LM~>
,
659 900 27 28 /6P
$C

I am also unable to find text segments from the original file.

So what gives? Is this file somehow encrypted, so that by decrypting it I could get to the underlying text and other info?

Are there conversion programs, drivers and similar out there that would generate the kind of “meaningful” PostScript that Wikipedia seems to promise?

I am interested in applications like scraping text from PDF tables in cases where parsing a plain-text dump is ambiguous. I am aware of the existence of tools that directly parse PDF (and welcome any interesting info about those as well), but I still just wanted to look into the PostScript approach.

Are you sure you have a PS file?

Does it start with something like:

%!PS-Adobe-2.0

and end with something like:

%%EOF

(Lines that start with one or two "%"s are used to set up headers, trailers and such.)
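For reference, here is a sketch of a minimal hand-written PS file (real driver output is far messier):

%!PS-Adobe-2.0
%%Pages: 1
%%Page: 1 1
/Times-Roman findfont 12 scalefont setfont   % pick and size a font
72 720 moveto                                % position, in points from the bottom-left
(Hello, world) show                          % paint the string
showpage                                     % emit the page
%%EOF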

I looked at a short PS file I had lying around; almost all of it is gibberish-looking stuff having to do with setting up fonts and a bunch of other things. Basically binary data encoded in ASCII. The more complicated the file, the more such stuff is in the file. Especially if there are images. (Which can be encoded in an LZW sort of way.)

But near the end was the actual text. Most t(ext) sor(t) of (l)ook(s) l(ik)e (t)his. The parens let the PS engine know how to space out the letters.
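In generated output that tends to look something like this sketch (the fragment boundaries and nudge values are made up):

/Times-Roman findfont 12 scalefont setfont
72 700 moveto
(Most te) show       % each fragment is painted separately...
0.4 0 rmoveto        % ...with small spacing nudges in between
(xt sort of ) show
0.3 0 rmoveto
(looks like this) show
showpage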

If the PDF file was an “image” of text rather than text, then you’re not going to see anything meaningful.

Note that the code to, for example, decompress compressed data inside a PS file is included in the file itself. Outside of images and such though I have never seen a PS file that tried to compress the text part. But there can be a lot of code inside a PS file. A lot.

Also, note that there is a utility called “ps2ascii” which strips out a lot of stuff to get a semi-readable text file. It is part of Ghostscript, the standard PS processing package under Linux, and can be run under MS-Windows using Cygwin. If you have that lying around, run it on your PS file and see what you get.
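If you do have it, the invocation is just (filenames here are placeholders):

ps2ascii yourfile.ps yourfile.txt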

Yes, it does.

The PDF file is not an image - I can extract/copy text out of it without a problem.

Some of those are binary data chunks; they can be encoded in a variety of different ways, some of which are basically just raw bytes. And the things with the / before them are names being defined or invoked (the series of numbers before them are the arguments, passed on a Forth-like (RPN) stack).
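You can see the ASCII-encoding trick in miniature with one of PostScript's built-in decode filters. The chunks above end in ~>, which marks the ASCII85 flavor; the hex flavor makes for a shorter sketch:

(48656c6c6f>) /ASCIIHexDecode filter   % wrap hex-encoded bytes in a decoding filter
5 string readstring pop                % read the 5 decoded bytes
==                                     % prints (Hello)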

While it’s true that PostScript (and to a lesser extent, PDF) are programming languages, and therefore can be written to be readable, they are usually generated algorithmically by a print driver or similar tool, which is optimizing for transmission speed and file size rather than readability…after all, who’s going to read it? Hence all the mechanical-looking function names like “6L” and “6D”. They’re not going to be read by people, so just giving them short, incremental names makes the file shorter and saves on memory.

Also, some PDFs are encrypted, and the whole file will look like that by design, to prevent folks from reading or copying portions of it.

(PDF doesn’t have iteration, by the way (PostScript does), so it’s not a general-purpose language, just a markup format that’s kinda procedural.)
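Footnote on that: iteration in PostScript really is a built-in operator, e.g.:

0 1 4 { == } for   % a "for" loop: prints the counter values 0 through 4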

And in your case, you went through TWO layers of machine obfuscation (the original print-to-PDF and the converter), so it’s not too surprising that you can’t read much of it.

I get your point about it messing up the text strings. But, like I said, I counted the moveto operator and it occurs in only 2 places, near the top of the file. How can you print a document with 7 pages of text, including tens of rows in tables, with just 2 movetos?

My assumption is that they cannot mess up method names used for text formatting, unless they are compressing the source code. Or can they?

Which method names would you normally expect to find in quantity in a text-heavy PostScript file?

If it’s a simple document, you shouldn’t need any moveto at all. And the method names are completely arbitrary, which is why the automatic PS generators can get away with enigmatic names.
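For instance (a made-up sketch, not taken from your file), a generator can rebind the standard operators to one-letter names up front:

/m { moveto } bind def   % rebind long operator names to short ones
/l { lineto } bind def
100 100 m                % same as "100 100 moveto"
200 100 l stroke         % same as "200 100 lineto stroke"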

Even more specifically, “moveto” is six characters long, so re-defining it to 3 (say /6L) is a 50% savings. And in the case of the table, it might very well be drawing rectangles: 1 moveto and 4 linetos, plus their arguments, per rectangle. You can easily write a function that converts that to:

<x> <y> <width> <height> /R
(e.g. “10 10 100 100 /R”)

and all of those movetos and linetos disappear in favor of the more concise and useful form; basically the same abstraction you’d do with any language that lets you define functions.
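Here’s a sketch of what such a definition could look like (the name R and its calling convention are invented; real generators each roll their own):

/R {                       % stack: x y width height
  /h exch def /w exch def  % pop height and width
  moveto                   % consume x y
  w 0 rlineto              % bottom edge
  0 h rlineto              % right edge
  w neg 0 rlineto          % top edge
  closepath stroke         % closing the path draws the left edge
} bind def
10 10 100 100 R            % one call instead of a moveto and four linetos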

If you’re just trying to get the text out of the file, still in table format, just use a PDF-to-HTML converter. HTML is definitely human-readable.

Personally, I strive to eschew obfuscation. :slight_smile:

The output you are seeing is the result of how the PDF-to-PS conversion has been done - since the PS output cannot make assumptions about the final output device, font information has been included in the output. This can be done by loading a font definition into the PostScript file, or by just converting each glyph into a curve on the page - which removes the raw text. There may also be compression on elements of the file.
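You can see the glyph-to-curve idea in miniature with charpath (a sketch; a converter embeds the curves per glyph rather than doing literally this):

/Times-Roman findfont 48 scalefont setfont
72 400 moveto
(Hi) true charpath   % build the glyph outlines as a path instead of painting text
fill                 % paint the outlines; no searchable string remains
showpage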

The problem with any PS/PDF-to-text tool is that it assumes that text has not been converted to curves, and it also assumes that text has been placed linearly on the page. In my experience, neither of these assumptions is valid. I used to do quite a bit of PostScript hacking back in the mid-90s, as we moved from MASS-11 word processing on VAX/VMS to WordPerfect (on DOS, Windows and VAX/VMS) for scientific papers. The point was the output had to be perfect, and for many years, it wasn’t. And the WordPerfect PS output drivers were terrible, to the point of individual characters being positioned explicitly on the page to get the kerning and justification right. And don’t get me started on superscripts/subscripts. I wish we had just used TeX.

Si