pdf to txt without extra line breaks

I’d like to make my pdfs a bit more readable on the kindle, but when I copy
the text from a pdf, and paste it to a txt, it usually includes line breaks where
there really aren’t any. This is just fine to read on a computer screen, as the
lines are of the same length as in the pdf, but with the kindle’s narrower screen
it is pretty ann-
oying to read. Is there some fix for this?
Alternatively, can you just tell everyone you know to ditch pdf for everything but print preview, and start using some ebook format, and pass the word? thanks.

Rather than copying directly from the pdf, try saving the pdf as a text file, then opening that file in whatever application you’re using. Haven’t tried it on a Kindle, but it usually works with MS Word (depending on how the pdf was created).

I’ve written little scripts to strip out extraneous line breaks in almost every word processor and text editor I’ve ever used. It usually involves something like this:

  1. replace each double line break (two new lines in a row, or whatever is being used to mark line feeds) with $$$$ (or some other unique string). This flags the double returns for later restoration.

  2. replace each remaining single line break with the space character (getting rid of the single line feeds).

  3. replace $$$$ with a double line break (restoring the paragraphs).

I’ve added tweaks over the years like searching for multiple spaces and other artifacts.
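For anyone who would rather script it outside a word processor, here is a minimal Python sketch of the three steps above, plus the multiple-space tweak. It is only an illustration, not the script described in this post: the function name `unwrap_paragraphs` is made up, and it assumes paragraphs are separated by blank lines and that the string `$$$$` never appears in the text itself.

```python
PLACEHOLDER = "$$$$"  # assumption: this string does not occur in the text


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalize Windows (\r\n) and bare \r line endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 1: flag the double line breaks for later restoration.
    text = text.replace("\n\n", PLACEHOLDER)
    # Step 2: turn the remaining single line breaks into spaces.
    text = text.replace("\n", " ")
    # Tweak: collapse runs of spaces into a single space.
    while "  " in text:
        text = text.replace("  ", " ")
    # Step 3: restore the paragraph breaks.
    return text.replace(PLACEHOLDER, "\n\n")


if __name__ == "__main__":
    sample = "This line was\nwrapped in the pdf.\n\nThis is a new\nparagraph."
    print(unwrap_paragraphs(sample))
```

To run it on a real file, read the file into a string, pass it through the function, and write the result back out.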

You can also do this in a good text editor like NotePad++, without writing a script.

One problem, though, is that for some PDF books and other files, when you bring the text in from the PDF, there is essentially no difference between a new line and a new paragraph, at least in terms of line breaks. I just copied text from a PDF book I have on my computer, and the line breaks and the paragraph breaks are both simple carriage returns.

There were three spaces before the beginning of each new paragraph, though, so you could get your script or your text editor to look for things like that.

**Saintly Loser:** I tried saving one as txt, but it didn’t help. Might with others, though.

**standingwave:** How would I make a script like that? Would you mind pasting the code if you have such a script for word? (I’m using 2010 if it matters)

**mhendo:** How do you do it without a script, then?

Get yourself a good text editor, not just the shitty Notepad software that comes with your computer.

An excellent free one is NotePad++.

Once you have it on your computer, open it and paste in your text from the PDF. Then, go to Search > Replace…, and when the dialog box opens up, make sure that the Extended search mode is selected in the bottom left, and that Match whole word only is deselected.

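Then, in the Find what box, type \n (or \r\n if the file uses Windows-style line endings), and leave the Replace with box blank. This will remove all line breaks from your document. (If words at the ends of lines end up run together, put a single space in the Replace with box instead.) It will, however, also remove all paragraph breaks, leaving you with just one long paragraph.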

As I said above, some documents will have some other formatting that might allow you to keep your paragraphs. The document I opened today, for example, had a three-space indent before each paragraph. In those cases, before you do the replacement of the line breaks, you could do something like what standingwave suggested, where you replace all instances of a line break plus three spaces with a filler code, like $$$.

So you would replace \n[spacespacespace] with $$$.* Then do the line-break replacement described above. This would get rid of all line breaks, and would leave you with a document that has a whole lot of $$$ in it.

Then, you replace all instances of $$$ with line breaks, which should give you your paragraphs back.
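If you end up doing this a lot, the same indent trick can be scripted. This is a rough Python sketch, not anything built into NotePad++: the function name `unwrap_indented` is made up, it assumes every new paragraph starts with exactly three spaces and that `$$$` never appears in the text, and it joins lines with a space rather than nothing so words don't run together.

```python
PLACEHOLDER = "$$$"  # assumption: this string does not occur in the text
INDENT = "   "       # three spaces, as in the PDF described above


def unwrap_indented(text: str, indent: str = INDENT) -> str:
    """Join wrapped lines, treating 'line break + indent' as a paragraph break."""
    # Normalize Windows (\r\n) and bare \r line endings to \n first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Flag every line break that is followed by the paragraph indent.
    text = text.replace("\n" + indent, PLACEHOLDER)
    # Every line break still left is a wrapped line; join with a space.
    text = text.replace("\n", " ")
    # Put a line break back wherever a paragraph started.
    return text.replace(PLACEHOLDER, "\n")


if __name__ == "__main__":
    sample = "First paragraph\nwraps here.\n   Second paragraph\nwraps too."
    print(unwrap_indented(sample))
```

Note that the indent itself is dropped from each restored paragraph; swap the last replacement for `"\n" + indent` if you want to keep it.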

Anyway, download NotePad++ and play around with it a bit. It doesn’t cost anything, and it might work for you.

* [spacespacespace] means hit the space bar three times; don’t type in [spacespacespace]

worked wonders! thanks a bunch

Looks like you already got an answer and I would concur with it - use a text editor. You can do it in Word, but it’s a little more cumbersome. In Word, it would involve creating a Macro and as noted upthread, it’s not exactly a one-size-fits-all scenario. There can be minor variations from document to document. The trick is to find the string of characters that is unique to paragraph breaks and to temporarily replace it with some unique character string. Then replace all the single line-feeds, carriage returns, etc. with spaces. Then replace all the double spaces with single spaces. Then go back and restore the paragraphs by replacing the unique character string with double line feeds. The end result will be a document where the only line feeds are at the paragraph breaks and the characters within each paragraph are free to word wrap. HTH.

It’d probably be easier to just use this webpage: Remove Line Breaks Online Tool

Have you tried using Calibre to convert your PDFs to a more Kindle-friendly format?