Maddening problem with downloading text

Every now and then I need to download some text from a website; most often it’s legal text. This time it was a California ballot measure that passed in the last election.
To save space (and paper) I copy it into a Word document. The problem is a maddening symbol that appears when I click on the “paragraph” icon on the toolbar: a “broken arrow” that the nitwit who formatted the document for upload put at the end of EACH line. The pasted text runs to 276 pages, about half of which is empty space. I don’t have that “broken arrow” symbol in my working font, so I can’t simply use the Replace function to delete it from the document and give the text normal margins! Is there any way around this? :mad:

You’re just copying and pasting the text from the website into a Word document?

If so, try pasting it into plain .txt with Notepad or WordPad as an interim step.

Well, you are using the wrong tool.
Use a text editor, not Word. On the Mac, I’d suggest Text Wrangler.

The “broken arrow” is probably a hard return - you could try selecting from the end of one line to the beginning of the next, and pasting into the search & replace field.

My guess is that you’re seeing a non-printing character for a line break. You’d have to check your MS Word view options to be sure it is displaying that character. If I’m right, you can access that character in a Find/Replace by typing ^l into Word’s search box.

These manual line breaks would not likely be the fault of the web programmer, but of how the copy/paste works between Word and your browser. I’ve seen this before. It’s common with PDF files as well.

As an alternate solution, you might try “Save As” from your web browser. Save the html file and open that with Word.

I used Notepad to copy the document. Then I transferred it to regular Word, and did the formatting from there. (To save the paragraph demarcations I did want, I keyed in “XX” at each one. Then I used Replace to remove all those Paragraph symbols, and then I replaced the “XX” marks with new Paragraph symbols. Best formatting idea you (and I) ever came up with. :slight_smile: )

There is a simpler way to do that, assuming that there is a hard line return between paragraphs along with the one return after each line. Just do a find/replace for all double returns (^l^l) and replace them with a paragraph mark (^p), then replace all the single returns that are left with a space.

Is that an upper-case “I” or lower-case “L”?

They’re lower-case Ls. (I found out by copying and pasting into Word, then typing an upper I and a lower L and changing fonts until the two characters were distinctly different, then comparing what I typed to what I cut-and-pasted.)

Try this:

Advanced search (CTRL-H). Search for two line breaks (^l^l), replace with paragraph break (^p). That’s assuming they’re putting two lines between paragraphs.

Then search for line breaks (^l) and replace with space (just hit the space bar once).

That should give you text broken up into paragraphs. There may be glitches here and there, depending on the state of the text you’re copying, but it will get you close, anyway.
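For anyone who would rather do this on the plain-text copy (say, the Notepad file) instead of in Word, here is a minimal sketch in Python of the same two steps. It assumes paragraphs are separated by a blank line and that every other line break is just a wrap; the function name `reflow` is made up for illustration.

```python
def reflow(text: str) -> str:
    """Join hard-wrapped lines while keeping blank-line paragraph breaks."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Step 1: two line breaks in a row mark a paragraph boundary
    # (the plain-text counterpart of replacing ^l^l with ^p).
    paragraphs = text.split("\n\n")

    # Step 2: inside each paragraph, every remaining line break becomes
    # a single space (the counterpart of replacing ^l with a space).
    joined = []
    for paragraph in paragraphs:
        lines = [line.strip() for line in paragraph.split("\n")]
        body = " ".join(line for line in lines if line)
        if body:  # skip empty chunks left by runs of extra blank lines
            joined.append(body)

    # Put one blank line between the rebuilt paragraphs.
    return "\n\n".join(joined)


if __name__ == "__main__":
    sample = "SECTION 1. This act shall\nbe known as the Example\nAct.\n\nSECTION 2. The people\nfind and declare.\n"
    print(reflow(sample))
```

As with the Word version, expect glitches wherever the source does not actually put a blank line between paragraphs.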

Look at this. It’ll show you where the menus are. Just note which caret+character combos it sticks in the search box.

Are you pasting these into Word to print them, or read them later? If I were faced with 276 pages of text, I wouldn’t read it all despite the time I’d put in (re)formatting it.

I’m seeing if I can save you some time here. If you’re just skimming, or want to feel like you’ve done some due diligence by making an effort to read some of a proposition, why not just skim/selectively read it on the original site?

I’ve had the same formatting problems, and after spending more time formatting than reading, I was honest with myself and realized that I should just skim it instead… or search the original page for “inheritance” or whatever the issue was that I wanted to focus on.

When I prepared the copy I had downloaded with Notepad, I found that it included ALL of the ballot measures from the election! :o So I deleted everything but the specific item I wanted and got it down to only six pages all told. The item in Google was somewhat misleading. Quite a saving of time and paper (and ink). :slight_smile:

For any of this type of stuff, on Windows, I highly recommend Notepad++.

It is so superior to plain old Windows Notepad that there’s almost no comparison.

Thank you so much! I’ve had the same problem as the OP for decades. I always wanted to use Find/Replace, but I could never figure out which character to put in the Find field.

I disagree. There are many times that I copy from a website and each line ends with a paragraph mark, and other times that I copy from a website and each line ends with (what I now know is) a line break. Therefore, since I am using the same browser and the same version of Word in both cases, I conclude that the difference has something to do with the website itself, and not the browser or word processor.

If you can’t type a character, you could always copy and paste it.

I’ve done that many times with search and replace. The only downside is what happens when you replace ALL the carriage returns with space. The text may look far more dense. Maybe someone else has a better solution for preserving paragraph breaks, etc.

<<Are you pasting these into Word to print them, or read them later?>>
I print them out; usually these are cases from California Appellate Courts or the California Supreme Court. I reformat them to save paper, ink, and space.
The problem is that some documents seem to be formatted with each line treated as a paragraph. Hence the “broken arrow.” Now, however, I know how to overcome them… I wish I could overcome the damn ads hogging space on the right end of the screen on the SDMB. :mad:

It’s easy in Word. Assume all paragraphs end with x carriage returns.

  1. Replace all carriage returns with some unique character string, for example “-=”.

  2. Then replace all x consecutive occurrences of that with one carriage return.

  3. Then replace all single occurrences with a space.

If some paragraphs end with a different number of carriage returns, you may have to do some manual corrections. Or you could come up with a modification of the above steps to handle that.
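The three steps above can be sketched in Python on a plain-text copy, roughly as follows. This is only an illustration: the function names are invented, the “-=” placeholder is assumed not to occur anywhere in the real text, and `reflow_any_run` is one possible modification for paragraphs that end with differing numbers of returns, not the only one.

```python
import re

MARK = "-="  # placeholder string from the steps above; assumed absent from the text


def reflow_with_placeholder(text: str, x: int = 2) -> str:
    """Follow the three steps literally; paragraphs end with exactly x returns."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = text.replace("\n", MARK)       # 1. every return becomes the placeholder
    text = text.replace(MARK * x, "\n")   # 2. x placeholders in a row -> one return
    text = text.replace(MARK, " ")        # 3. leftover single placeholders -> space
    return text


def reflow_any_run(text: str) -> str:
    """Variant: treat any run of two or more returns as one paragraph break."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A return with no return on either side is a mere line wrap.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Whatever runs of returns remain are paragraph breaks; collapse each to one.
    return re.sub(r"\n{2,}", "\n", text)


if __name__ == "__main__":
    print(reflow_with_placeholder("one\ntwo\n\nthree\nfour"))
    print(reflow_any_run("one\ntwo\n\n\nthree\nfour\n\nfive"))
```

The literal version leaves a stray space when a paragraph ends with more than x returns, which is exactly the case that needs the manual corrections mentioned above.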

One would think so, but that has not been my experience with these “broken arrows”, possibly because the “broken arrow” is not really a character, but is a visual representation of “line break”.

If anyone wants to experiment, I found this very SD page to provide many examples. Within any individual post, there is one paragraph mark at the very end of the post, and another after “Last edited by”; all the others are line breaks. But there are plenty of paragraph marks elsewhere on the page.
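For that kind of experiment, one thing worth checking is the page’s own markup. My assumption, not something verified in this thread, is that Word pastes an HTML `<br>` tag as a line break and a `<p>` element as a paragraph mark, which would explain why the result depends on the website. A minimal Python sketch that counts both in a saved HTML file (the class and function names are made up for illustration):

```python
from html.parser import HTMLParser


class BreakCounter(HTMLParser):
    """Count <br> and <p> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = {"br": 0, "p": 0}

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; self-closing <br/> also lands here.
        if tag in self.counts:
            self.counts[tag] += 1


def count_breaks(html: str) -> dict:
    parser = BreakCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    sample = "<p>line one<br>line two<br/>line three</p><p>next paragraph</p>"
    print(count_breaks(sample))
```

A page whose count is mostly `br` would be expected to paste with “broken arrows”, and one that is mostly `p` with paragraph marks, if the assumption holds.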