Script to split a text document by page count? (Perl, applescript)

I’ve been hunting for days and I can’t believe no one has written a script to do this.

I’m on a mac and I don’t know from scripting to begin with, but I know that I can use Applescript and Perl without much fuss.

All I’m trying to do is take a .txt document (actually a mail merged document that is essentially the same three pages repeated hundreds of times), split it into separate files based on those three pages, and save each file under a name taken from a particular word on the second line of the document.

I know it’s easy to say it seems easy when I don’t know how to do anything to begin with, but still… it does seem easy. A friend who is very clever about these things says Perl scripts for working with text are everywhere… but I’m not finding them.

So… any tips or help would be much appreciated.

You probably aren’t finding such a script because anybody who’s written one has done it as a one-off that worked for their particular case and none other. It’s seriously a five or ten minute job for somebody with any kind of scripting-language experience, but I can’t imagine it would be worth the time to publish such a thing, since it would have to be modified for anyone else anyway.

On Craigslist computer gigs you can probably get it done for $5. All the more so if this is not secret stuff and you can just send the file to a programmer (that way they could do it in a non-Mac development environment, since there are relatively few people familiar with AppleScript out there).

Particularly the part about taking the file name from the second line.

I’m pretty sure you’d need to know a unique phrase at the end and/or beginning of each file, as well as a unique phrase before and/or after the name. It would be a lot easier than trying to parse lines or pages.

Do you still have the original template and database? It might be easier to start from there than from the output file.

If it is actually the same 3 pages, why not just save those 3 pages from a text editor?

I’m probably missing something, but if the problem is what it appears to be, writing even a simple script in Perl would be more work.

If you can give me a slightly more concrete description of the file format (how to identify the start and end of each file and how to grab the filename), I could throw together a Perl script for you.

A mail merged document is almost exactly the same for each set, but the names (and possibly other information) are changed in each one. It’s the output of a form letter.

The non-computerized equivalent would be a form where everything is already written, but they leave blanks for the names. Just saving the first three pages would be like filling in one person’s name and then sending a copy of that to everyone.

I’m sure there are more elegant ways to do it, but this is the first thing I came up with after reading the OP:



#!/usr/bin/perl -w
use strict;

my $text = "";
my $line = 0;
my $word = "";

while (<>) {
  if (/separator/) {
    # End of a sub-document: write out everything collected so far.
    open(OUTFILE, ">$word.txt") or die "Cannot open $word.txt\n";
    print OUTFILE $text;
    close(OUTFILE);
    $text = "";
    $line = 0;
    $word = "";
  }
  else {
    $text .= $_;
    $line++;
    # If this is the second line after the separator, take the first
    # word after "Hello, " to use for the filename.
    if ($line == 2) {
      if (/Hello, (\S+)/) {
        $word = $1;
      }
      else {
        die "Unable to determine the name for the file!\nLine is: $_\n";
      }
    }
  }
}


Since the OP is not a 100% unambiguous spec, I made some assumptions:
1. The input is a single text file.
2. It contains a number of sub-documents, each ending with a line containing the word ‘separator’. (Note: the last sub-document must also end with a separator!)
3. The separator line itself will not be written out.
4. The second line of each sub-document contains the pattern “Hello, xxx”, where xxx.txt is the name under which that sub-document must be written.
5. The fact that the sub-documents are mostly the same, and that each consists of multiple pages, is not relevant to the program.

This is the test document I used:



bla bla bla bla
bla Hello, w1 bla bla
bla bla bla bla

separator
bla bla bla bla
bla Hello, w2 bla bla
bla bla bla bla

separator
bla bla bla bla
bla Hello, w3 bla bla
bla bla bla bla

separator


To test the program, I would run perl split.pl testdoc.txt. Make sure that the pattern after "Hello, " does not contain any invalid filename characters, and that it will not overwrite an existing file in the directory!
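With the test document above, a run should leave one file per sub-document in the current directory. A sketch of what that looks like (prompt and listing are illustrative):

$ perl split.pl testdoc.txt
$ ls
split.pl  testdoc.txt  w1.txt  w2.txt  w3.txt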

Wow, that’s a pretty big Perl program for what could be accomplished by a single Unix command: csplit.

I hear Mac OS X is a BSD variant, so possibly it includes csplit by default. If not, it can be obtained as part of the GNU coreutils package.
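For instance, on the test document from the previous post, something like this should do it (that’s the GNU csplit syntax; a BSD csplit may want an explicit repeat count instead of '{*}'):

csplit -z testdoc.txt '/separator/' '{*}'

That writes pieces named xx00, xx01, and so on. Note that unlike the Perl script, csplit leaves each matched ‘separator’ line at the top of its piece, and naming the files from their contents would still need a separate step.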

I could do it in FileMaker easily enough. (Hey, I use this hammer like Thor and everything in creation is my freakin’ nail!)

If I knew my way around grep, I’d AppleScript BBEdit or its little brother TextWrangler.

csplit is located at /usr/bin/csplit on OS X 10.5.

Thanks for the tips, thoughts, and attempts, all!

I’m interested in the csplit thing, and that script that Walton wrote looks possible… but I’m SO much of a clueless noob that I don’t even know how to implement it.

In the meanwhile, I did come up with a long way around that still works.

The “Mail Merge Form Letter” is a page of HTML from my website… it’s the gallery page, which is identical for thousands of pages, with the following differences: the names of the thumbnails, the names of the pictures they link to, the page that “previous” links to, and the page that “next” links to. As follows:

The name of this page is 40ccba8621-8640.htm
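A much-simplified sketch of the repeating structure (the photos/ path matches what the rename script later in the thread looks for; the thumbs/ path, the exact tags, and the numbers here are made up to fit the page name above):

<A HREF="photos/40ccba8621.jpg"><IMG SRC="thumbs/40ccba8621.jpg"></A>
…
<A HREF="photos/40ccba8640.jpg"><IMG SRC="thumbs/40ccba8640.jpg"></A>
<A HREF="40ccba8601-8620.htm">previous</A>
<A HREF="40ccba8641-8660.htm">next</A>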

If you know how to read HTML and look at the bold red words, you can see what I’m doing: there is a consistent naming/sequencing scheme to the photos and thumbs, the pages are named after the photos and thumbs within them, and it’s all very linear and logical.
It’s also a completely tedious pain in the ass to update if you are not a detail-oriented person. Everyone who has been responsible for updating, aside from me, could throw this together in 5-10 minutes with the new images, enter a couple of references to the update elsewhere on the site, and be done. Me? I struggle for two hours making sure everything is correct.

On the other hand, I spent about 20 hours trying to figure out how to automate most of it, and I did.

I created a database where all I have to do is enter the basic information in about 4 fields and duplicate it endlessly, and it generates all the data: the image names, the page names and references, everything. In a flash.

Then I took the above text, slapped it in a Pages doc (Word would work, too) and entered the correct fields.

Merge.

Big fat document… how to split?

Well, the merge creates a 3-page section for each separate doc. So I printed to a PDF, opened it in Acrobat, used Acrobat’s “split document” function to split it into individual PDFs for each page, then ran “multiple export” to “accessible text” to convert it all to text. Then I ran them all through A Better Finder Rename with a 2-step command: change everything to 40ccba plus a starting number stepping by ten and a suffix of -.htm, then insert the second set of numbers.

And I end up with:
40ccba9941-9950.htm
40ccba9951-9960.htm
40ccba9961-9970.htm

etc…and yes, the names correspond correctly to the information within.

It’s the long way around, but it worked.

Now I have to do it for 20ccba and 40febo, etc. But it’s WAY easier in the long run for ME and my ADD brain.

Next up: Folder actions so that, when I scan a photo into a folder, it resizes, creates a thumbnail, watermarks, enters EXIF data, and uploads to the server!

Because of course, now that I have (or will have) all the gallery HTML pre-written, all I need to do to go live is add the actual photos.

And what’s really kinda silly about all this? What I’m really going to focus on, as soon as this is all under control, is converting my whole website to either Joomla + Gallery or Drupal + Coppermine or vice-versa.

But I can’t learn those things and make the change without having the old-fashioned existing system under control while I do.

So that’s what I’m doing.

Okay, I’ve got a nice two-step solution for you.

Step 1:
csplit -z inputfile '/<!DOCTYPE/' '{*}'

Where ‘inputfile’ is the giant mail-merged file. What this does is split the file up into a number of files named xx## where ## is an incrementing number. It splits on <!DOCTYPE, which makes a nice separator between pages. It doesn’t do any checking to make sure <!DOCTYPE isn’t occurring in the middle of a page, but that should be a fairly safe assumption to make. So at this point, the only thing left is to rename the files based on their contents.

Step 2:
./rename_files.pl xx*

Where rename_files.pl is the following Perl script:



#!/usr/bin/perl

use strict;
use warnings;

for my $file (@ARGV) {
    my ($first_img, $last_img) = ('', '');
    open(F, '<', $file)
        or die "Error reading $file: $!";
    while (<F>) {
        # Track the first and last photo names linked from this page.
        if (/HREF="photos\/(.*?)\.jpg"/) {
            $first_img = $1 if ! $first_img;
            $last_img = $1;
        }
    }
    close(F);
    next if ! $first_img || ! $last_img;
    # Strip the trailing digits off the first name to get the common prefix.
    (my $prefix = $first_img) =~ s/\d+$//;
    $first_img =~ s/\Q$prefix\E//;
    $last_img =~ s/\Q$prefix\E//;
    $prefix =~ s/-$//;
    rename($file, "${prefix}${first_img}-${last_img}.htm");
}


Standard caveats apply: It was a quick-and-dirty script. It doesn’t do much in the way of error checking. If it gets confused, it’ll either leave the file unrenamed (xx##) or rename it to something oddly long (40ccba1234-50ccba-5678.htm). The former would happen if it can’t find any picture names at all, and the latter would happen if you have 2 different prefixes on your pictures.
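To make the happy path concrete, here’s a trace with made-up names that match the scheme in the thread (comments only, nothing to run):

# Suppose file xx07 links photos 40ccba9941.jpg through 40ccba9950.jpg:
#   $first_img = "40ccba9941";   $last_img = "40ccba9950";
#   $prefix    = "40ccba";       # trailing digits stripped
#   $first_img = "9941";         # prefix removed
#   $last_img  = "9950";
#   rename("xx07", "40ccba9941-9950.htm");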

First of all, thank you very much for your time and assistance, I really appreciate it enormously.

So… I tried out csplit. When it wouldn’t work no matter what I did, I researched it. And tried and tried and tried, and kept getting an assortment of rejections, most of them having to do with DOCTYPE not being a directory… I tried messing with the DOCTYPE part of it, but nothing was working. (What is the -z?)

I stumbled onto the simple “split” command and messed with that, just to see if I could get SOMETHING working, and eventually I did. The only thing I managed to achieve was splitting the file into 4k pieces, which doesn’t help, of course, because they are not the right pieces. But I tried that after multiple attempts at getting it to split on a LINE (it appears that each page is 135 lines long) absolutely would not work… it just kept duplicating the whole file with a new name.

The total number of separate “docs” in the long file is 100. There are 13,504 lines. I have repeatedly tried:

and all it does is copy the file with the new name.

I tried saving it as .htm first, and I tried actually “assigning” the line numbers using TextWrangler. Nope.

Why doesn’t it like my split-by-line command?

A record of my attempts:

When it did split the doc, it would do so in a way I could not figure out. It either made a hundred docs and only one had any content, or, when I was trying to figure out how to make “line_no” work and entered just raw numbers, it would split the file in two, with the size of the two parts changing based on the number I used, though how I don’t know.
VERY frustrating.

(At one point I replaced the DOCTYPE line with one word, no code: BREAKLINE, in hopes that maybe the coding in the HTML was the problem. Nope.)
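(For reference, the stock syntax for a fixed-line-count split is something like:

split -l 135 inputfile

assuming roughly 135 lines per doc (13,504 lines ÷ 100 docs); it writes 135-line pieces named xaa, xab, and so on. But note the arithmetic: 13,504 ÷ 100 is not exactly 135, so a fixed line count would drift out of sync partway through the file anyway. Splitting on a pattern, as csplit and the Perl scripts in this thread do, is the safer bet.)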

Have you ever used the command line? If not, there are a lot of “gotchas.”
First of all, when you call csplit, type the following (without the quotes): "csplit -z "
Note the space at the end.
Then, drop the file you want to split onto the terminal window.
Type a space.
Then, type the following with the single quotes: '/<!DOCTYPE/' '{*}'
Hit return and report back.

Thank you, and here is my report:

-z tells it not to write out a file if the file’s empty. Otherwise, it creates an empty file to hold the nothing before the first <!DOCTYPE. Obviously, it’s not strictly necessary, but it does make things a bit tidier.

Anyway, it looks like your version of csplit’s a little different from the one I tested with (which was on a Linux box). Rather than fighting with it (since it looks like you’ve tried all the possible ways to get it to work and then some), it might be easier just to skip it and use another Perl script to do the splitting. I’ll throw something together right now.

Try it without the -z.
I’m not sure what that is supposed to do, and the OS X version doesn’t support it.

Ok, here’s a quick and ugly Perl script that can replace csplit:



#!/usr/bin/perl

use strict;
use warnings;

my $out_count = 0;

# Anything before the first <!DOCTYPE line ends up in 'initial_garbage'.
open(F, '>', 'initial_garbage')
    or die "Error creating output file: $!";
while (<>) {
    # Each <!DOCTYPE line starts a new output file; reopening F
    # implicitly closes the previous one.
    if (/^<!DOCTYPE/) {
        open(F, '>', sprintf('xx%04d', $out_count++))
            or die "Error creating output file: $!";
    }
    print F $_;
}
close(F);


Save it to ‘split_by_doctype.pl’, and then do ‘chmod 755 split_by_doctype.pl’ to make it executable. Then step 1 would be:
./split_by_doctype.pl inputfile

The file ‘initial_garbage’ holds any data before the first <!DOCTYPE and will probably be empty. Anyway, just run this, then run the rename script, and you should be golden.
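Putting both steps together (the input filename here is just a placeholder):

$ chmod 755 split_by_doctype.pl rename_files.pl
$ ./split_by_doctype.pl mergedfile.txt
$ ./rename_files.pl xx*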

Oh my GOD… it did exactly the same thing csplit did many times: I got back 2 files. One was 276kb, named xx0000, and the other was 0kb, named initial_garbage.
ARG!