Unix Command line question

I’m manipulating data files that are ~4 Gig in size. Opening them in a text editor is slow, I presume because the editor loads the full file into memory before I manipulate it.

So question: is there a command line utility I can use to view and delete lines from a text file without loading the whole thing into memory?

Is there a pattern that can be matched to change the lines in question? Things like sed and awk are designed specifically for processing text files line by line. A quick command-line Perl script can also do the job, but more details would be required.

I never deal with files that large, so the following are untested.

“grep” will display lines based on keywords (search)

A combo of “cat”, “head” and “tail” can probably be used to list a specific set of lines, but I’m not sure how efficient it is.
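For example (again, untested on anything this big), something like

head -n 1000 file | tail -n 10

should print lines 991 through 1000; head stops reading after the first 1000 lines, so the rest of the file is never touched, and you don’t need cat at all.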

I believe “sed” can do much of that too, including the deleting bits, but I don’t use it personally.

I’d guess there are a few dozen more that can perform similar tasks.

Was thinking I could use ‘ed’ (which I learned a long time ago, and still kinda remember how to use), but it apparently tries to load the whole file when invoked as well.

I could probably use a combination of pipes and ‘head’ to edit out lines, but that seems like it’d basically have to copy the whole file over through the pipe.

Off to go look at ‘sed’. That sounds like it might be what I need.

Interesting question. I know that Photoshop and the like can load images as tiled segments, so that the whole image is never in memory at once. You’d think it would be easier to implement that in a text editor. Maybe it hasn’t been necessary until recently?

Several text editors load only pieces of the file rather than the whole thing. Unfortunately, it’s been a while since I needed to do that and I can’t recall which ones.

Can you give an example of the kind of editing you need to do?

Basically just print a few lines to the screen, and then based on what I see, delete a few other lines. I could write a program to do this, but I was thinking maybe there was some command line utility that would take some equivalent of “print line 7” or “delete line 7” without having to load the whole file. ‘head -n7’ works for the former, so really I just need the latter.

Now that I think about it some more, the latter is going to be kinda slow even without loading the whole file into a buffer all at once, since you basically have to rewrite the whole file without the offending record. You can’t just “disappear” it from an existing file, since all the following records have to be “moved up” in the file.

But following freido’s advice I looked into ‘sed’, and ‘sed -i 7d file’ seems to work. It’s still kinda slow, presumably because of what I mentioned before: it has to write out a whole separate 4-gigabyte file with the deleted line missing. But it doesn’t just hang for whole minutes like ‘vi’ or ‘ed’ were doing, I’d guess because it reads and writes line by line instead of trying to put the whole 4 gigabytes in RAM at once and page faulting.
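(From what I can tell, -i is basically doing the equivalent of

sed '7d' file > file.tmp && mv file.tmp file

under the hood, i.e. streaming the edited copy into a temporary file and renaming it over the original, which fits with it never needing the whole thing in RAM. file.tmp is just a made-up name here.)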

So problem solved, and thanks for the suggestions, though if someone has any other advice for editing text files piecemeal I’d still be interested in hearing it for future reference.

What are the criteria that you’re using to filter the lines? It’s quite possible you can use grep, sed or awk to automatically decide whether to delete lines or not.

Well, to be honest, I think you’re asking the wrong question. Querying and modifying flat files is slow and prone to disk I/O error. Problems like this are why databases were developed. The file structure of a database lets you easily insert or delete records without modifying gigabytes of disk storage, and query without having to grep through the whole file, probably much more quickly than you can do now.

If you don’t want to migrate all the way to a database, you can get some of the improvements by splitting your single huge file into multiple files (with /usr/bin/split), say a few megabytes each, possibly with an additional index file telling which records are where. Then each edit has to rewrite only a few megabytes instead of gigabytes, and likewise a failed edit will probably only corrupt one file instead of the whole thing.
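For example (untested at this scale), something along these lines would chop the file into million-line pieces and glue it back together afterwards:

split -l 1000000 bigfile chunk.
# ... edit the individual chunk.* files ...
cat chunk.* > bigfile.new

Here bigfile and the chunk. prefix are just placeholder names; split writes the pieces as chunk.aa, chunk.ab, and so on, so the glob reassembles them in order.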

Rysto has a good question. If there’s some logic to what happens to the lines, we could hammer out a shell script, but if you’re stuck using sed or head or whatever, here are a few commands that might help.

head -n7 gives you 7 lines, right? If you just want one line, try sed -n ‘7p’ file; sed ‘7q;d’ file may be faster. For multiple lines, sed -n ‘7,10p’ file works, and sed -n ‘11q;7,10p’ file may be faster but is a little more complicated. You could also try awk ‘NR==7’ file or awk ‘NR==7,NR==10’ file, but I have no idea whether they’d be any faster on a huge file.

I would set up temporary aliases or short scripts to save retyping commands, like for displaying a line:

#! /bin/sh
# print line $1 of the data file ("file" stands in for the real filename)
sed "${1}q;d" file

Name it “l” and type “l 4000” in the terminal to print line 4000.
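A similar one for deleting a line (untested, and assumes your sed supports -i the way GNU sed does):

#! /bin/sh
# delete line $1 from the data file in place ("file" again stands in for the real filename)
sed -i "${1}d" file

Name it “d” and “d 4000” would delete line 4000.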

I trust you will back up this ginormous file first. Forgetting the 7 in “sed -i 7d” will delete all lines.

I deal with immense files all the time. One thing to do is to gzip them, and then use gzcat to pipe them into sed or Perl or whatever. You can grep them directly with gzgrep. You can also read them directly into Perl easily enough, and write them back out in gzipped format, though I haven’t had to do that part.
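For instance (off the top of my head, so double-check the exact command names on your system):

gzip bigfile                                        # leaves bigfile.gz
gzcat bigfile.gz | sed -n '7p'                      # print line 7 straight from the compressed file
gzcat bigfile.gz | sed '7d' | gzip > bigfile.new.gz # delete line 7, writing a new compressed copy
gzgrep pattern bigfile.gz                           # grep without decompressing to disk

On some systems the commands are called zcat and zgrep instead; bigfile is just a placeholder name.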

Try a different editor. I’ve opened files of several hundred megabytes in Emacs and XEmacs and it’s never taken more than a second or two to load; after that editing was just as fast as with any other file.

0.1 GB is a lot different from 4 GB, given the typical amount of physical memory in today’s machines (a few GB).

Yes, and 0.1 GB is a lot different from “several hundred megabytes”, which is what I wrote.

However, given how quickly XEmacs opens such files (compared with some other editors), it seems reasonable to assume that it’s not actually slurping the whole thing into memory. In light of this, I bet it would be just as competent opening a 4 GB file as a 1 GB file, no matter how much physical memory you have.

You are correct, I should have said ~0.5 GB. My statement still holds, though, that moving from 0.5 GB to 4 GB crosses an important memory threshold.

XEmacs does pull the full file into memory. To confirm this, I just opened a 100-million-byte text file (~95 MB), and the XEmacs image size at startup was 108 MB (103 MB resident). If I selected all text (C-x h) and copied it to the kill ring (M-w), the image jumped to 205 MB (199 MB resident). Incidentally, when I tried opening a 200 MB file, XEmacs refused with “maximum buffer size exceeded”.

I think the difference isn’t the time to load the file into memory, but that once you go above the available system memory you start using virtual memory and taking page faults, which cost much more time than simply using physical memory. It’s not simply that 4 GB takes ten times as long to load as 400 MB; there’s a definite discontinuity at some point close to my available system memory. But I tried Emacs with similar results, so I don’t think it’s an editor-specific problem, in any case. Thanks for the advice, though.

Yeah, in the long run I definitely need to change the program that creates the files so that it generates something more tractable than gigantic ASCII files. But just for the job I was doing, I needed something ‘quick and dirty’.

That’s kinda neat. Does gzgrep actually search through the file while keeping it compressed, or does it just make a temporary decompressed copy somewhere and run grep on it?

Just mentioning the old Unix line editors ed and ex, which may be of value here. It’s been nearly 20 years since I used either, so I wouldn’t have a clue how to apply them to your needs, but looking them up in Unix how-to references may give some useful assistance.