Hi guys, I need help from the Linux gurus, or any guru, really. I have a specialized task that I think could be accomplished with a script, but I’m not sure. OK, here it goes.
So here’s the situation. In a folder I’ve got a bunch of files that look like this…
filename_RECID1
filename_RECID2
filename_RECID3
etc.
The “filename” part is the same for every file. Only the RECID changes. These files are very short: just 2 lines of text. The first line is the header titles, and the 2nd line is the data.
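For example (made-up values, just to show the shape), one of these files might contain:
Date      Site     Reading
01/05     A12      42.7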
I need to combine all of these into a single file just called “filename”, so what I need is something to go through, read each file, capture the entire 2nd line of each file, and append it to a new file named “filename”.
This should be trivially easy to do, but first a question:
Do you need the files to be read and combined in any particular order?
You could do it with 1 (one) simple Linux command . . .
Oops… Just re-read your OP. You want only the second line of the file?
I was going to write:
cat filename* > new_file_name
but that gets both lines of all files. To get just the second line, and assuming that each file really does have EXACTLY two lines, no more and no less, try this:
tail -q -n 1 filename* > new_file_name
This reads the last line of each file and writes them all to one standard-output stream, which goes to new_file_name. The -q (quiet) option matters here: when tail is given more than one file, it normally prints a “==> filename_RECID1 <==” style header before each file’s output, and -q suppresses that. Be sure you choose a name for new_file_name that CANNOT be mistaken for one of the input files, or you may have trouble with this.
Both -q and the -n 1 spelling work with GNU coreutils tail (which essentially every Linux distribution ships) and with BSD tail. The old-style tail -1 still works in most implementations, but it is considered obsolete.
ETA: I asked above, does the order matter, and forgot to discuss that. By using the wildcard:
filename*
you will get the files in the order the shell’s expansion produces, which is sorted (essentially alphabetical) order. Note that alphabetical is not numeric: filename_RECID10 sorts before filename_RECID2. If you need to order them according to some other rule, then you will need some extra work to deal with that.
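One sketch of the numeric case, if your system has GNU ls: its -v flag sorts names containing numbers in numeric order, so RECID2 comes before RECID10. (This assumes the file names contain no spaces.)
tail -q -n 1 $(ls -v filename*) > new_file_name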
I see that even as I am typing this addendum, someone has come in with another way also.
My one-line method works provided the expanded wild-card list (that is, all your file names listed one after another with one blank space separating them) does not exceed the system’s maximum command-line length (ARG_MAX). On modern Linux that limit is typically around 2 MB, so it takes a very large number of files to hit it.
If you have a bazillion files, however, you need SmartAlecCat’s solution.
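Or, as a stopgap, you can hand the names to tail in batches. A sketch, assuming no spaces or newlines in the names: because printf is a shell builtin, expanding the wildcard for it never hits the command-line limit, and xargs then invokes tail with as many names as will fit at a time.
printf '%s\n' filename* | xargs tail -q -n 1 > new_file_name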
The order of the files doesn’t matter, but if it does it alphabetically, that’s great. I can’t wait to give this a try. I’ll come back with results in a bit! Gotta create all my header files first
That should be easy with ANY text editor, of which there are many from which to choose. If you are using Linux with Gnome, you should have gedit – that is probably among the easiest, most Windows-Notepad-like editors.
Let’s make sure I got your question right:
(a) Line #1 in every file is a header line.
(b) Line #1 in every file is identical to Line #1 in every other file?
(c) So you want to put just ONE copy of Line #1 from any one of these files at the beginning of your combined file?
Is this just a one-time job? If so, just use an editor. With gedit and any other GUI-style editor, you should be able to open your big combined file in one tab, open one of the source files in another tab, and just cut-and-paste a line.
Is this a job that you want to automate so you can do it over and over?
Okay, if you want the entire task automated, like if you are going to need to do something like this over and over, I think these two lines will do it:
head -n 1 `ls filename* | head -n 1` > output_file
tail -q -n 1 filename* >> output_file
Note carefully! Those two single-quote-like marks in the first line are backquotes, also known as grave accents – the character on the same key as the tilde ~ on most keyboards. They tell the shell to run the enclosed command and substitute its output into the line.
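If those are hard to read or type, any POSIX shell also accepts the $( ) form of command substitution, which does exactly the same thing:
head -n 1 $(ls filename* | head -n 1) > output_file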
ETA: Now, are you wanting to do this, say, every day, with a new set of data files? Will you want to use a different output file every day? If so, you might want to make it into a script with a command-line parameter to name the output file:
#!/bin/bash
# Usage: give the output file name as the one command-line argument ($1).
# Grab the header (line 1) from the first matching file...
head -n 1 `ls filename* | head -n 1` > "$1"
# ...then append the data line (line 2) of every matching file.
tail -q -n 1 filename* >> "$1"
Oh, I see from an earlier post…
You DID say you want to automate this completely, since you will do it repeatedly.
So use the method shown just above, and consider the script idea where you can give the output name as a command-line parameter.
To run it: Suppose you put the above script into a file named catdata (You might be able to cut-and-paste it directly from the above post!)
And suppose, to keep it simple, it’s in the current directory. (We Linux folks don’t call it a folder.)
Then, do this: chmod 755 catdata (You only need to do this once after you create the file, to make it executable.)
To run it, type: ./catdata my_new_file
where my_new_file is the name of the new output file you’d like to create for the day’s run.
What happens to the original files after you’ve done this with them? You could add to the script to delete them. Or, you could move them to a different directory for safekeeping, leaving the current directory devoid of your source files, ready to begin collecting tomorrow’s new files.
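For example, a sketch of the safekeeping variant (the directory name saved is just an illustration):
mkdir -p saved       # create the safekeeping directory if it does not exist yet
mv filename_* saved/ # move today’s source files out of the way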
You wrote that all your source files have names like:
filename_RECIDx
and you want to put them into an output file named filename
This requires a bit of care. The script I showed above uses the wildcard filename* which will take all files beginning with filename – if the output file begins with those same characters, it will be swept up as an input file the next time you run the script.
If the input files are all like filename_RECIDx (including that underscore character) and the output file will be filename (without the underscore), then include the underscore in the wildcard, in both places where it appears in the script. That is, filename_* instead of just filename*
And DON’T include the underscore in the output file name.
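So the script becomes (identical to the one above, with the underscore added):
#!/bin/bash
# Usage: ./catdata output_file  (output name must NOT begin with filename_)
head -n 1 `ls filename_* | head -n 1` > "$1"
tail -q -n 1 filename_* >> "$1"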
Will you have a different filename part of those names for every day’s run? Or will that be the same every day? Will the combined output file be the same name every day?
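Here’s another way to do it, using find instead of a shell wildcard. This is a sketch: it assumes the files sit in the current directory (hence -maxdepth 1), that your find supports -maxdepth and -and (GNU and BSD both do), and that the output file is literally named filename:
find . -maxdepth 1 -name "filename*" -and -not -name "filename" -exec sh -c '
    if [ -e filename ]; then
        tail -n 1 "$1" >> filename   # output already exists: append just the data line
    else
        head -n 2 "$1" > filename    # first file: keep its header line and data line
    fi' sh {} \;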
This does not rely on globbing (wildcard expansion by the shell), and this should work on an arbitrary number of matching files. So, you avoid the “too many arguments” issue.
It will also include the first two lines of the first matched file, then only the second line of subsequent files, as you wanted.
It decides whether to include the first two lines of matched files based on the existence of the output file. If it doesn’t exist, it will create it and include the first two lines of the file. If it already exists, it will append the second line of the current file. So, remove the output file before re-running the command.
ETA: This also avoids the possibility of unintentionally processing the output file as an input file, even if the naming conventions are similar. The ‘-and -not -name “filename”’ accomplishes that.
Thank you so much you guys! You are all life savers. I have it all implemented now and working quite well. You are a wealth of information and I really really cannot tell you how much I appreciate the help and the explanations. And all the options! So nice.