Can someone help me dissect these UNIX commands?

I’m trying to find an easy way to grab numbers from a file and put them into another file that Excel can read directly. I’ve done something like this once before, when I was processing thousands of NMR integrations. I’m hoping I can restructure the same commands to do what I want, but this stuff is over my head. Here are the original commands:

grep 11.0993 t* | awk '{print $5 ;}' > ortho.txt

grep 3.69197 t* | awk '{print $5 ;}' > BezV.txt

I can tell you that 11.0993 is the beginning of the integration field in the NMR file. Same with 3.69197. An example of one of the files looked like this:

diphenyl trial2 spec- 2

region start ppm end integral
1 11.0993 10.8871 0.00198048
2 8.19681 8.04883 0.00190193
3 7.81661 7.67546 0.0015819
4 7.54984 7.42598 0.0968807
5 7.32884 6.70227 0.249264
6 6.52742 6.3647 0.0246814
7 3.69197 3.56007 0.0255509
8 3.34185 3.20996 0.00218045
9 2.17248 2.01948 0.0431081
10 1.69599 0.694921 0.545598
11 0.310231 0.109143 0.00727177

Somehow those UNIX commands would take the integral from a hundred different files like this and put it into a nice neat text file that Excel would easily import.

Well, it looks like you’ve got something pretty close. The first part, grep 11.0993 t*, finds every line containing 11.0993 in every file in your current directory whose name begins with t, and outputs those lines. The lines are then piped to awk, which by default splits input on whitespace and is asked to print the fifth column. But it looks like your files only have four columns. Do you mean print $4?

The last part redirects the results of awk to a file.

“grep 11.0993 t*” will search files that have names starting with t, and print any lines containing the text “11.0993”.

“awk '{print $5 ;}'” will print the 5th word on each of those matching lines (where words are separated by whitespace).

“> ortho.txt” writes the result to a file named ortho.txt.

That seems slightly incompatible with your example, as it has only 4 items on each numeric line. Perhaps there’s an empty item that doesn’t show correctly as formatted here. Or maybe you need to change it to 'print $4' instead.

You can see what output the command produces by removing the “> ortho.txt” part while you’re experimenting.

I read the OP as being a known-good example, and my guess was that the lines all started with a space. Thus, when grep prepends each output line with “filename1:”, you gain a field.

Yes, it looks like the OP’s lines have a space in front of them (visible by quoting the post), so the $5 is certainly correct.

(Well, as long as there is more than one file matching t*… if there is only one file that matches, grep will not prepend the filename to the line…)
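To see the extra field for yourself, here is a small sketch. The files t1.txt and t2.txt are made-up stand-ins that assume the leading space discussed above, and -H is an option I am assuming your grep has (GNU and BSD grep both do).

```shell
# Work in a scratch directory so no real data is touched.
cd "$(mktemp -d)"

# Two made-up files; each data line starts with a space, as assumed above.
printf ' 1 11.0993 10.8871 0.00198048\n' > t1.txt
printf ' 1 11.0993 10.8871 0.00205511\n' > t2.txt

# Several files: grep prefixes "name:", so the integral is field 5.
grep 11.0993 t* | awk '{print $5}' > many.txt

# One file: no prefix, so the integral is field 4.
grep 11.0993 t1.txt | awk '{print $4}' > one.txt

# -H always prints the file name, so $5 works for any number of files.
grep -H 11.0993 t1.txt | awk '{print $5}' > forced.txt

cat many.txt one.txt forced.txt
```

If the field number ever looks off by one, running the grep part alone and counting the words on a line is the quickest check.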

In the one minute I have before I head out the door, I wanted to add…

WarmNPrickly: note that you can add bits of the command one at a time to get a feel for what they do. So, run this:

grep 11.0993 one_file.txt

and see what you get. Then:

grep 11.0993 t*

Then add the awk part:

grep 11.0993 t* | awk '{print $5 ;}'

Note that ‘11.0993’ can appear anywhere in a line and grep won’t care, so if it’s possible that that string appears somewhere else (and if you don’t want to grab those lines), you’ll need to do something else, perhaps:

cat t* | awk '{if ($2==11.0993) print $4;}'

Here, “$2” means “second field” (or, “second word in the line”). The ‘cat’ part just prints the contents of the files matching t*. (Awk also could take the input file list as an argument, but this version is closer to what you’ve seen already.)
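For completeness, here is roughly what the version with awk taking the file list itself might look like. The sample files are invented for illustration; only the awk line matters.

```shell
# Work in a scratch directory with two made-up sample files.
cd "$(mktemp -d)"
printf ' 1 11.0993 10.8871 0.00198048\n 7 3.69197 3.56007 0.0255509\n' > t1.txt
printf ' 1 11.0993 10.8871 0.00205511\n 7 3.69197 3.56007 0.0261234\n' > t2.txt

# Same test as the cat version, but awk opens the files itself.
# The condition sits in front of the braces, so no "if" is needed.
awk '$2 == 11.0993 { print $4 }' t* > ortho.txt

cat ortho.txt
```

Because nothing adds a file name to the front of the line here, the integral stays in field 4.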

Are you grabbing these numbers from a single file? Or from several? The ‘t*’ in your example means to match all files which begin with the letter t. The parsing commands you need will depend on how many files you need to search. Which matches your situation?

  1. I only have one file and I will specify the exact name, or
  2. I have multiple files and I need to search all of them, or
  3. Sometimes I have one file and sometimes I have more than one file to search.

I am grabbing these numbers from several files. In this case each file will have 10 numbers to grab. 9 of them will be in sequential rows in the same place, the 10th will be elsewhere. I have to go for the moment, but I can give an example of the actual file I want to grab them from when I get back tonight.

Thank you all so far.


grep 11.0993 t* | awk '{print $5 ;}' > ortho.txt

grep 3.69197 t* | awk '{print $5 ;}' > BezV.txt

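You could recode both of these commands more simply like this. Note that it prints $4 rather than $5: when awk reads the files itself, there is no grep file name prefix to add an extra field.

awk '/11.0993/ { print $4 }' t* > ortho.txt

awk '/3.69197/ { print $4 }' t* > BezV.txt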

The text in the slashes in my awk invocations means ‘on each line that contains this pattern, run the following command’. awk is a very pattern-driven language; there are even special patterns BEGIN and END, which mean ‘run this code before processing the input’ (usually used to set variables) and ‘run this code after processing the input’ (usually used to do some final output), respectively.
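As a rough sketch of BEGIN and END in action, the following writes a small comma-separated file with a header row, which Excel should be able to open. The sample files, the ortho.csv name, and the counter n are all made up for illustration. FILENAME is a built-in awk variable holding the name of the file currently being read. The integral is field 4 here because awk is reading the files directly.

```shell
# Work in a scratch directory with two made-up sample files.
cd "$(mktemp -d)"
printf ' 1 11.0993 10.8871 0.00198048\n' > t1.txt
printf ' 1 11.0993 10.8871 0.00205511\n' > t2.txt

awk '
  # BEGIN runs once, before any input is read: write the header row.
  BEGIN     { print "file,integral" }
  # The pattern rule runs on each matching line: write one data row.
  /11.0993/ { print FILENAME "," $4; n++ }
  # END runs once, after all input: report the count on standard error.
  END       { printf "%d lines matched\n", n > "/dev/stderr" }
' t* > ortho.csv

cat ortho.csv
```

The count goes to standard error so that it shows up on the terminal without ending up inside ortho.csv.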

You’re probably using ‘GNU awk’ (to find out, type ‘awk --version’ on the command line); if you are, reading the manual will likely be very useful for you. (I suggest the ‘one page per node’ HTML version for online reading, and perhaps the PDF or PostScript version for downloading and printing.)

The “.” character matches any single character in a grep pattern, so “grep 11.0993” will not just find lines that contain 11.0993; it will also find lines that contain 1140993, or 11x0993, or 11 0993, etc. You should use

grep "11\.0993" 

instead.
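A quick way to convince yourself, using a made-up file in which only the first line really contains 11.0993 (-c makes grep print the number of matching lines, and -F is the standard option for treating the pattern as a fixed string):

```shell
# Work in a scratch directory with one made-up sample file.
cd "$(mktemp -d)"
printf ' 1 11.0993 10.8871 0.00198048\n 2 1140993 11x0993 0.5\n' > t1.txt

grep -c '11.0993' t1.txt     # unescaped dot: both lines match, prints 2
grep -c '11\.0993' t1.txt    # escaped dot: only the first line, prints 1
grep -cF '11.0993' t1.txt    # fixed string: only the first line, prints 1
```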

Are you trying to extract the integral from one specific resonance out of multiple spectra? If you want to do that you just have to adjust the number (like the 11.0993) to the appropriate chemical shift and maybe adjust the t* for the actual file names. Then execute the command.

The integration interval has to be exactly the same for all spectra. If the starting number is even slightly different in one file, that file’s line won’t be matched.

This goes for my awk-only version as well, but we’d need to see the data to verify whether this nitpick actually matters.
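If the start values do turn out to drift a little between spectra, one possible workaround is to compare numerically within a tolerance instead of matching text. This is only a sketch: the sample files are invented, target and tol are hypothetical variable names, and 0.001 ppm is an arbitrary tolerance that would need adjusting for real data.

```shell
# Work in a scratch directory; the second file has slightly shifted values.
cd "$(mktemp -d)"
printf ' 1 11.0993 10.8871 0.00198048\n 7 3.69197 3.56007 0.0255509\n' > t1.txt
printf ' 1 11.0991 10.8870 0.00205511\n 7 3.69201 3.56010 0.0261234\n' > t2.txt

# target and tol are hypothetical names; 0.001 is an arbitrary choice.
awk -v target=11.0993 -v tol=0.001 '
  # Keep rows whose start ppm (field 2) is within tol of target.
  $2 - target < tol && target - $2 < tol { print FILENAME "," $4 }
' t* > ortho.csv

cat ortho.csv
```

Too large a tolerance would start picking up neighbouring regions, so it is worth checking the output against a file or two by hand.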