What is happening to these data files? (Linux/OS X)

There are some files I need to work with, which are currently on a Linux computer. I wanted to transfer them over to my computers, Macs both, to work with them more easily. But somewhere in the process, the files are getting mangled in a very peculiar way, and I’d like to figure out how.

The original files are ASCII text, organized as a tab-delimited table. The first column is an index number, the second column is a radius value, the third column is a density, and so on. The files have no particular extension. When they get mangled, however, the result appears to be that each column of values ends up as its own block of text, followed by a block of text for the next column, and so on.

The first time I tried copying the files, I (sitting at the Linux computer) did something like


scp -r * me@mycomputer:path/

The result was that all of the data files got mangled in the above-described manner. Once I realised that the files were getting mangled, I went back over to the Linux computer and did some experiments. If I copied over a single file, it still got mangled. If I copied over the file as-is, and then added the extension .txt at the other end (either as part of the copy command, or as a separate command afterwards), it still got mangled.

However, if I first renamed the file on the Linux box to have a .txt extension, and then transferred it over to the other computer, it was not mangled. Aha, I conclude, there must be some bug in scp, that it didn’t know what to do with this ASCII file that wasn’t called .txt. So I tar together all of the files on the Linux computer, and then use scp to transfer over the whole .tgz file. Surely, scp can’t mess that up. And yet, when I now expand that .tgz file back out, lo and behold, the data files are messed up, in the same way as before. Now, I know that it can’t have been scp’s fault, because it was a compressed tar file (and yes, I confirmed that it really was compressed), so scp couldn’t possibly have been doing anything to the innards of any particular file. Had it done anything untoward to the .tgz file, it would have rendered it unopenable, or at least scrambled it into something indistinguishable from random noise.
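One way to test the "scp couldn't have touched it" reasoning directly is to compute a checksum of the same file on both machines and compare the two values. The following is only a minimal sketch in Python; the function name `sha256_of` is made up for illustration, and it assumes Python 3 is available on both the Linux box and the Mac.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in binary mode and in chunks, so large archives
    do not need to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `print(sha256_of("some_file"))` on each machine (the file name here is a placeholder) and getting identical digests would mean the bytes arrived unchanged, so any difference in how the file looks would have to come from somewhere other than the transfer.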

So now, I’m forced to conclude that either there’s the same weird bug involved in both scp and tar, or there’s something weird about the transition of these files from one system to the other. But what, and what can I do to fix it? Incidentally, any fix absolutely must be automated, since there are tens of thousands of separate files involved here.

I would suspect the tools you’re using to view the files on either end before I’d suspect old reliables like scp and/or tar. Have you tried viewing them with something simple like ‘od’ to verify that the contents are what you think they are?
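If od isn't handy, roughly the same kind of look at the raw bytes can be had from a few lines of Python. This is just a sketch of the idea, not a replacement for od; `dump_bytes` and its parameters are invented names, and the output layout only loosely imitates a hex dump.

```python
def dump_bytes(path, limit=64, width=16):
    """Return hex-dump style lines for the first `limit` bytes of a file.

    Each line holds the offset, the bytes in hex, and a printable
    rendering in which non-printable bytes (tabs, CR, LF, ...) show as '.'.
    """
    with open(path, "rb") as handle:
        data = handle.read(limit)
    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in row).ljust(width * 3)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{offset:08x}  {hex_part} {text_part}")
    return lines
```

In the hex column, a tab shows up as 09, a line feed as 0a, and a carriage return as 0d, which makes it easy to see what actually separates the columns and lines.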

A) What are you using to view the files on OS X?
B) What do you mean by “block of text”?
C) Look at the files with BBEdit and see what the invisibles say.

There’s probably an end-of-line mismatch going on somewhere. Unix uses LF, the Mac classically used CR, and Windows uses CR/LF. Now that Mac OS is Unix based, it’s supposed to use LF, but older tools may only understand CR.
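To check which of these conventions a given file actually uses, one could count the three kinds of line ending in its raw bytes. A small Python sketch follows; the function names are made up for this example.

```python
def count_line_endings(data):
    """Count CRLF pairs, lone CRs, and lone LFs in a bytes object."""
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    return {"CRLF": crlf, "CR": cr, "LF": lf}


def count_line_endings_in_file(path):
    """Read a file in binary mode and count its line endings."""
    with open(path, "rb") as handle:
        return count_line_endings(handle.read())
```

A file written on Linux would be expected to show only LF counts. If a copy on the Mac shows the same counts as the original, the line endings were not translated in transit.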

What are you using to view the files on your Mac? The problem is most likely related to CR/LF (carriage return/line feed) and TAB codes, and to the application you use to open the file.

Linux file editors use a single <LF> to mark a line end in an ASCII file; DOS/Windows uses <CR><LF>. Some applications expand <TAB> to spaces.

All these things can conspire to create confusion. And your scp application may do some translation stuff too.

In fact, I think that is likely the case. Adding the .txt on the Linux side tells scp to treat the file as ASCII, so it (or the receiving end) translates <LF> to <CR><LF>, which works. Tar does not do any translation; it treats all the files as binary, and scp treats the tar file as binary, too, so no translation occurs.

So your solution is to rename and scp the files individually, which is easy to do in a bash script. Or get a text editor on the Mac that handles the <LF> formatted files (Emacs or vi probably have ports to the Mac).

Si
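For tens of thousands of files, the rename step suggested above could be automated along these lines. This is a sketch in Python rather than bash, and it only covers the renaming, not the copy. The function name `add_txt_extension` and the `dry_run` flag are invented for the example, and whether the .txt extension really makes a difference is still only a hypothesis at this point in the thread.

```python
from pathlib import Path


def add_txt_extension(directory, dry_run=True):
    """Append '.txt' to every extensionless regular file in `directory`.

    Returns a list of (old_path, new_path) pairs. With dry_run=True
    (the default) nothing on disk is changed, so the plan can be
    inspected first. Hidden files and name clashes are skipped.
    """
    planned = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file() or path.name.startswith("."):
            continue
        if path.suffix != "":
            continue  # already has an extension
        target = path.with_name(path.name + ".txt")
        if target.exists():
            continue  # never overwrite an existing file
        planned.append((path, target))
        if not dry_run:
            path.rename(target)
    return planned
```

Calling it once with the default `dry_run=True` and checking the returned list before calling it again with `dry_run=False` keeps the operation reversible until the plan has been confirmed. It would be safest to try it on a copy of the data directory first.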

Yeah, this sounds like a newline character convention problem.

If I’m not mistaken, SCP is kind of like FTP in that it has a binary copy mode and a text copy mode. Probably what happened is that when you transferred the text files (uncompressed) without an extension, your SCP client transferred them as raw binary, which doesn’t modify the newline encoding. Ditto for transferring them while inside a tarball, obviously. When you transferred them uncompressed with a .txt extension, it probably triggered your SCP client’s text file detection, causing it to transfer the file in text mode instead, which does modify the newline encoding to whatever is suitable for the destination system’s OS.

The text viewer you’re using on your Mac(s) apparently doesn’t like Unix-style text files. Try a different text editor, or explicitly tell your SCP client to transfer the files in ASCII mode.

It’s not an issue in the program I’m opening the files in. TextEdit, pico, and emacs all give the same behaviour.

It’s also not entirely a LF/CR issue: The contents of the files are actually being re-ordered. The original file starts off something like


    1  1.000000E+02   1.70000 36.9268  6.9374 10.0000  3.5661   0.0000
    2  5.000000E+02  -4.80012 36.9269  6.9375  9.9250  3.5662  46.2058
    3  9.120000E+02  -4.53910 36.9267  6.9374  9.8802  3.5660  45.9116
    4  1.324000E+03  -4.37721 36.9265  6.9374  9.8499  3.5658  45.7053
    5  1.756600E+03  -4.25442 36.9260  6.9373  9.8264  3.5653  45.5423
    6  2.210830E+03  -4.15454 36.9254  6.9371  9.8073  3.5647  45.4070
    7  2.687771E+03  -4.06970 36.9246  6.9369  9.7912  3.5640  45.2822
    8  3.188560E+03  -3.99550 36.9237  6.9367  9.7771  3.5630  45.1734
    9  3.714388E+03  -3.92921 36.9226  6.9364  9.7647  3.5619  45.0789
   10  4.300131E+03  -3.86561 36.9212  6.9360  9.7529  3.5605  44.9905


while the mangled file goes something like


** NEUTRON STAR COOLING (FP) ** 1.4 M : SUPER s : NOMAG : TOKYO ps              
    0    0    1    0    0   10    0    2   -2    0    3    2  300    0
    2    1    0    0    0    1    0    2    1    0   10  156  186    0
    1   11    5   18    0    7    1    0    0    5    0    0    0    1
    1  187    1    0    0    0    3  300
 1.7000000 0.2377340 0.7727632 0.0200000 1.0000000 0.0000000 0.0000000
10.000000020.0000000-9.9999905-9.9999900-1.0000000 1.0000000 4.0000000
 0.0001000-0.1000000 0.0500000 1.0000000 0.1000000 0.1000000-0.0000087
 0.1200000 0.0000000 0.0000000 0.900000099.9990005 0.0100000 0.0400000
 0.0500000 1.0000000 0.0200000 0.6020000 0.2500000 0.0000000 0.0000000
  0.000000E+00  1.000000E+02  0.000000E+00  0.000000E+00  0.000000E+00
  0.000000E+00  0.000000E+00  0.000000E+00  1.083457E+01  5.131276E+02
26 565.8460E+17
 0   0.00000         1
  -9.99999      -3.31229      -2.71451      -2.41757      -2.12063    
  -1.92577      -1.73091      -1.53605      -1.39351      -1.25096    
  -1.11277     -0.974580     -0.887020     -0.799460     -0.735560    
 -0.671660     -0.629295     -0.591160     -0.548612     -0.510670    
 -0.490529     -0.471560     -0.451385     -0.432450     -0.402538    
 -0.375480     -0.360853     -0.347000     -0.332343     -0.318510    
 -0.298733     -0.280540     -0.269609     -0.259220     -0.248266    
 -0.237890     -0.228346     -0.219268     -0.210631     -0.202410    
 -0.194484     -0.186926     -0.179717     -0.172840     -0.158060    
 -0.143270     -0.131450     -0.119620     -0.101880     -0.858000E-01
 -0.729300E-01 -0.632800E-01 -0.536400E-01 -0.490600E-01 -0.444900E-01
 -0.407500E-01 -0.370100E-01 -0.309000E-01 -0.281800E-01 -0.267851E-01


…OK, and now I just did an slogin to the Linux computer, and the original files I’m seeing there also appear to have the same thing going on. I am now officially very confused.