How to paste multiple files in parallel?

Hi all,

I am trying to paste thousands of files together into a matrix. Each file has only one column, and all the files have the same number of rows (~27k). I tried
paste * > output
as well as a few for loops, but the output only contains the columns from the first and last files. The files are formatted as follows. Each has a header that is identical to the file name:

12345
0.0
0.0
0.0

...

Please help!

Welcome to the forum.

I can't reproduce the behaviour you describe. Please show what happens / where it fails.

Thank you.
Please see below for my failed attempt to merge the three .tsv files.

-bash-4.2$ ll
total 2692
-rwxrwx--- 1 chenx302 1000000 136602 Mar 25 16:24 235423.tsv
-rwxrwx--- 1 chenx302 1000000 136587 Mar 25 16:24 263428.tsv
-rwxrwx--- 1 chenx302 1000000 136597 Mar 25 16:25 291417.tsv
-rwxrwx--- 1 kelkay01 1000000     12 Mar 28 10:50 f1.txt
-rwxrwx--- 1 kelkay01 1000000     12 Mar 28 10:50 f2.txt
-rwxrwx--- 1 kelkay01 1000000     12 Mar 28 10:50 f3.txt
-rwxrwx--- 1 chenx302 1000000 393561 Mar 21 11:16 geneID
-rwxrwx--- 1 chenx302 1000000 409786 Mar 27 18:35 new
-bash-4.2$
-bash-4.2$ paste *tsv | head
235423  291417
0.0     0.0
0.0     0.0
0.0     0.0
0.0     0.0
0.0     0.0
0.0     0.0
0.0     0.0
0.0     0.0
0.0     0.0
-bash-4.2$

Run the command again with the shell's -x (xtrace) option set (set -x), to see how the shell expands / interprets your command.
Run the command paste 235423.tsv 263428.tsv 291417.tsv
Run file *.tsv and post the output.
Run od -tx1c 235423.tsv (and other files) and post (a reasonable part of) the outputs.

I agree with RudiC that we need to see the first few lines of your input files. From the output you have shown us, the most crucial information would seem to be the output from:

for file in *.tsv
do	echo "File: $file:"
	head "$file" | od -tx1c
done

Since you haven't told us what operating system you're using: if od complains about unknown options, try od -bc instead of od -tx1c.

Before we see the output from the above commands, would anyone care to guess which of these files have DOS <CR><LF> line separators instead of UNIX line terminators? Unfortunately, even if this is the problem, I'm not seeing the output I would have expected.
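
If reading od dumps is more than you need, here is a rough sketch of a quicker check that only lists the files containing a carriage return. The scratch directory and the two sample file names (dos.tsv, unix.tsv) are made up for illustration; in your directory you would run just the grep line against your own *.tsv files.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical samples: one with DOS (CRLF) endings, one with UNIX (LF) endings.
printf '111\r\n0.0\r\n' > dos.tsv
printf '222\n0.0\n'     > unix.tsv

# Store a literal carriage return; this avoids relying on shell-specific quoting.
cr=$(printf '\r')

# List only the files that contain at least one carriage return.
crlf_files=$(grep -l "$cr" ./*.tsv)
echo "$crlf_files"
```

Only dos.tsv should be listed here. If grep lists any of your real files, they have DOS line separators.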

Hi Don,

Thanks for taking a look at this. Below is the output from the code:

File: 235423.tsv:
0000000  32  33  35  34  32  33  0d  0a  30  2e  30  0d  0a  30  2e  30
          2   3   5   4   2   3  \r  \n   0   .   0  \r  \n   0   .   0
0000020  0d  0a  30  2e  30  0d  0a  30  2e  30  0d  0a  30  2e  30  0d
         \r  \n   0   .   0  \r  \n   0   .   0  \r  \n   0   .   0  \r
0000040  0a  30  2e  30  0d  0a  30  2e  30  0d  0a  30  2e  30  0d  0a
         \n   0   .   0  \r  \n   0   .   0  \r  \n   0   .   0  \r  \n
0000060  30  2e  30  0d  0a
          0   .   0  \r  \n
0000065
File: 263428.tsv:
0000000  32  36  33  34  32  38  0d  0a  30  2e  30  0d  0a  30  2e  30
          2   6   3   4   2   8  \r  \n   0   .   0  \r  \n   0   .   0
0000020  0d  0a  30  2e  30  0d  0a  30  2e  30  0d  0a  30  2e  30  0d
         \r  \n   0   .   0  \r  \n   0   .   0  \r  \n   0   .   0  \r
0000040  0a  30  2e  30  0d  0a  30  2e  30  0d  0a  30  2e  30  0d  0a
         \n   0   .   0  \r  \n   0   .   0  \r  \n   0   .   0  \r  \n
0000060  30  2e  30  0d  0a
          0   .   0  \r  \n
0000065
File: 291417.tsv:
0000000  32  39  31  34  31  37  0d  0a  30  2e  30  0d  0a  30  2e  30
          2   9   1   4   1   7  \r  \n   0   .   0  \r  \n   0   .   0
0000020  0d  0a  30  2e  30  0d  0a  30  2e  30  0d  0a  30  2e  30  0d
         \r  \n   0   .   0  \r  \n   0   .   0  \r  \n   0   .   0  \r
0000040  0a  30  2e  30  0d  0a  30  2e  30  0d  0a  30  2e  30  0d  0a
         \n   0   .   0  \r  \n   0   .   0  \r  \n   0   .   0  \r  \n
0000060  30  2e  30  0d  0a
          0   .   0  \r  \n
0000065

What happens if you try:

for i in *.tsv
do	tr -d '\r' < "$i" > "$i.nocr"
done
paste *.nocr | head

There you are: DOS line terminators (<CR> = \r = ^M = 0x0D). paste itself does not drop anything; it is the terminal display that misleads. Each <CR> sends the cursor back to the left margin, and the <TAB> that paste inserts after it moves the cursor to the first tab stop, so every column after the first is printed at that same position and overwrites the one before it. That explains the idiosyncratic output seen in post #3: files 2 to n-1 are overwritten on screen, leaving only the first column and the column from the final file n. For a proof, try with longer column elements, or with a different paste delimiter (-d option).
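
For anyone who wants to see the effect on a small scale, here is a minimal sketch. The three files f1.tsv to f3.tsv and their contents are invented for the demonstration; they only mimic the one-column CRLF layout shown in the od dumps above.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Three hypothetical one-column files with DOS (CRLF) line endings.
for n in 1 2 3
do	printf 'col%s\r\n0.0\r\n' "$n" > "f$n.tsv"
done

# On a terminal this looks as if the middle column were missing ...
paste f1.tsv f2.tsv f3.tsv

# ... but a character dump shows every column, each followed by \r.
paste f1.tsv f2.tsv f3.tsv | od -c

# Counting the tab-separated fields of the header line confirms it.
fields=$(paste f1.tsv f2.tsv f3.tsv | awk -F'\t' 'NR == 1 { print NF }')
echo "fields in header line: $fields"
```

The field count covers all three input files, so the data is all there; only the way the terminal renders the <CR> characters hides the middle column.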

Remove the <CR> characters with e.g. the dos2unix command, or sed 's/\o015//' (the \o015 octal escape is a GNU sed extension).
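
If neither dos2unix nor GNU sed is available, tr does the same job. The following is only a sketch under assumptions: the sample files, the clean/ subdirectory and the matrix.txt output name are hypothetical, and tr is swapped in for the two tools named above. It keeps the originals untouched, pastes the cleaned copies, and then checks that no <CR> is left in the result.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1

# Hypothetical inputs: header equals the file name, DOS (CRLF) line endings.
for n in 10001 10002 10003
do	printf '%s\r\n0.0\r\n0.0\r\n' "$n" > "$n.tsv"
done

# Write CR-free copies into a separate directory, leaving the originals alone.
mkdir clean
for f in *.tsv
do	tr -d '\r' < "$f" > "clean/$f"
done

# Paste the cleaned copies side by side into one tab-separated matrix.
paste clean/*.tsv > matrix.txt
cat matrix.txt

# Verify that no carriage return survived in the result.
cr=$(printf '\r')
if grep -q "$cr" matrix.txt
then	status="CR still present"
else	status="clean"
fi
echo "$status"
```

Note that tr -d '\r' removes every carriage return, not only those at line ends, which is fine for numeric columns like these.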

It worked! Amazing! Thank you guys very much!