Combine columns from many files but keep them aligned in columns-shorter left column issue

isildur1234 · November 30, 2012, 6:37am

Hello everyone,
I searched the forum looking for answers to this but I could not pinpoint exactly what I need as I keep having trouble.
I have many files each having two columns and hundreds of rows.
first column is a string (can have many words) and the second column is a number.The files are tab separated.
Example of some rows of three files:

file1.txt

Column1 Column2
S1	31.37
S1	31.49
S1	31.74
S10	31.13
S10	31.10
S10	31.29
S11	29.49
S11	29.54
S11	29.25
S12	31.24
S12	31.05
S12	31.50
S13	32.48
S13	31.07
...

file2.txt

column 1   Column 2
1 CVS IN	23.14
1 CVS IN	23.24
1 CVS IN	23.24
1 CVS IP	21.53
1 CVS IP	21.40
1 CVS IP	21.36
1 WB IN	22.89
1 WB IN	23.02
1 WB IN	23.23
1 WB IP	26.70
1 WB IP	26.50
1 WB IP	26.67
2 CVS IN	23.34
2 CVS IN	23.34
2 CVS IN	23.14
2 CVS IP	21.38
2 CVS IP	21.49
2 CVS IP	21.54

file3.txt

Column 1 Column2
1 WB IN	23.45
1 WB IN	23.53
1 WB IP	24.55
1 CVS IN	23.62
1 CVS IN	23.46
I CVS IP	21.74
1 WB IN	23.33
1 WB IN	23.42
1 WB IP	26.24
1 CVS IN	23.71
1 CVS IN	23.44
I CVS IP	22.35
1 WB IN	24.75
1 WB IN	24.88
1 WB IP	25.50

I want to to combine the files into one.
I use the following code for the time being:

paste  $folder/*.txt  | column -s $'\t' -t > folder/output_biorad.gff

The problem is that the first file in the file is smaller than the other ones and the output, when the rows of the first file end gets misaligned.
The columns of the second file shift to the left, and all the others follow etc.
I have many files and this is getting very confusing for the users, who are biologists and do not want to manipulate datasets. When there is a shorter left column the right columns become missaligned.

This is an example output that shows the problem: the problem starts at the line that i show as <----problem. the column I CVS IP should be the third column but it shifts to second.
.......

8 WB IP	25.53	8 WB IN	22.98	7 WB IN	25.7	3 CVS IN
8 CVS IN	23.44	8 WB IP	28.75	7 WB IP	26.59	3 CVS IP
8 CVS IP	22.25	8 WB IP	28.83	7 WB IP	26.16	3 CVS IP
8 CVS IP	22.37	8 WB IP	29.05	7 WB IP	26.28	3 CVS IP
8 WB IN	24.47	I CVS IP	 	8 CVS IN	25.7	3 WB IN     
8 WB IP	26.05	I CVS IP	 	8 CVS IN	25.31	3 WB IN
8 WB IP	26	I CVS IP	 	8 CVS IN	26.11	3 WB IN
8 CVS IN	25.11	I CVS IP	 	8 CVS IP	25.95	3 WB IN
8 CVS IP	23.19	I CVS IP	 	8 CVS IP	25.48	3 WB IN
8 CVS IP	23.19	I CVS IP	 	8 WB IN	25.83	3 WB IN
8 WB IN	24.18	I CVS IP	 	8 WB IN	26.1	3 WB IP
8 WB IP	33.73	I CVS IP	 	8 WB IN	25.81	3 WB IP
8 WB IP	33.75	I CVS IP	 	8 WB IP	26.74	3 WB IP
8 CVS IN	24.24	I CVS IP	 	8 WB IP	26.73	3 CVS IN
8 CVS IP	23.22	I CVS IP	 	8 WB IP	26.75	3 CVS IN
8 CVS IP	23.83	I CVS IP	 	I CVS IP	 	3 CVS IN
	I CVS IP	 	I CVS IP	 	3 CVS IN	              <--problem
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IP	 
	I CVS IP	 	I CVS IP	 	3 CVS IP	 
	I CVS IP	 	I CVS IP	 	3 CVS IP	 
	I CVS IP	 	I CVS IP	 	3 WB IN	 
	I CVS IP	 	I CVS IP	 	3 WB IN	 
	I CVS IP	 	I CVS IP	 	3 WB IN	 
	I CVS IP	 	I CVS IP	 	3 WB IN	 
	I CVS IP	 	I CVS IP	 	3 WB IN	 
	I CVS IP	 	I CVS IP	 	3 WB IN	 
	I CVS IP	 	I CVS IP	 	3 WB IP	 
	I CVS IP	 	I CVS IP	 	3 WB IP	 
	I CVS IP	 	I CVS IP	 	3 WB IP	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IN	 
	I CVS IP	 	I CVS IP	 	3 CVS IP	 
	I CVS IP	 	I CVS IP	 	3 CVS IP	 
	I CVS IP	 	I CVS IP	 	3 CVS IP	 
	4V CVS IN	 	I CVS IP	 	3 WB IN	 
	4V CVS IN	 	I CVS IP	 	3 WB IN	 
	4V CVS IN	 	I CVS IP	 	3 WB IN	 
	4V CVS IN	 	I CVS IP	 	3 WB IP

When the lines from the firs file end all utput is shifted to the left. How can I keep them aligned?
i looked everywhere and I have not been able to find out how to solve it.
I am not entirely familiar with awk, i use perl for my other scripting and i am a beginner in shell scripting.
Many thanks for your help in advance

rdrtx1 · November 30, 2012, 12:44pm

try:

awk '
{
(FNR>m)? m=FNR:0;
f[FILENAME]++ ? 0:fc++;
a[fc-1,FNR]=$0;
}
END {
  for (j=1;j<=m;j++) {
     for (i=0;i<fc;i++) {
       printf (a[i,j])? (a[i,j] "\t"):("\t\t");
     }
     print "";
  }
}
' file*.txt > out.xls

isildur1234 · December 3, 2012, 1:49am

Thank you so much. This is working and aligning the columns correctly!
If you have time, can youvery very briefly explain how it works, why your for statements manage to align the columns, especially what happens at the printf?

for (j=1;j<=m;j++) {
     for (i=0;i<fc;i++) {
       printf (a[i,j])? (a[i,j] "\t"):("\t\t");

awk seems to be a very powerful language for file manipulation. How did you learn it? Textbook or website?
Again, thank you so much.

rdrtx1 · December 3, 2012, 11:38am

awk '{
(FNR>m)? m=FNR:0;                                # use file record number as max record counter
f[FILENAME]++ ? 0:fc++;                          # set new file counter (++ fail indicates new file)
a[fc-1,FNR]=$0;                                  # store data in two column,row array (file, record)}
END {  
  for (j=1;j<=m;j++) {                           # record counter loop     
    for (i=0;i<fc;i++) {                         # file counter loop       
       printf (a[i,j])? (a[i,j] "\t"):("\t\t");  # print tab separated data or just tabs if empty     
    }
    print "";                                    # print new line character  
  }
}' file*.txt > out.xls

Learned by reading awk book and samples on the web. This site has most excellent solutions all over. Search forums for examples.

nex_asp · December 3, 2012, 11:34pm

awk: cmd. line:15: fatal: cannot open file `file*.txt' for reading (No such file or directory)

if you try 

a.txt b.txt > out.xls

problem is you need to align again manually...

isildur1234 · January 23, 2013, 8:09am

Right now my input to this code file*.txt > out.xls
outputs the files pasted together but the files are pasted randomly. I want to read the files from my folder alphabetical lexicographical order and paste them together.
is there a way to sort my input files alphabetically before I run this code?
I tried sort file*.txt but it doesnt work, nor does ls-1|folder.