Advanced: Sort, count data in column, append file name

Hi. I am not sure the title gives an optimal description of what I want to do. Also, I tried to post this in the "UNIX for Dummies Questions & Answers" forum, but it seems no one was able to help out.

I have several text files that contain data in many columns. All the files are organized the same way, but the data in the columns may differ. I want to count the number of times data occur in specific columns, sort the output, and write a new file. More precisely, I want to check several files for occurrences of the same data, count how many times each occurs, append the file name to each occurrence, and write a new file sorted by the number of occurrences.

File 1:

xx xx xx aab rrt xx 
xx xx xx ccd bbt xx 
xx xx xx ggt iir xx

File 2:

xx xx xx ggt iir xx
xx xx xx ccd bbt xx

File 3:

 
xx xx xx aab rrt xx 
xx xx xx ggt iir xx 

First I modified the files individually (is there a better way?) to make the file name appear in the first column:

sed 's/^/File1\t/' file1.temp > 1.txt

This gives files with:

File1:

File1 xx xx xx aab rrt xx 
File1 xx xx xx ccd bbt xx 
File1 xx xx xx ggt iir xx 

File2:

File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx  

File3:

File3 xx xx xx aab rrt xx 
File3 xx xx xx ggt iir xx 
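On the "any better way?" question: awk's built-in FILENAME variable can prepend the file name in a single pass, with no per-file sed edits. A minimal sketch, assuming the originals are named file1, file2, file3 (note it uses the literal file name, so the tag comes out as file1 rather than File1 unless the files are renamed):

```shell
# Tag every line with its source file name in one pass; FILENAME is the
# name of the file awk is currently reading.
awk '{print FILENAME "\t" $0}' file1 file2 file3 > tagged.txt
```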

Then I extracted the columns of interest, sorted them, and wrote a new file:

awk '{print $1,$5,$6}' *.txt |sort -k2 > output.txt

The output.txt file could look like this:

File1 aab rrt 
File3 aab rrt 
File1 ccd bbt 
File2 ccd bbt 
File2 ggt iir
File3 ggt iir 
File1 ggt iir

Now, I want to count the number of lines on which the combination of column 2 and column 3 is identical, and keep the first-column information in the output file, separated by commas or similar. I want the result to look like this:

2 ccd bbt File1,File2
2 aab rrt File1,File3
3 ggt iir File1,File2,File3

It would be good (but not a requirement) to have the last column in the final file sorted: lane1, lane2, lane3, etc. The lane* entries can also be separated into columns if that is easier.

So far I have tried to use:

awk '{print $1,$5,$6}' *.txt |sort -k2|uniq -f1 -c|sort -g > final_output.txt

However, I am not able to get the first-column data merged in the final output file. How should I go about doing that?

-James
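For reference, the uniq -f1 -c step in the attempt above can only count: it skips field 1 when comparing lines, but it keeps only the first line of each group, so the file tags of the other group members are discarded rather than merged. A quick demonstration on the sorted sample output:

```shell
# uniq -f1 -c counts groups (ignoring field 1 in the comparison) but
# keeps only the FIRST line of each group; the other tags are lost.
printf 'File1 aab rrt\nFile3 aab rrt\nFile1 ccd bbt\nFile2 ccd bbt\nFile1 ggt iir\nFile2 ggt iir\nFile3 ggt iir\n' |
    uniq -f1 -c | sort -g
```

Every surviving line carries only File1, which is why merging the tags needs awk or perl.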

Something like this:

$ cat file[123]
File1 xx xx xx aab rrt xx
File1 xx xx xx ccd bbt xx
File1 xx xx xx ggt iir xx
File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx
File3 xx xx xx aab rrt xx
File3 xx xx xx ggt iir xx
$
$ perl -lane '$x{"$F[4] $F[5]"} .= "$F[0],"; END{for(keys %x){$x{$_}=~s/,$//;print "$_ $x{$_}"}}' file1 file2 file3
ggt iir File1,File2,File3
ccd bbt File1,File2
aab rrt File1,File3
$

Wow! Thanks for the swift reply. I am almost there. However, upon running the perl one-liner I got a comma in the wrong place:

perl -lane '$x{"$F[4] $F[5]"} .= "$F[0],"; END{for(keys %x){$x{$_}=~s/,$//;print "$_ $x{$_}"}}' file1 file2 file3

Results in:

ggt iir File1,File2,File3
  ,
 ccd bbt File1,File2
aab rrt File1,File3

Also, I would like to have the count listed in the first column:

3 ggt iir File1,File2,File3
2 ccd bbt File1,File2
2 aab rrt File1,File3

Any way of implementing this? Do I need to run uniq -c before I run the perl script?

Hi

You have empty lines in the files; that is why you get a comma in the "wrong" place.
To add the count and get the output you want, change the code to:

perl -lane '$c{"$F[4] $F[5]"}++; $x{"$F[4] $F[5]"} .= "$F[0]," if $F[5]; END{for(keys %x){$x{$_}=~s/,$//;print "$c{$_} $_ $x{$_}"}}' file[123]

Can the order of file1, file2, file3 vary?
For example, can the output contain an order like file2, file1, file3?

Thanks, that did what I wanted. What I did myself yesterday before reading your reply was to run:

awk -F "lane" '{print NF-1}' perl_output_file > new_count_file

followed by:

paste new_count_file perl_output_file > final_output_with_count_file

But the perl code is more impressive. From this exercise, being a biologist trying to do some simple bioinformatics, I really want to learn more Unix/script/shell programming. Wow, so powerful. :slight_smile:

awk 'BEGIN{i=1}{
    x=$1;                   # file tag from column 1
    $1=y;                   # y is unset, so this blanks the tag; $0 becomes the key
    if(!match(c[$0],x))     # tag not yet listed for this key
        {
            if(c[$0])
                c[$0]=c[$0]","x
            else
                c[$0]=x
        };
    if(a[$0])
        a[$0]++
    else
        {
            a[$0]=1;
            b[i]=$0;        # b must be an array: END loops over b[k]
            i++
        }
}
END{for(k=1;k<i;k++){print a[b[k]],b[k],c[b[k]]}}' filename

output is

2  aab rrt File1,File3
2  ccd bbt File1,File2
3  ggt iir File2,File3,File1

sort on column two if you need output sorted on column two.

This is also very interesting, as it might be easier for me to modify. However, when running it, it seems to output all columns, while counting correctly and adding the "lane#" to each new line. The data in my columns, illustrated with x, does not just contain x; it might contain any letter or character. Is this the problem with the above awk command? How are the right columns (e.g. 4 and 5) selected and printed in the awk command?

In x I am storing the name of the file for that particular row; y is nothing, it is only used to make the $1 value null.

Ok, but how is this awk command selecting which columns to include? From the script I cannot see how columns 4 and 5 are used to compare their values. I am sorry for all the questions, but I am trying to learn as much as possible. It might be that I want to include more columns later on, and then it would be very useful if I could make small changes to the script. I am very happy with your time so far, so if you do not have time, please feel free to let it slide :o

I am storing each unique column's count in the associative array a[], and at the end I print the count for each unique occurrence. The unique columns are stored in the array b[], which is used at the end for printing them.

Ok, I see. That is why it is not working for:

File1 bb xx xx aab rrt xx
File1 xx xx xx ccd bbt xx
File1 xx xx xx ggt iir xx
File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx
File3 aa xx xx aab rrt xx
File3 xx xx xx ggt iir xx 

as it will print it as:

1  bb xx xx aab rrt xx File1
2  xx xx xx ccd bbt xx File1,File2
3  xx xx xx ggt iir xx File1,File2,File3
1  aa xx xx aab rrt xx File3

It does not compare only columns 5 and 6 but all columns, and hence it says that the 1st and 6th lines differ in column 2, even though I want to know when the lines are identical in columns 5 and 6. From the above dataset I want to get, independent of what is in the other columns, specific columns, e.g. 5 and 6:

2  aab rrt File1
2  ccd bbt File1,File2
3  ggt iir File1,File2,File3

Is it possible to specify, in the suggested awk command, which columns to compare and display/save?

I have seen the powerful perl program and would like to learn more.

Also, I would like to add a few more columns to the output file.

From the perl program given earlier:

perl -lane '$c{"$F[4] $F[5]"}++; $x{"$F[4] $F[5]"} .= "$F[0]," if  $F[5]; END{for(keys %x){$x{$_}=~s/,$//;print "$c{$_} $_ $x{$_}"}}'  file[123]

I decided to have a go (my modifications are the added $b{...} parts):

perl -lane '$b{"$F[3]"};$c{"$F[4] $F[5]"}++; $x{"$F[4] $F[5]"} .= "$F[0]," if $F[5]; END{for(keys %x){$x{$_}=~s/,$//;print "$c{$_} $_ $x{$_} $b{$_}"}}'  file[123] > output.file

But this is not adding column 3 to the end of each line in the output file. What am I missing here? I have very little knowledge, but I would really like to know how to modify the program so that I can add more columns to the output file, or perform other small changes. Again, thanks for the help so far.

Try this:

awk 'BEGIN{i=1}{
    x=$1;                   # file tag from column 1
    y=$5":"$6;              # comparison key built only from columns 5 and 6
    if(!match(c[y],x))      # tag not yet listed for this key
        {
            if(c[y])
                c[y]=c[y]","x
            else
                c[y]=x
        };
    if(a[y])
        a[y]++
    else
        {
            a[y]=1;
            b[i]=y;         # b must be an array: END loops over b[k]
            i++
        }
}
END{for(k=1;k<i;k++){split(b[k],d,":");print a[b[k]],d[1],d[2],c[b[k]]}}' inputfile

for input

File1 bb xx xx aab rrt xx
File1 xx xx xx ccd bbt xx
File1 xx xx xx ggt iir xx
File2 xx xx xx ggt iir xx
File2 xx xx xx ccd bbt xx
File3 aa xx xx aab rrt xx
File3 xx xx xx ggt iir xx

and output is

2 aab rrt File1,File3
2 ccd bbt File1,File2
3 ggt iir File1,File2,File3

Thanks Raj.

After a lot of testing (I know close to nothing about Linux/Unix/awk) I managed to make some modifications to the script:

awk 'BEGIN{i=1}{
    x=$1;
    y=$5":"$6;
    p="chromosome"$7"-chromosome"$9;
    if(!match(c[y],x))
        {
            if(c[y])
                c[y]=c[y]","x
            else
                c[y]=p" "x
        };
    if(a[y])
        a[y]++
    else
        {
            a[y]=1;
            b[i]=y;
            i++
        }
}
END{for(k=1;k<i;k++){split(b[k],d,":");print a[b[k]],d[1],d[2],c[b[k]]}}'

These modifications enabled me to extract other columns of interest and add them to the output file. I am now so excited that I want to learn more. Each time I hit "enter" on my keyboard and see a new line knowing that the script has been run within milliseconds - I just get so impressed. Thanks for all the help, you have recruited a Windows user to shell programming and scripting in Linux. Awk awk :slight_smile: