Read column and find differences...

empyrean · October 29, 2012, 7:25pm

I have this tab-separated file:

427	A	C	A/C	12
436	G	C	G/C	12
445	C	T	C/T	12
447	A	G	A/G	9
451	T	C	T/C	5
456	A	G	A/G	12
493	G	A	G/A	12

I want to read the first column and, for each ID, find all the other IDs that differ from it by 10 or less. They should be appended as a sixth column:

427	A	C	A/C	12	436
436	G	C	G/C	12	427,445
445	C	T	C/T	12	436,447,451
447	A	G	A/G	9	445,451,456
451	T	C	T/C	5	445,447,456
456	A	G	A/G	12	451,447
493	G	A	G/A	12

The last column should look like the above: all IDs that are within + or - 10 bases of that specific ID. For example, for 436 the boundaries are 426 to 446; the other IDs in that range are 427 and 445, so I listed them in the 6th column.

agama · October 29, 2012, 9:12pm

Assuming there are no duplicate values in field 1 and that the whole file fits in memory, this should work:

awk '
    { a[$1+0] = $0; }
    END {
        for( x in a )
        {
            printf( "%s", a[x] );
            sc = "\t";
            for( i = x-10; i <= x + 10; i++ )
                if( i != x  &&  i in a )
                {
                    printf( "%s%d", sc, i );
                    sc = ",";
                }
            printf( "\n" );
        }
    }
' infile

Chubler_XL · October 29, 2012, 9:13pm

How about:

awk 'FNR==NR{k[$1];next}
{v=x;
 for(i=$1-10;i<=$1+10;i++) if(i!=$1&&i in k) v=v","i;
 $(NF+1)=substr(v,2)} 1' OFS="\t" infile infile

empyrean · October 29, 2012, 9:22pm

@agama: Thank you. The code works great; the only thing is that it's not printing in ascending order.

@Chubler_XL: Thank you, this works great! Can you explain how it works? I'd like to understand the code.

Chubler_XL · October 29, 2012, 9:41pm

It makes two passes over the file (this is why the filename is passed on the command line twice).

First pass stores all the IDs (field 1) in k[]:

FNR==NR{k[$1];next}
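
To see the two-pass idiom on its own, here is a minimal sketch. The file ids.txt and its two IDs are made up for the demonstration. NR counts records across all files while FNR restarts at 1 for each file, so FNR==NR is only true while the first file argument is being read (assuming that file is not empty).

```shell
# Hypothetical sample file, created only for this demonstration.
tmp=$(mktemp -d)
printf '427\n436\n' > "$tmp/ids.txt"

# Pass 1 (FNR==NR): remember each ID as an array key, then skip to the next line.
# Pass 2: report whether each ID was stored during pass 1.
passes=$(awk 'FNR==NR { k[$1]; print "pass 1: store " $1; next }
              { print "pass 2: " $1 (($1 in k) ? " is known" : " is new") }' \
         "$tmp/ids.txt" "$tmp/ids.txt")
printf '%s\n' "$passes"
rm -rf "$tmp"
```

Both lines are stored first, and only then does the second block start running.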

The second pass first resets the string v to empty (x is never set, so v=x clears v), then appends every ID within 10 of the current one, excluding the current line's own ID of course:

v=x; for(i=$1-10;i<=$1+10;i++) if(i!=$1&&i in k) v=v","i;

The leading comma is then stripped from v and the result is added as a new field at the end of the line:

$(NF+1)=substr(v,2)
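
The list-building and comma-stripping steps can be tried without any input file. This is only a sketch: the IDs are copied from the sample data in the question, 445 is picked as the "current" ID, and the names ids, id and neighbours are introduced just for the illustration.

```shell
# IDs from the sample data in the question; 445 plays the role of the current line.
neighbours=$(awk 'BEGIN {
    n = split("427 436 445 447 451 456 493", ids, " ")
    for (j = 1; j <= n; j++) k[ids[j]]      # same role as k[] after the first pass
    id = 445
    v = ""                                  # same effect as v=x with x unset
    for (i = id - 10; i <= id + 10; i++)
        if (i != id && i in k) v = v "," i  # every hit is prefixed with a comma
    print v                                 # raw string, leading comma included
    print substr(v, 2)                      # leading comma stripped
}')
printf '%s\n' "$neighbours"
```

The first output line is ,436,447,451 and the second is 436,447,451, which matches the expected sixth column for 445.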

The trailing 1 is an always-true pattern with no action, so awk applies its default action and prints the current line (i.e. the line that just had v's contents appended).

OFS="\t" sets the output field separator to TAB.
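
A small sketch of how the last three pieces interact, using one line from the sample data and a hard-coded "436" in place of the computed list (both are assumptions for the demo). Assigning to a field makes awk rebuild the whole line using OFS, which is why the separator matters here.

```shell
# With OFS set to TAB the rebuilt line stays tab-separated.
tabbed=$(printf '427\tA\tC\tA/C\t12\n' |
         awk '{ $(NF+1) = "436" } 1' OFS="\t")

# Without it, awk rejoins the fields with the default OFS, a single space.
spaced=$(printf '427\tA\tC\tA/C\t12\n' |
         awk '{ $(NF+1) = "436" } 1')

printf '%s\n%s\n' "$tabbed" "$spaced"
```

The second result comes out as "427 A C A/C 12 436", so the tabs of the input are lost unless OFS is set.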

durden_tyler · October 30, 2012, 12:22am

$ cat f13
427     A       C       A/C     12
436     G       C       G/C     12
445     C       T       C/T     12
447     A       G       A/G     9
451     T       C       T/C     5
456     A       G       A/G     12
493     G       A       G/A     12
$ perl -lane '$x{$F[0]} = [ @F ];    # store each row, keyed by its ID
              END {
                # sort the IDs numerically so rows print in ascending order
                foreach $k (sort { $a <=> $b } keys %x) {
                  foreach $i ($k-10..$k+10) {
                    push (@y, $i) if defined $x{$i} and $i != $k;
                  }
                  printf ("%-7s %-7s %-7s %-7s %-7s %s\n",@{$x{$k}},join(",",@y));
                  @y=()
                }
              }' f13
427     A       C       A/C     12      436
436     G       C       G/C     12      427,445
445     C       T       C/T     12      436,447,451
447     A       G       A/G     9       445,451,456
451     T       C       T/C     5       445,447,456
456     A       G       A/G     12      447,451
493     G       A       G/A     12

tyler_durden

pamu · October 30, 2012, 2:56am

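awk 'FNR==NR{X[$1];next}                 # pass 1: collect every ID
    {for(i in X)                         # pass 2: test each stored ID
        # squared difference <= 100 means |i-$1| <= 10
        if((i-$1)*(i-$1)<=100 && i != $1) a[$1]=a[$1]?a[$1]","i:$0"\t"i
     print a[$1]?a[$1]:$0                # for(i in X) order is unspecified
    }' file file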

agama · October 30, 2012, 9:51am

Pipe it through sort.

awk '
    { a[$1+0] = $0; }
    END {
        for( x in a )
        {
            printf( "%s", a[x] );
            sc = "\t";
            for( i = x-10; i <= x + 10; i++ )
                if( i != x  &&  i in a )
                {
                    printf( "%s%d", sc, i );
                    sc = ",";
                }
            printf( "\n" );
        }
    }
' infile | sort -k1,1n
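
One caveat, assuming the IDs can have different numbers of digits: a plain sort compares the lines as text, so a four-digit ID can land before a three-digit one. A numeric sort avoids that. The two IDs below are made up to show the difference.

```shell
# Text comparison: "1002" sorts before "493" because "1" comes before "4".
text_order=$(printf '1002\n493\n' | sort | paste -sd ' ' -)

# Numeric comparison: 493 comes before 1002.
num_order=$(printf '1002\n493\n' | sort -n | paste -sd ' ' -)

printf 'sort:    %s\nsort -n: %s\n' "$text_order" "$num_order"
```

With the sample file every ID has three digits, so both forms give the same order there.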