Matching multiple fields from two files and then some?

mbp · June 18, 2012, 3:49am

Hi,
I am working with two tab-delimited files with multiple columns, formatted as follows:

File 1:

  >chrom 1       100     A          G          20       ...(10 columns)
  >chrom 1       104     G          C          18       ...(10 columns)
  >chrom 2       28       T          C          44       ...(10 columns)
  etc.

File 2:

  >chrom 1       200     269     333     396     ...(variable, odd number of columns)
  >chrom 2       15       114     207     273     400     496     ...(variable, odd number of columns)
  etc.

I am trying to find a way (in Unix/Linux) to do the following:

1) print all lines from file 1 where:
a) the entries in column 1 match for both file 1 and file 2 AND
b) the number in column 2 of file 1 is within 1000 of any of the numbers (i.e. column 2 onwards) in the matching line in file 2 from part "a".

2) print all lines from file 1 where:
a) same as "a" above AND
b) the number in column 2 of file 1 is equal to or between the numbers in columns 2 and 3, 4 and 5, 6 and 7, etc. of file 2, for as many pairs of numerical columns that there are for that particular line of file 2.

I have a feeling I might be in over my head here, but any help would certainly be appreciated. Is it possible that this can be done with awk? Thanks!

P.S. Both files are currently sorted on the first column.

Chubler_XL · June 18, 2012, 9:27pm

How about these:

1)

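awk -F'\t' 'FNR==NR {
  # First file (file2): store every number on the line, keyed by column 1
  Rng[$1]=$2","$3
  for(i=4;i<NF;i+=2) Rng[$1]=Rng[$1]","$i","$(i+1)
  next }
($1 in Rng){
  # Second file (file1): print the line if column 2 is within 1000 of any stored number
  c=split(Rng[$1],v,",");
  for(i=1;i<=c;i++)
    if($2 >= v[i]-1000 && $2 <= v[i]+1000) { print; next }
}' file2 file1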

2)

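awk -F'\t' 'FNR==NR {
  # First file (file2): store the start,end pairs, keyed by column 1
  Rng[$1]=$2","$3
  for(i=4;i<NF;i+=2) Rng[$1]=Rng[$1]","$i","$(i+1)
  next }
($1 in Rng){
  # Second file (file1): print the line if column 2 falls inside any start,end pair
  c=split(Rng[$1],v,",");
  for(i=1;i<c;i+=2)
    if($2 >= v[i] && $2 <= v[i+1]) { print; next }
}' file2 file1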

mbp · June 18, 2012, 11:21pm

Hi Chubler,
thanks very much for that valiant effort. However, unless I am missing something I don't think it is working yet. I put both commands into a shell script and tested it on two abbreviated files that should produce output if the commands are working (I also embedded an echo 'hello' in the script to make sure the script itself was put together correctly). Either with output to stdout or directed to a file, I don't catch any lines. I'll keep checking to make sure I have things set up correctly on my end. Any other thoughts?

Thanks again!

Chubler_XL · June 19, 2012, 8:25pm

Seems to be working on your test files OK (see transcript below).

Perhaps your actual data doesn't match the posted test files?

$ cat file1
chrom 1 100     A       G       20      ...(10 columns)
chrom 1 104     G       C       18      ...(10 columns)
chrom 2 28      T       C       44      ...(10 columns)
$ cat file2
chrom 1 200     269     333     396
chrom 2 15      114     207     273     400     496
$ awk -F'\t' 'FNR==NR {
>   Rng[$1]=$2","$3
>   for(i=4;i<NF;i+=2) Rng[$1]=Rng[$1]","$i","$(i+1)
>   next }
> ($1 in Rng){
>   c=split(Rng[$1],v,",");
>   for(i=1;i<c;i+=2)
>     if($2 >= v[i] && $2 <= v[i+1]) { print; next }
> }' file2 file1
chrom 2 28      T       C       44      ...(10 columns)
chrom 2 28      T       C       44      ...(10 columns)
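
The "within 1000" request (part 1) can be checked in a similar way. The following is only a sketch, not the exact command posted above: it assumes tab-delimited input, builds its own small test files in a temporary directory, and uses an array named pos that is not from the thread. The line with position 5000 is a made-up entry, added so that at least one line is expected to be filtered out.

```shell
#!/bin/sh
# Build small tab-delimited test files; the 5000 line is a made-up non-match.
tmp=$(mktemp -d)
printf 'chrom 1\t100\tA\tG\t20\n'  >  "$tmp/file1"
printf 'chrom 1\t5000\tA\tG\t20\n' >> "$tmp/file1"
printf 'chrom 2\t28\tT\tC\t44\n'   >> "$tmp/file1"
printf 'chrom 1\t200\t269\t333\t396\n'          >  "$tmp/file2"
printf 'chrom 2\t15\t114\t207\t273\t400\t496\n' >> "$tmp/file2"

result=$(awk -F'\t' '
FNR == NR {                 # first file (file2): remember all positions per key
  pos[$1] = $2
  for (i = 3; i <= NF; i++) pos[$1] = pos[$1] "," $i
  next
}
($1 in pos) {               # second file (file1): key must also exist in file2
  n = split(pos[$1], v, ",")
  for (i = 1; i <= n; i++)  # test every stored position, including the last one
    if ($2 >= v[i] - 1000 && $2 <= v[i] + 1000) { print; next }
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With these test files, the lines at positions 100 and 28 should be printed and the made-up 5000 line should be dropped, because 5000 is more than 1000 away from every number on the chrom 1 line of file2.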

mbp · June 19, 2012, 10:27pm

Aha, you were absolutely correct! I apologize, as I did have a formatting issue in the first column of the test files. Once I fixed that, the awk command works perfectly. Really amazing stuff, and thanks so much again!

mbp
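
For anyone hitting the same "no output" symptom: the thread does not say what the formatting issue in the first column was, so the following is only a sketch of one way to look for such a mismatch. It assumes, purely for illustration, that the key in file1 has a stray trailing space; the test files and variable names are made up.

```shell
#!/bin/sh
# Hypothetical case: the key in file1 has a trailing space, so it never matches file2.
tmp=$(mktemp -d)
printf 'chrom 1 \t100\tA\tG\t20\n' > "$tmp/file1"
printf 'chrom 1\t200\t269\n'       > "$tmp/file2"

# Print each distinct first field in brackets so stray whitespace becomes visible.
keys1=$(awk -F'\t' '!seen[$1]++ { print "[" $1 "]" }' "$tmp/file1")
keys2=$(awk -F'\t' '!seen[$1]++ { print "[" $1 "]" }' "$tmp/file2")
printf 'file1 keys: %s\nfile2 keys: %s\n' "$keys1" "$keys2"

# List file1 keys that have no exact counterpart in file2.
missing=$(awk -F'\t' 'FNR == NR { k[$1]; next } !($1 in k) { print "[" $1 "]" }' \
  "$tmp/file2" "$tmp/file1")
printf 'unmatched in file1: %s\n' "$missing"
rm -rf "$tmp"
```

If the last line reports any keys, the `($1 in Rng)` test in the commands above fails for those lines, which would explain an empty result even when the numbers themselves are in range.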