Using columns from 2 files and extracting string

alpesh · October 19, 2011, 3:43pm

Hi All,

I have 2 files with a common column.

File 1 looks like

NAME   START     POS1  POS2  
N1          1234        1236    1237 
N2          1245        1248    1250
..
..

File 2 looks like

NAME   STRING  
N1         ABCDEFGH
N2         EFGHBCD
N3         PQRSSTUV
..
......

The string named N1 starts at position (1234+1)=1235 and continues upto 1242
and has 8 characters as described in the second file. I am interested in extracting the sub-string between 1236(POS1 for N1) and 1237(POS2 for N1) not including the start i.e (1237 for N1) and (1249 and 1250 for N2).
So the expected answer is D for N1 and HB for N2. The answer can appear as a new column in file1.

It seems fairly simple but I`m struggling to find the right commands.Please help.

Thanks,
Alpesh

radoulov · October 19, 2011, 5:16pm

Could you please elaborate more? How the length of the string ABCDEFGH and the positions POS1 and POS2 relate to each other?

alpesh · October 19, 2011, 5:32pm

the string ABCDEFGH starts from the start position column+1. So it starts from position 1235, the character A is at position 1235 and the last character H is at 1242. I want to extract the sub-string starting at POS1 and ending at POS2 not including POS1. So in this case the result would be the character in position 1237 which is D. For N2, the substring is between 1248 and 1250, basically 1249 and 1250 not including 1248, the result is HB.

radoulov · October 19, 2011, 5:46pm

If A is at position 1235, D is at position1238, not at 1237, or I'm missing something?

alpesh · October 19, 2011, 5:50pm

sorry ..you are correct !

vgersh99 · October 19, 2011, 5:51pm

I'm not sure if 'START' position really matters - you're extracting from POS1 to POS2 anyway. Or are POS1 and POS2 relative to the START?
something along these lines - debug with a sample file1:

N1          1234        2    5
N2          1245        3    6

nawk 'FNR==NR{fs[$1]=$3;fe[$1]=$4;next} $1 in fs {print substr($2,fs[$1],fe[$1]-fs[$1])}' file1 file2

radoulov · October 19, 2011, 5:56pm

This is what I get with your files:

awk 'NR == FNR {
  f2[$1] = $2; next
  }
$1 in f2 {
  print $0, substr(f2[$1], $3 - $2, $4 - $3) 
  }' file2 file1

% awk 'NR == FNR {
  f2[$1] = $2; next
  }
$1 in f2 {
  print $0, substr(f2[$1], $3 - $2, $4 - $3)
  }' file2 file1
NAME   START     POS1  POS2   
N1          1234        1236    1237  B
N2          1245        1248    1250 GH

You may need to adjust the numbers

Or this:

% awk 'NR == FNR {
  f2[$1] = $2; next
  }
$1 in f2 {
  print $0, substr(f2[$1], $3 - $2 + 2, $4 - $3)           
  }' file2 file1                        
NAME   START     POS1  POS2   
N1          1234        1236    1237  D
N2          1245        1248    1250 BC

alpesh · October 20, 2011, 3:10pm

Thanks radoulov and vgersh99. I will process my 200MB files and let you guys know if I have issues.

---------- Post updated 10-20-11 at 02:10 PM ---------- Previous update was 10-19-11 at 05:07 PM ----------

Hi All,

The code is running with the test dataset but not with the big ones.
Is there a way I could share a portion of my actual input files ?

vgersh99 · October 20, 2011, 3:19pm

You can attach files to your postings - it's a paper clip icon on the first row of icons.

alpesh · October 20, 2011, 3:41pm

Hi vgersh99 and radoulov,

I used the following code from radoulov but it returns me the contents of file 1 without adding the output column to it. Here $4 of file1 has to be matched with $1 of file2, $8 and $9 in file1 are the POS1 and POS2, $2 is the START column. $10 in file2 should be used to extract the sub-string.

awk 'NR == FNR {
  f2[$1] = $10; next
  }
$4 in f2 {
  print $0, substr(f2[$1], $8 - $2 , $9 - $8) 
  }' file2_truncated.txt file1_truncated.txt  | head

Thanks,
Alpesh

radoulov · October 20, 2011, 4:08pm

Execute dos2unix file2_truncated.txt file1_truncated.txt .
Try this and check the result:

awk 'NR == FNR {
  f2[$1] = $10; next
  }
$4 in f2 {
  print $0, substr(f2[$4], $8 - $2 , $9 - $8) 
  }' file2_truncated.txt file1_truncated.txt | 
    head

alpesh · October 20, 2011, 4:15pm

Thanks radoulov. Looks like I do not have dos2unix installed , I have to ask the admin to install.
Is there anything else I could try?

radoulov · October 20, 2011, 4:30pm

Yes, this should do the trick:

perl -i -pe's/\r\n/\n/g' file2_truncated.txt file1_truncated.txt

vgersh99 · October 20, 2011, 4:33pm

tr -d '\015' < myFile > myNewFile

alpesh · October 20, 2011, 5:33pm

outstanding !! thank you

alpesh · October 27, 2011, 1:03pm

Hi All,

Thanks for all the help. I am faced with a little complication now with respect to the last problem.The substring to be extracted must take care of the padded S values(column 6 in file2).
This column has values like 36M, 5S31M , 32M4S etc. The number associated with the leading S must be added to the starting position of the substring in the original string.So for 5S31M, 5 must be added to ($8-$2) of the print statement in the code. 32M4S should be ignored and treated as before, since it does not have a leading S. By leading S, I meant the character 'S' must be present at the beginning of the value of column 6, if the value is 2S30M4S, only the value 2 (associated with the leading S) and not 4 is to be considered for calculations. I hope I`m clear.

Thanks again,
Alpesh

---------- Post updated at 12:03 PM ---------- Previous update was at 09:52 AM ----------
This is what I came up with. Doesn't work, please help debug/change.

awk 'NR == FNR {
   f2[$1] = $10;
   f2[$2]=$6;next
   }
 $4 in f2 {
awk '{x=f2[$6];gsub(/[0-9]+[^0-9S]/,z);p=$1+0};
   print $0 "\t" substr(f2[$4], $8 - $2 +1 + $p, $9 - $8)
   }'  file2_truncated.txt file1_truncated.txt  | head

alpesh · October 28, 2011, 3:13pm

Hi radoulov and vgersh99,

Can you help me out please?

Thanks,
Alpesh

vgersh99 · October 28, 2011, 4:27pm

not tested

awk 'NR == FNR {   
    f2[$1] = $10
    f2pad[$1]=(match("^[0-9]+S", $6))?(substr($6,1,RLENGTH-1)):0
    next   
  } 
  $4 in f2 {
      print $0, substr(f2[$4], $8 - $2 + f2pad[$4], $9 - $8)    
}' file2_truncated.txt file1_truncated.txt |      head

radoulov · October 29, 2011, 4:02am

Just to add that if you post an example of the expected output, based on the provided input, you'll probably receive a quicker answer.

alpesh · November 3, 2011, 11:36pm

Hi radoulov,

I will try to explain my question with two examples. Sorry if its a lengthy read, I`m sure the answer will take you much less time than reading the question. This is a continuation of the substring code that you helped me with earlier. I have attached file1 and file 2 samples for testing.

File2$6 can have 'M' and 'S' along with other alphebets and numbers.There can be no S, or max 2 S. There has to be at least 1 'M' and at most two 'M's. As a rule , we ignore the S and take the M.
S will only be present at the beginning and/or ending of $6 and not the middle.
example 23S4M9S, 1S34M1S, 34M2S are valid but 23M1S12M is invalid.

Lets take an example of file1$4=SNPSTER1_0001:7:60:876:131#0/1

So if file2$6 is 20M769N15M1S for $10 string ATAGCCAATATCCCCAACAGGTTGAGGGAACTGTTT
,we divide it into 4 segments.

s1=0=first 0 characters to ignore since there is no leading S
s2=1=last 1 character to ignore = T
m1=20=first 20 characters after s1,ATAGCCAATATCCCCAACAG
m2=15=last 15 characters before s2, GTTGAGGGAACTGTT

So we have 2 strings(m1 and m2), and substring is to be extracted from one of them based on the following condition.

if (file2$4+s1+m1) > file1$9
choose string m1 for substring operation
else
choose string m2 for substring opeartion

Substring operation

when file2$1=file1$4,

print file1$0 , the substring of m1 or m2 with parameters file1$8 - file1$2 + 1, file1$9 - file1$8

In this case file1$2=15735490
file1$8=15735496
file1$9=15735497

         file2$4=15734702

(file2$4+s1+m1)=15734722 is less than file1$9=15735497

so we choose m2=GTTGAGGGAACTGTT for substring operation.

answer = substring (GTTGAGGGAACTGTT,7,1) = G

##########################################################

Another example

file1$4=file2$1=SNPSTER1_0001:7:115:1082:672#0/1
file2$10 = ATCTTGGGCCGCGAGCATCTTCAACCGCAAAATTTG

file 2$6=1S24M186N11M

s1=1 ignore first character 'A'
s2=0
m1=24 , TCTTGGGCCGCGAGCATCTTCAAC
m2=11, CGCAAAATTTG

In this case file1$2=4044310
file1$8=4044316
file1$9=4044317

         file2$4=4044311

Here (file2$4+s1+m1)=4044336 > file1$9=4044317

So we choose m1=TCTTGGGCCGCGAGCATCTTCAA for subtring operation

answer = substring (TCTTGGGCCGCGAGCATCTTCAA ,7,1) = G

############################################################

Thanks,
Alpesh