Guys,
Please help me with this code. I have a 2 GB file to process, and shell seems to be the best option. I am a biologist, and though I can think of the logic, the commands are beyond me. Any help is greatly appreciated. Please look at the attached file and the requirement will be very clear.
I want to count the rows in FILE 1 whose columns match FILE 2, and group the counts.
1) FILE 1: cols 1 and 3 have to be matched with FILE 2: cols 1 and 2.
2) When condition 1 is satisfied, I need to count the rows in FILE 1 and separate them into group 1 or group 2.
Compare FILE 1: col 4 to FILE 2: cols 3 and 4. If they are of different lengths, then
trim the last character from FILE 1 col 4 and compare.
If it matches with FILE2:col 3, then increment group 1.
If it matches with FILE2:col 4, then increment group 2.
If it does not match either, assign it to grp1 or grp2, whichever has the value blank in FILE 2; if neither of the two is blank, ignore that row.
3) Do steps 1 and 2 for each value of FILE:1 col 2.
The string "random" in the attached file can be any non-blank string.
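To make the rule in step 2 concrete, here is a minimal Python sketch of how a single FILE 1 row might be classified once its position has matched in FILE 2. The function name `classify`, its parameter names, and the assumption that an empty cell is written as the literal word `blank` are illustrative choices, not something taken from the real files. It also assumes the comparison is case-sensitive.

```python
def classify(value, first, second, blank="blank"):
    """Return 1, 2, or None for one FILE 1 row whose position already matched.

    value  : FILE 1 col 4
    first  : FILE 2 col 3
    second : FILE 2 col 4
    blank  : assumed marker for an empty cell (the literal word "blank")
    """
    def matches(candidate, target):
        if target == blank:
            return False  # a blank cell never counts as a real match
        if len(candidate) != len(target):
            candidate = candidate[:-1]  # lengths differ: drop the last character
        return candidate == target

    if matches(value, first):
        return 1
    if matches(value, second):
        return 2
    # No match: fall back to whichever FILE 2 column is blank.
    if first == blank:
        return 1
    if second == blank:
        return 2
    return None  # neither column is blank, so the row is ignored


print(classify("a", "a", "t"))           # 1
print(classify("random", "taa", "blank"))  # 2
print(classify("C", "g", "t"))           # None
```

Whether "trim the last character" should apply to every length mismatch or only to specific cases is a guess here; the sketch trims whenever the lengths differ.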
FILE:2
c1 1234 a t
c1 1534 a t
c1 1634 a t
c1 1654 a t
c1 2234 a t
c1 5678 g t
c1 91011 t a
c1 2444 taa blank
c1 5667 att blank
c1 34566 blank att
c1 36365 a t
c2 88777 G blank
c2 7455 T a
c2 46445 g t
c2 74676 a c
c2 565455 c G
FILE:1
c1 g1 1234 a
c1 g1 1234 t
c1 g1 1234 t
c1 g1 1234 a
c1 g1 1234 a
c1 g1 1234 a
c1 g1 5678 g
c1 g1 5678 C
c1 g1 5678 t
c1 g1 5678 t
c1 g1 5678 t
c1 g1 5678 g
c1 g1 5678 g
c1 g1 91011 t
c1 g2 2444 random
c1 g2 2444 random
c1 g2 2444 random
c1 g2 2444 taa
c1 g2 2444 random
c1 g2 2444 taa
c1 g2 5667 att
c1 g2 34566 random
c1 g2 36365 a
c2 g3 88777 G
c2 g3 88777 G
c2 g3 88777 random
c2 g3 88777 G
c2 g3 88777 G
c2 g3 7455 T
c2 g4 46445 t
c2 g4 74676 c
c2 g4 74676 c
c2 g4 74676 a
c2 g4 74676 a
c2 g4 74676 c
c2 g4 565455 G
c2 g4 565455 G
c2 g4 565455 G
Expected output (FILE 1 col 1, FILE 1 col 2, group 1 count, group 2 count)
c1 g1 8 5
c1 g2 5 4
c2 g3 5 1
c2 g4 2 7
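For reference, steps 1 to 3 could be sketched end to end roughly as follows in Python, as an alternative to a shell one-liner. This is only a sketch under several assumptions: FILE 2 is small enough to hold in memory while FILE 1 is read line by line, columns are whitespace-separated, empty cells are written as the literal word `blank`, and matching is case-sensitive. The names `load_sites`, `matches`, and `count_groups` are made up for illustration.

```python
import sys
from collections import defaultdict

BLANK = "blank"  # assumption: empty cells are written as the literal word "blank"


def load_sites(lines):
    """Index FILE 2 by (col 1, col 2) -> (col 3, col 4)."""
    sites = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 4:
            sites[(fields[0], fields[1])] = (fields[2], fields[3])
    return sites


def matches(value, target):
    """Compare value to target; if lengths differ, drop value's last character first."""
    if target == BLANK:
        return False
    if len(value) != len(target):
        value = value[:-1]
    return value == target


def count_groups(file1_lines, sites):
    """Return {(col 1, col 2): [group 1 count, group 2 count]} for FILE 1 rows."""
    counts = defaultdict(lambda: [0, 0])
    for line in file1_lines:
        fields = line.split()
        if len(fields) < 4:
            continue
        chrom, group, pos, value = fields[:4]
        site = sites.get((chrom, pos))
        if site is None:
            continue  # step 1 failed: position not present in FILE 2
        first, second = site
        tally = counts[(chrom, group)]
        if matches(value, first):
            tally[0] += 1
        elif matches(value, second):
            tally[1] += 1
        elif first == BLANK:
            tally[0] += 1
        elif second == BLANK:
            tally[1] += 1
        # otherwise neither column is blank and the row is ignored
    return counts


# Usage (hypothetical script name): python count_groups.py FILE2 FILE1
if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as f2:
        sites = load_sites(f2)
    with open(sys.argv[2]) as f1:
        for (chrom, group), (g1, g2) in count_groups(f1, sites).items():
            print(chrom, group, g1, g2)
```

Reading FILE 1 as a stream keeps memory use tied to the size of FILE 2 and the number of groups, which matters if the 2 GB file is FILE 1. If FILE 2 is the large one, the lookup table would need a different approach.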