Summarize file with column matching

newbie83 · November 11, 2011, 2:24pm

Guys,

Please help me with this code. I have a 2GB file to process, and shell seems to be the best option. I am a biologist, and though I can think of the logic, the commands are beyond me. Any help is greatly appreciated. Please look at the attached file and the requirement will be very clear.

I want to count the rows in file 1 that match columns in file 2, and group the counts.

1) FILE 1: cols 1 and 3 have to be matched with FILE 2: cols 1 and 2.
2) When condition 1 is satisfied, I need to count the rows in FILE 1 and assign each one to group 1 or group 2.
Compare FILE 1: col 4 to FILE 2: cols 3 and 4. If they are of different lengths, then
trim the last character from FILE 1 col 4 and compare.
If it matches FILE 2: col 3, then increment group 1.
If it matches FILE 2: col 4, then increment group 2.
If it does not match either, assign it to grp1 or grp2, whichever has the value blank. If neither of the two is blank, then ignore that row.

3) Do steps 1 and 2 for each value of FILE 1: col 2.

The string "random" in the attached file can be any non-blank string.

FILE:2

c1	1234	a t
c1	1534	a t
c1	1634	a t
c1	1654	a t
c1	2234	a t
c1	5678	g t
c1	91011	t a
c1	2444	taa blank
c1	5667	att blank
c1	34566	blank att
c1	36365	a t
c2	88777	G blank
c2	7455	T a		
c2	46445	g t
c2	74676	a c
c2	565455	c G


FILE:1
c1	g1	1234	a 
c1	g1	1234	t
c1	g1	1234	t
c1	g1	1234	a 
c1	g1	1234	a 
c1	g1	1234	a 
c1	g1	5678	g 
c1	g1	5678	C
c1	g1	5678	t
c1	g1	5678	t
c1	g1	5678	t
c1	g1	5678	g 
c1	g1	5678	g 
c1	g1	91011	t 
c1	g2	2444	random
c1	g2	2444	random
c1	g2	2444	random
c1	g2	2444	taa
c1	g2	2444	random
c1	g2	2444	taa 
c1	g2	5667	att
c1	g2	34566	random
c1	g2	36365	a 
c2	g3	88777	G 
c2	g3	88777	G 
c2	g3	88777	random
c2	g3	88777	G 
c2	g3	88777	G 
c2	g3	7455	T 		
c2	g4	46445	t
c2	g4	74676	c
c2	g4	74676	c
c2	g4	74676	a 
c2	g4	74676	a 
c2	g4	74676	c
c2	g4	565455	G
c2	g4	565455	G
c2	g4	565455	G


Expected output

c1	g1	8	5
c1	g2	5	4
c2	g3	5	1
c2	g4	2	7

radoulov · November 11, 2011, 4:59pm

I don't understand number 2:

Why 2 groups? How can we determine which group the records belong to?

newbie83 · November 11, 2011, 6:13pm

Radoulov,

Col3 in file2 indicates group 1, and Col4 indicates group 2. I need to match file1Col4
with col3 and col4 of file2 and check which one it matches.

The first record has a in file1 col4, which equals the grp1 value a in file2 col3.
The second record has t in file1 col4, which equals the grp2 value t in file2 col4.

c1 g1 1234 a grp1
c1 g1 1234 t grp2
c1 g2 2444 random grp2
c1 g2 34566 random grp1

Also, the data is NOT case sensitive. G=g , AGtc = agTc

Thank you..

newbie83 · November 15, 2011, 4:21pm

Hi radoulov, Is my requirement clear now? Thanks a ton for your help.

radoulov · November 15, 2011, 5:39pm

Not yet ... is blank the string blank, or something else? What do you mean by:

Increment that group by one if the value is blank?

newbie83 · November 15, 2011, 5:47pm

It is the string 'blank'.
If the value is any random string that does not match either group value,
then assign it to the group whose value is blank.

eg. grp1 = a, grp2=blank, value=t, then increment grp2 by 1

but for the following case ignore that record

eg. grp1 = a, grp2=b, value=t ... ignore record since there is no blank group
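
The fallback rule in the two examples above could be sketched as follows. This is only an illustration of one reading: classify is a hypothetical helper name, the comparison is case-insensitive as stated earlier in the thread, the trim-the-last-character rule is left out, and a row whose two groups are both blank is assumed to go to grp1 (the thread does not say).

```shell
# classify VALUE GRP1 GRP2
# Prints grp1, grp2 or ignore for a single value (hypothetical helper).
classify() {
  awk -v v="$1" -v a="$2" -v b="$3" 'BEGIN {
    # The data is not case sensitive, so compare in lower case
    v = tolower(v); a = tolower(a); b = tolower(b)
    if (a != "blank" && v == a)      print "grp1"   # matches the grp1 value
    else if (b != "blank" && v == b) print "grp2"   # matches the grp2 value
    else if (a == "blank")           print "grp1"   # no match: use the blank group
    else if (b == "blank")           print "grp2"
    else                             print "ignore" # no match and no blank group
  }'
}
```

With the values from the two examples, classify t a blank prints grp2 and classify t a b prints ignore.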

radoulov · November 16, 2011, 7:11am

I must admit that I still don't understand your requirement. We could start with the following script and try to debug/adapt it:

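awk 'END {
  # Print col 1, the group name taken from file1, and the count for each key
  for (g in gc) {
    split(g, t, SUBSEP)
    print t[1], gn[t[1], t[2]], gc[g]
  }
}
# First file (file1): remember the keys, the lower-cased values and the group names
NR == FNR {
  k[$1, $3]
  v[$1, $3, tolower($4)]
  gn[$1, $3] = $2
  next
}
# Second file (file2): only rows whose cols 1 and 2 were seen in file1
($1, $2) in k {
  for (i = 3; i <= 4; i++) {
    if ($i == "blank") {
      gc[$1, $2, $i]++
      continue
    }
    # Try the value as is, then with its last character trimmed
    if (($1, $2, tolower($i)) in v || ($1, $2, tolower(substr($i, 1, length($i) - 1))) in v)
      gc[$1, $2, tolower($i)]++
  }
}' file1 file2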

I suppose that it would be easier if you post bigger samples from both files and an example of the expected output based on those exact samples.
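
For comparison, here is a minimal sketch of one literal reading of the requirement as clarified in this thread. It is not a tested solution for the 2GB file. The names count_groups, matches, ref1, ref2, n1 and n2 are made up for illustration. It assumes whitespace-separated fields with no spaces inside a value, loads file2 into memory as the lookup table, prints groups in the order they first appear in file1, and sends a row to grp1 when both groups are blank (the thread does not cover that case).

```shell
# count_groups FILE1 FILE2
# Prints: col1, group name, grp1 count, grp2 count (tab-separated).
count_groups() {
  awk '
    # 1 if value v matches reference r (both lower case). When the lengths
    # differ, the value is retried with its last character trimmed.
    function matches(v, r) {
      if (r == "blank") return 0
      if (v == r) return 1
      return (length(v) != length(r) && substr(v, 1, length(v) - 1) == r)
    }
    BEGIN { OFS = "\t" }
    # file2 is read first: store the grp1 and grp2 values per (col1, col2)
    NR == FNR {
      ref1[$1, $2] = tolower($3)
      ref2[$1, $2] = tolower($4)
      next
    }
    # file1: keep only rows whose (col1, col3) exists in file2
    ($1, $3) in ref1 {
      key = $1 OFS $2
      if (!(key in seen)) { seen[key]; order[++n] = key }
      v = tolower($4); a = ref1[$1, $3]; b = ref2[$1, $3]
      if (matches(v, a))      n1[key]++
      else if (matches(v, b)) n2[key]++
      else if (a == "blank")  n1[key]++   # no match: fall back to the blank group
      else if (b == "blank")  n2[key]++
      # otherwise neither group is blank and the row is ignored
    }
    END {
      for (i = 1; i <= n; i++)
        print order[i], n1[order[i]] + 0, n2[order[i]] + 0
    }
  ' "$2" "$1"
}
```

It would be called as count_groups file1 file2. Traced by hand against the sample files above, this reading gives the four lines of the posted expected output; whether it holds on the real data depends on the assumptions listed.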