awk - Counting number of similar lines

dhanamurthy · May 13, 2008, 11:26pm

Hi All

I have the input file OMAK_11.

OMAK 000002EXCLUDE 1341
OMAK 000002EXCLUDE 1341
OMAK 000002EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000004EXCLUDE 1341
OMAK 000004EXCLUDE 1341
OMAK 000004EXCLUDE 1341
OMAK 000004EXCLUDE 1341
OMAK 000005EXCLUDE 1341
OMAK 000005EXCLUDE 1341
OMAK 000005EXCLUDE 1341

I want the output as

OMAK EXCLUDE 000002 3 1341
OMAK EXCLUDE 000003 3 1341
OMAK EXCLUDE 000004 4 1341
OMAK EXCLUDE 000005 3 1341

I have this program, which works well except for the last group, for which I get no output. I think it has something to do with the END block of awk.

awk '{ curr=substr($0,1,11)

     if ( curr != prev && prev != "")
     {
      a=sprintf("%s %-50s %6s %-6s %s",substr(prev_0,1,5),substr(prev_0,12,29),substr(prev_0,6,6),count,substr(prev_0,41,4))
      print a
      count=0
     }
     count++
     prev=curr
     prev_0=$0
     } END {a=sprintf("%s %-50s %6s %-6s %s",substr($0,1,5),substr($0,12,29),substr($0,6,6),count,substr($0,41,4))

print a
}' OMAK_11

Can anyone tell me how to fix this?

Regards
Dhana

Annihilannic · May 13, 2008, 11:33pm

Incidentally, instead of a=sprintf(...); print a you can just use printf(...).

Maybe $0 is undefined when you reach the END clause... if you change print a to print $0 in that last section, does it print the last line of input?
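
A minimal sketch of that idea, assuming the cause is an awk that does not keep $0 available in END: flush the last group from the saved copy (prev_0) instead of $0, and use printf directly. The emit() helper and the abbreviated sample data are illustrative only, and the fields are split on whitespace here rather than by the fixed column positions of the original script.

```shell
# Sample input, abbreviated from the OMAK_11 data in the question.
input='OMAK 000002EXCLUDE 1341
OMAK 000002EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000003EXCLUDE 1341'

out=$(printf '%s\n' "$input" | awk '
# emit() is a hypothetical helper: print one summary line for a finished group.
function emit(line, n,    f) {
    split(line, f, " ")                  # f[1]=OMAK, f[2]=000002EXCLUDE, f[3]=1341
    printf "%s %s %s %d %s\n", f[1], substr(f[2], 7), substr(f[2], 1, 6), n, f[3]
}
{
    curr = substr($0, 1, 11)             # grouping key, as in the original script
    if (prev != "" && curr != prev) {
        emit(prev_0, count)
        count = 0
    }
    count++
    prev = curr
    prev_0 = $0                          # remember the line; do not rely on $0 in END
}
END {
    if (prev != "") emit(prev_0, count)  # flush the last group from the saved copy
}')

printf '%s\n' "$out"
```

With the five sample lines this prints one line per group, with counts 2 and 3.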

Annihilannic · May 13, 2008, 11:36pm

By the way, you could also use uniq -c and rearrange the order of the output columns using awk.
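
A rough sketch of the uniq -c route, assuming identical lines are already adjacent in the input (uniq only merges neighbouring duplicates, so unsorted input would need a sort first). The sample data is abbreviated, and the column layout in the output is whitespace-separated rather than fixed-width.

```shell
# Sample input, abbreviated from the OMAK_11 data in the question.
input='OMAK 000002EXCLUDE 1341
OMAK 000002EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000003EXCLUDE 1341
OMAK 000003EXCLUDE 1341'

out=$(printf '%s\n' "$input" | uniq -c | awk '{
    # uniq -c prepends the count: $1=count, $2=OMAK, $3=000002EXCLUDE, $4=1341
    print $2, substr($3, 7), substr($3, 1, 6), $1, $4
}')

printf '%s\n' "$out"
```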

shamrock · May 14, 2008, 12:41am

[n]awk '{
  c[$0]++
  split($2, m, /[A-Z]+/)
  split($2, n, /[0-9]+/)
  a[$1" "n[2]" "m[1]]=c[$0]" "$3
} END {for(i in a) print i, a[i]}' file

dhanamurthy · May 14, 2008, 11:36pm

Hi

Thanks for the information provided.
I read the source code that you provided. For example, I have the data below.

SIZEC000002EXCLUDE 1341
SIZEC000002EXCLUDE 1341
SIZEC000002EXCLUDE 1341
SIZEC000003EXCLUDE 1341
SIZEC000003EXCLUDE 1341
SIZEC000003EXCLUDE 1341
SIZEC000004EXCLUDE 1341
SIZEC000004EXCLUDE 1341
SIZEC000004EXCLUDE 1341
SIZEC000004EXCLUDE 1341
SIZEC000005EXCLUDE 1341
SIZEC000005EXCLUDE 1341
SIZEC000005EXCLUDE 1341

I have two questions.

a] What is the purpose of these statements if the input is the data above?

split($2, m, /[A-Z]+/)
split($2, n, /[0-9]+/)

With this data, $2 will not contain any letters. Or is it necessary to have both m and n?

b] Suppose I have the data below:

SIZEC000004EXCLUDE 1380
SIZEC000004EXCLUDE 1382
SIZEC000005EXCLUDE 1340
SIZEC000005EXCLUDE 1341
SIZEC000005EXCLUDE 1342

I want to group the data like this:

SIZEC000004EXCLUDE 1380 1382
SIZEC000005EXCLUDE 1340 1341 1342

Does awk have any standard functions to do this?

Regards
Dhana

Annihilannic · May 15, 2008, 12:42am

Use an array indexed by $1, and append $2 to it as you process each line.
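
A minimal sketch of that approach, using the sample data from the question. The order array is an addition of mine so that keys come out in first-seen order, since the iteration order of for (k in array) is unspecified in awk; the names vals, order and n are illustrative.

```shell
# Sample input from question b] above.
input='SIZEC000004EXCLUDE 1380
SIZEC000004EXCLUDE 1382
SIZEC000005EXCLUDE 1340
SIZEC000005EXCLUDE 1341
SIZEC000005EXCLUDE 1342'

out=$(printf '%s\n' "$input" | awk '
{
    if (!($1 in vals)) order[++n] = $1   # remember first-seen order of keys
    vals[$1] = vals[$1] " " $2           # append this value to the list for its key
}
END {
    # each vals entry starts with a space, so plain concatenation is enough
    for (k = 1; k <= n; k++) print order[k] vals[order[k]]
}')

printf '%s\n' "$out"
```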

dhanamurthy · May 15, 2008, 1:25pm

Hi All

I have the input file as

INFOR00028114 GRAINS BAKERY 4000
INFOR00028114 GRAINS BAKERY 4000
INFOR00028114 GRAINS BAKERY 4000
INFOR0009183-RIVERS - IC 2672
INFOR0009183-RIVERS - IC 2672
INFOR0009183-RIVERS - IC 2672
INFOR0009183-RIVERS - IC 2671

I want the output to look like this:
BRAND 14 GRAINS BAKERY 000281 3 4000
BRAND 3-RIVERS - IC 000918 1 2671
BRAND 3-RIVERS - IC 000918 3 2672
BRAND 5 STAR 001972 2 3618

The layout is as follows:
position 1-5 for NAME1
position 6-6 for NAME2
position 12-41 for NAME3
position 42-46 for NAME4

I wrote the logic below, but I am getting this output:
BRAND 14 GRAINS BAKERY 000281 3 4000
BRAND 3-RIVERS - IC 000918 1 2671
BRAND 5 STAR 001972 2 3618
which is not what I expected.

awk '{
c[$0]++
a=substr($0,1,5)
b=substr($0,12,30)
ff=substr($0,6,6)
d=substr($0,42,4)
j[a" "b" "ff]=c[$0]" " d
}END {for(i in j) print i, j[i]}' tes|sort

I am not sure what needs to be changed.
Can anyone help me?

Regards
Dhana

matrixmadhan · May 15, 2008, 1:39pm

This should have been in a new thread.

Mods - could you please split this into a new thread?

summer_cherry · May 16, 2008, 6:00am

sed 's/E/ E/' file | awk '{
sum[$2]++
}
END{
for (i in sum)
print "OMAK EXCLUDE "i" "sum[i]" "1341
}'