Problem counting unique disks/slices

jontjioe · September 8, 2011, 11:29am

I want to create a unique listing of slices/disks from a large list that will have duplicates. Here is a sample of the input file.

#array.txt
Disk4:\s93
Disk4:\s93
Disk4:\s94
Disk4:\s95\s96\s97
Disk4:\s93
Disk4:\s95\s96\s103
Disk4:\s93
Disk4:\s93
Disk4:\s95\s96\s105
Disk4:\s93
Disk4:\s95\s96\s105
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s95\s96\s106

I think the following command should give me what I want. It should output the unique disk/slice name in the first column and how many times that disk/slice occurred in the list in the 2nd column.

cat array.txt | awk 'count[$1]++ END {for (i in count) print i, count[i]}' > array.txt_unique

However, only some of the lines in the output file are correct. Sometimes it seems that it is not matching the disk name and it thinks it is different when in reality it is the same. See my sample output file below.

#array.txt_unique
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s95\s96\s105
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s93
Disk4:\s95\s96\s103 1
Disk4:\s95\s96\s105 2
Disk4:\s95\s96\s106 1
Disk4:\s95\s96\s97 1
Disk4:\s93 14
Disk4:\s94 1

I also tried removing the colons and backslashes before counting the unique slices, but I got the same result. Can someone help me with this?

Thanks in advance,
Jonathan

ahamed101 · September 8, 2011, 11:42am

sort yourfile | uniq

Is this what you want?

--ahamed

shamrock · September 8, 2011, 1:33pm

awk '{x[$0]++}END{for(i in x) print i,x[i]}' array.txt
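
A note on why the braces matter, as a minimal sketch rather than a full diagnosis: without braces, awk treats count[$1]++ as a pattern, so it prints the input line every time the value before the increment is non-zero, which is every repeat of a name. With braces it is an action and nothing is printed until END. The temp file and the variable names no_braces and with_braces are only for illustration, and sort is added because the order of for (i in ...) is not specified.

```shell
# Shortened sample in the format of array.txt from the first post.
sample=$(mktemp)
printf '%s\n' 'Disk4:\s93' 'Disk4:\s93' 'Disk4:\s94' 'Disk4:\s93' > "$sample"

# Pattern form (no braces): the line is printed whenever the count
# before the increment is non-zero, i.e. on the 2nd, 3rd, ... occurrence.
no_braces=$(awk 'count[$1]++' "$sample")

# Action form (braces): only counts while reading, prints totals at END.
with_braces=$(awk '{ count[$1]++ } END { for (i in count) print i, count[i] }' "$sample" | sort)

rm -f "$sample"
printf 'pattern form:\n%s\n\naction form:\n%s\n' "$no_braces" "$with_braces"
```

The pattern form is where the extra lines without a count in array.txt_unique would come from; the action form prints each name once with its total.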

jontjioe · September 9, 2011, 8:47am

Shamrock and Ahamed,

Thank you both for your help!

Shamrock,
I think the code you provided is the same as what I originally tried that only seems to work some of the time:

cat array.txt | awk 'count[$1]++ END {for (i in count) print i, count[i]}' > array.txt_unique

Ahamed, your snippet works, but I was having trouble fitting it into the larger task I was trying to accomplish.

Let me explain my problem in full. I have a trace file I will use as input that looks like the following:

#input.txt
53600.88  "Disk4:\s93" 129048320 16 0
53601.96  "Disk4:\s93" 100679424 8 0
53602.16  "Disk4:\s94" 14080 1 0
53603.97  "Disk4:\s95\s96\s97" 95010560 128 0
53614.06  "Disk4:\s93" 129052416 16 0
53616.24  "Disk4:\s95\s96\s103" 204544 128 0
53620.87  "Disk4:\s93" 100679424 8 0
53623.21  "Disk4:\s95\s96\s105" 11179776 128 0
53624.2  "Disk4:\s93" 100681472 8 0
53628.79  "Disk4:\s95\s96\s105" 11179776 128 0
53629.91  "Disk4:\s93" 100679424 8 0
53641.74  "Disk4:\s95\s96\s106" 20336384 8 0
53643.65  "Disk4:\s93" 100679424 8 0
53647.63  "Disk4:\s95\s96\s107" 124010240 64 0
53649.5  "Disk4:\s93" 100679424 8 0
53653.25  "Disk4:\s95\s96\s108" 60641024 8 0
53656.19  "Disk4:\s95\s96\s97" 95010560 8 0
69015.39  "Disk4:\s152\s153" 81643264 16 0
88588.57  "Disk4:\s172\s173" 72648448 16 1
103611.34  "Disk4:\s93" 129062656 16 0
103612.41  "Disk4:\s93" 100681472 8 0
103917.55  "Disk4:\s172\s173" 115363584 16 1
113755.24  "Disk4:\s252" 113782528 8 0
113755.76  "Disk4:\s253\$UsnJrnl:$J" 22150912 21 0

I would like to convert all of the disk information in the 2nd column to be a device number like 0, 1, 2, etc. I don't care which device number gets assigned to which string. It just has to be unique and match to the original.

So...

"Disk4:\s93" can simply become 0
"Disk4:\s94" can simply become 1
"Disk4:\s95\s96\s97" can simply become 2

For example, I want my completed output file to look like this:

#output.txt
53600.88  0 129048320 16 0
53601.96  0 100679424 8 0
53602.16  1 14080 1 0
53603.97  2 95010560 128 0
53614.06  0 129052416 16 0
53616.24  3 204544 128 0
53620.87  0 100679424 8 0
53623.21  4 11179776 128 0
53624.2  0 100681472 8 0
53628.79  4 11179776 128 0
53629.91  0 100679424 8 0
53641.74  5 20336384 8 0
53643.65  0 100679424 8 0
53647.63  6 124010240 64 0
53649.5  0 100679424 8 0
53653.25  7 60641024 8 0
53656.19  2 95010560 8 0
69015.39  8 81643264 16 0
88588.57  9 72648448 16 1
103611.34  0 129062656 16 0
103612.41  0 100681472 8 0
103917.55  9 115363584 16 1
113755.24  10 113782528 8 0
113755.76  11 22150912 21 0

Hopefully this explains what I'm trying to do better. Thank you!

ahamed101 · September 9, 2011, 9:27am

awk '{gsub(/\\|\$/,"#");if($2 in a){}else{a[$2]=i++;}gsub($2,a[$2])}1' infile

53600.88  0 129048320 16 0
53601.96  0 100679424 8 0
53602.16  1 14080 1 0
53603.97  2 95010560 128 0
53614.06  0 129052416 16 0
53616.24  3 204544 128 0
53620.87  0 100679424 8 0
53623.21  4 11179776 128 0
53624.2  0 100681472 8 0
53628.79  4 11179776 128 0
53629.91  0 100679424 8 0
53641.74  5 20336384 8 0
53643.65  0 100679424 8 0
53647.63  6 124010240 64 0
53649.5  0 100679424 8 0
53653.25  7 60641024 8 0
53656.19  2 95010560 8 0
69015.39  8 81643264 16 0
88588.57  9 72648448 16 1
103611.34  0 129062656 16 0
103612.41  0 100681472 8 0
103917.55  9 115363584 16 1
113755.24  10 113782528 8 0
113755.76  11 22150912 21 0
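
For anyone unpacking the one-liner, here is the same idea spelled out as a rough sketch. It is a variant, not the exact command above: it assigns the number straight to $2 instead of using gsub, so the backslashes and $ never need to be replaced first. The names id and next_id and the three-line temp file are only for illustration. It assumes the quoted disk name never contains a space, and note that assigning to $2 makes awk rebuild the line with single spaces between fields.

```shell
# Three lines in the format of input.txt from the post above.
trace=$(mktemp)
cat > "$trace" <<'EOF'
53600.88  "Disk4:\s93" 129048320 16 0
53602.16  "Disk4:\s94" 14080 1 0
53614.06  "Disk4:\s93" 129052416 16 0
EOF

# id[] maps each distinct second field to the next unused number
# (next_id starts out empty, which awk treats as 0).
# The second rule swaps the name for its number; the trailing 1 prints the line.
mapped=$(awk '!($2 in id) { id[$2] = next_id++ } { $2 = id[$2] } 1' "$trace")

rm -f "$trace"
printf '%s\n' "$mapped"
```

Numbers are handed out in order of first appearance, so the first disk seen gets 0, the next new one gets 1, and so on.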

--ahamed

jontjioe · September 9, 2011, 9:34am

Ahamed,

This is awesome! It is exactly what I needed. Thank you so much! I noticed the weird special characters in the last line too. The file is an excerpt from a well-known trace file. It is probably incorrect, but as long as I can uniquely identify it as a disk, it will still work for me.

Thank you!
Jonathan