Using awk to append incremental numbers to the end of duplicate file names.

Ethereal · June 20, 2012, 8:08am

I'm currently working on a script that extracts files from a .zip, runs sha1sum against them, and then uses awk to pre-format the results into something more readable, like this:

Z 69 89e013b0d8aa2f9a79fcec4f2d71c6a469222c07 File1
Z 69 6c3aea28ce22b495e68e022a1578204a9de908ed File2
Z 69 54122c1890d94ff76578076018d00a7c16e86e6d File4
Z 69 e66b24788b24a1733778d983d8163bb02a5c7201 File5
Z 69 c738aae20b50658ada2e0d5144302b9b695f98f3 File6
Z 69 40a63fe5078f5481737d4c0a7cbdca775cf9af28 File4
Z 69 4754cfd9f38d8a0beba4c8ebbf83569736c7ebfc File4
Z 69 721e29ad87bc1fccb3e0cae987b81a21a941e408 File2
Z 47 ca926c99b0394a8783152c19a7ffc6d686dcbcb3 File1
Z 47 1d66b78cc8b67e848aba4eecc3cb0f32e81b4004 File2
Z 47 114e0653503340c7325b345642fb61afddf4f339 File3
Z 47 4754cfd9f38d8a0beba4c8ebbf83569736c7ebfc File4
Z 47 86a09ee704fde229b0b3b25fe6190179e334364c File3
Z 47 7ba8ae667e48f53ccffb52a308340726e446712a File5

The problem I am having is that, since this script operates on zips, it will sometimes produce files with duplicate names within a specific subset of files. For example, subset 69 has File4 appear three times and File2 twice, while subset 47 has File3 appear twice. What I am trying to accomplish is to make awk check for duplicate occurrences of $4 within a $2 subset and then use something like '{ count=1, count++ }' to append incrementally increasing numbers to the end of those file names, thus making them unique. Any help or hints on how I might accomplish this would be appreciated, since I'm pulling my hair out over this.

bartus11 · June 20, 2012, 8:19am

Can you post desired output for this sample data?

Ethereal · June 20, 2012, 9:04am

The listing below is already pre-formatted with awk; I am just struggling with the rename part. Ideally, count would reset to 1 every time a new value is picked up in $2, like this:

Z 69 89e013b0d8aa2f9a79fcec4f2d71c6a469222c07 File1
Z 69 6c3aea28ce22b495e68e022a1578204a9de908ed File2_1
Z 69 54122c1890d94ff76578076018d00a7c16e86e6d File4_1
Z 69 e66b24788b24a1733778d983d8163bb02a5c7201 File5
Z 69 c738aae20b50658ada2e0d5144302b9b695f98f3 File6
Z 69 40a63fe5078f5481737d4c0a7cbdca775cf9af28 File4_2
Z 69 4754cfd9f38d8a0beba4c8ebbf83569736c7ebfc File4_3
Z 69 721e29ad87bc1fccb3e0cae987b81a21a941e408 File2_2
Z 47 ca926c99b0394a8783152c19a7ffc6d686dcbcb3 File1
Z 47 1d66b78cc8b67e848aba4eecc3cb0f32e81b4004 File2
Z 47 114e0653503340c7325b345642fb61afddf4f339 File3_1
Z 47 4754cfd9f38d8a0beba4c8ebbf83569736c7ebfc File4
Z 47 86a09ee704fde229b0b3b25fe6190179e334364c File3_2
Z 47 7ba8ae667e48f53ccffb52a308340726e446712a File5

Failing that, count could just increment indefinitely, so you end up with File3_55, File3_56, etc.

bartus11 · June 20, 2012, 9:11am

Try:

awk '{$4=$4"_"++a[$4]}1' file

elixir_sinari · June 20, 2012, 9:18am

Something like this?

awk '{if(++a[$2,$4]>1)$4=$4"_"a[$2,$4]}1' inputfile
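One caveat with the one-liner above: it leaves the first occurrence of a duplicated name untouched (File2, then File2_2), whereas the desired output numbers every occurrence starting at _1. A two-pass sketch that should match the desired output, assuming the data sits in a regular file that awk can read twice (sample.txt and renamed.txt are hypothetical names; the rows are copied from the sample listing):

```shell
# Hypothetical file name; rows taken from the sample listing in the thread.
cat > sample.txt <<'EOF'
Z 69 89e013b0d8aa2f9a79fcec4f2d71c6a469222c07 File1
Z 69 6c3aea28ce22b495e68e022a1578204a9de908ed File2
Z 69 54122c1890d94ff76578076018d00a7c16e86e6d File4
Z 69 40a63fe5078f5481737d4c0a7cbdca775cf9af28 File4
Z 69 721e29ad87bc1fccb3e0cae987b81a21a941e408 File2
Z 47 1d66b78cc8b67e848aba4eecc3cb0f32e81b4004 File2
Z 47 114e0653503340c7325b345642fb61afddf4f339 File3
Z 47 86a09ee704fde229b0b3b25fe6190179e334364c File3
EOF

# Pass 1 (NR==FNR is true only while reading the first file argument):
#   count how often each name occurs within each subset.
# Pass 2: suffix only names whose total is greater than 1, numbering from 1.
awk 'NR==FNR { total[$2,$4]++; next }
     total[$2,$4] > 1 { $4 = $4 "_" (++seen[$2,$4]) }
     1' sample.txt sample.txt > renamed.txt

cat renamed.txt
```

Because the file is named twice on the command line, this approach does not work on piped input; in that case the data would have to be written to a temporary file first.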

Ethereal · June 20, 2012, 9:54am

Thanks for the quick responses, guys. Elixir's is closer to what I need, but about 3x as complex. As near as I can tell, you basically did what I wasted about half a day trying to make work, by putting the variables into an associative array, but you seem to have used a multi-dimensional array, which I'm not overly familiar with. To me it looks like you're creating a numbered array, incrementing it with ++, adding the values of $2 and $4 into it, then stating that if that array is > 1, let $4 = $4 appended with _ and the number value of the array. Is that right?
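That reading is roughly right, with one refinement: in POSIX awk, a[$2,$4] is not a truly multi-dimensional array. It is a single associative array whose key is the string $2 SUBSEP $4, and ++a[$2,$4] bumps a per-key counter that starts at 0. A small sketch to illustrate this (the hashA/hashB/hashC values are placeholders, not real checksums):

```shell
out=$(printf '%s\n' 'Z 69 hashA File4' 'Z 69 hashB File4' 'Z 47 hashC File4' |
awk '{
    key = $2 SUBSEP $4            # the same key awk builds for a[$2,$4]
    n = ++a[$2,$4]                # unset elements count as 0, so the first hit gives 1
    same = (a[key] == n) ? "yes" : "no"   # a[key] and a[$2,$4] are the same element
    print $2, $4, n, same
}')
printf '%s\n' "$out"
```

The counter for File4 restarts in subset 47 because the key includes $2, which is what gives the per-subset numbering.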