I have a file of names with the following structure
NAME [tab] FREQUENCY
NAME NAME [tab] FREQUENCY
NAME NAME NAME [tab] FREQUENCY
i.e. more than one name is assigned the same frequency. An example will make this clear
SANDHYA DAS 6901
ARATI DAS 6201
KALPANA DAS 4714
GITA DAS 4550
BISWANATH DAS 3949
SWAPAN DAS 3941
SUKUMAR DAS 3876
GOPAL DAS 3835
SARASWATI DAS 3769
DILIP DAS 3653
TAPAN DAS 3607
ASHOKE DAS 3604
PRATIMA DAS 3558
PURNIMA DAS 3546
BASANTI DAS 3372
SHANKAR DAS 3279
SANDHYA GHOSH 3254
SANJAY DAS 3252
PRATIMA DAS 3212
KALPANA DAS 3203
ARATI GHOSH 3155
MALATI DAS 3151
SWAPAN DAS 3138
SANDHYA RANI DAS 3120
LAKSHMI DAS 3104
ANJALI DAS 3085
I want to assign the same frequency to both names or to all three names to ensure that statistically both or all three names within a field retain their frequency.
The expected output would be
ANJALI 3085
ARATI 6201
ARATI 3155
ASHOKE 3604
BASANTI 3372
BISWANATH 3949
DILIP 3653
GITA 4550
GOPAL 3835
KALPANA 4714
KALPANA 3203
LAKSHMI 3104
MALATI 3151
PRATIMA 3558
PRATIMA 3212
PURNIMA 3546
SANDHYA 6901
SANDHYA 3254
SANDHYA 3120
SANJAY 3252
SARASWATI 3769
SHANKAR 3279
SUKUMAR 3876
SWAPAN 3941
SWAPAN 3138
TAPAN 3607
DAS 3085
DAS 6201
DAS 3155
DAS 3604
DAS 3372
DAS 3949
DAS 3653
DAS 4550
DAS 3835
DAS 4714
DAS 3203
DAS 3104
DAS 3151
DAS 3558
DAS 3212
DAS 3546
DAS 3254
DAS 3120
DAS 3252
DAS 3279
DAS 3876
DAS 3138
DAS 3607
GHOSH 6901
GHOSH 3769
RANI 3941
DAS 3120
I am doing this field separation by means of a Macro in Excel but since the database is huge, the process is long and tedious.
Would it be possible to do the same with the help of a PERL/AWK script ? I already have written an awk tool to merge all frequencies, which I could use to merge the frequencies. Aa an example all occurencies of
DAS
would thus have a merged frequency.
I work under the Windows OS and UNIX (sigh) is not my OS. No shell scripts please.
Many thanks.
The following awk script provides two lists (with an empty line between them). The first list provides:
NAME<tab>FREQUENCY
linesfor all NAMES on an input line and the second list provides
NAME<tab>FREQUENCY[<space>FREQUENCY]...
for all FREQUENCY entries for each NAME entry found in the input.
awk '
{ for(i = 1; i < NF; i++) {
printf("%s\t%s\n", $i, $NF)
($i in a) ? a[$i] = a[$i] " " $NF : a[$i] = "\t" $NF
}
}
END { print ""
for(i in a) print i a
}' file
I know you're doing this on Windows, but if someone else wants to try it on a Solaris/SunOS system, they would need to use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk . With your sample input, the output produced is:
SANDHYA 6901
DAS 6901
ARATI 6201
DAS 6201
KALPANA 4714
DAS 4714
GITA 4550
DAS 4550
BISWANATH 3949
DAS 3949
SWAPAN 3941
DAS 3941
SUKUMAR 3876
DAS 3876
GOPAL 3835
DAS 3835
SARASWATI 3769
DAS 3769
DILIP 3653
DAS 3653
TAPAN 3607
DAS 3607
ASHOKE 3604
DAS 3604
PRATIMA 3558
DAS 3558
PURNIMA 3546
DAS 3546
BASANTI 3372
DAS 3372
SHANKAR 3279
DAS 3279
SANDHYA 3254
GHOSH 3254
SANJAY 3252
DAS 3252
PRATIMA 3212
DAS 3212
KALPANA 3203
DAS 3203
ARATI 3155
GHOSH 3155
MALATI 3151
DAS 3151
SWAPAN 3138
DAS 3138
SANDHYA 3120
RANI 3120
DAS 3120
LAKSHMI 3104
DAS 3104
ANJALI 3085
DAS 3085
MALATI 3151
DILIP 3653
GOPAL 3835
TAPAN 3607
GHOSH 3254 3155
PURNIMA 3546
BASANTI 3372
SANJAY 3252
LAKSHMI 3104
RANI 3120
ARATI 6201 3155
ASHOKE 3604
KALPANA 4714 3203
SUKUMAR 3876
ANJALI 3085
GITA 4550
SARASWATI 3769
BISWANATH 3949
PRATIMA 3558 3212
DAS 6901 6201 4714 4550 3949 3941 3876 3835 3769 3653 3607 3604 3558 3546 3372 3279 3252 3212 3203 3151 3138 3120 3104 3085
SHANKAR 3279
SWAPAN 3941 3138
SANDHYA 6901 3254 3120
PS Note that the output produced here seems to match your input (but even after sorting is VERY different from the output you said you wanted). As an example, your input contains the lines:
Many thanks it worked fast and zipped through over 700,000 records in no time.
The only hassle:
when a word has more than one occurence and therefore frequencies, all the frequencies belonging to that word are stored on one line.
example
How do I get the script to store these on separate lines. My frequency merge script accepts
word [tab] frequency
same word [tab] frequency
and merges them.
If it is not too much of a hassle could you please comment that code. I tried to modify the script but it mangled the results.
Many thanks once more
I'm sorry I confused you.
Please look at the output again! You will see two sets of output. The 1st set only contains
NAME<tab>FREQUENCY
exactly as you requested, but the output data matches your sample input instead of your sample output. And the output provided by my script prints entries in the ouput in the order they were found in the input file. (Your sample output seemed to be in a fairly random order and had FREQUENCY values for some NAMEs that were not present in your sample input.)
You said you had a second awk program that would give you a merged list of all frequencies associated with a NAME. The second part of the output produced by my awk script did that without needing a second script.
Looking at my script again:
awk '
{ for(i = 1; i < NF; i++) {
printf("%s\t%s\n", $i, $NF)
($i in a) ? a[$i] = a[$i] " " $NF : a[$i] = "\t" $NF
}
}
END { print ""
for(i in a) print i a
}' file
If you don't want the 2nd part of the output, remove the code shown in red. That will leave you with:
which prints a line for each field except the last from every input line in the file. Each output line will contain the name found in one of the 1st (NF - 1) fields ($i) from a line in the file and the frequency found in the last field ($NF) separated by a tab character.
The stuff that was in red created an array (a) where the array index was a NAME and the valie of a[NAME] is a list of the frequencies found in the input. The END clause in the awk script printed an empty line to separate the two parts of the output followed by the elements of the array giving the NAME and the list of frequencies found for that NAME in a random output order. (The frequencies for a given NAME in the output appear in the order in which the entries were found in the input file.)