Assigning the same frequency to more than one words in a file

I have a file of names with the following structure

NAME [tab] FREQUENCY
NAME NAME [tab] FREQUENCY
NAME NAME NAME [tab] FREQUENCY

i.e. more than one name is assigned the same frequency. An example will make this clear

SANDHYA DAS	6901
ARATI DAS	6201
KALPANA DAS	4714
GITA DAS	4550
BISWANATH DAS	3949
SWAPAN DAS	3941
SUKUMAR DAS	3876
GOPAL DAS	3835
SARASWATI DAS	3769
DILIP DAS	3653
TAPAN DAS	3607
ASHOKE DAS	3604
PRATIMA DAS	3558
PURNIMA DAS	3546
BASANTI DAS	3372
SHANKAR DAS	3279
SANDHYA GHOSH	3254
SANJAY DAS	3252
PRATIMA DAS	3212
KALPANA DAS	3203
ARATI GHOSH	3155
MALATI DAS	3151
SWAPAN DAS	3138
SANDHYA RANI DAS	3120
LAKSHMI DAS	3104
ANJALI DAS	3085

I want to assign the same frequency to both names or to all three names to ensure that statistically both or all three names within a field retain their frequency.
The expected output would be

ANJALI	3085
ARATI	6201
ARATI	3155
ASHOKE	3604
BASANTI	3372
BISWANATH	3949
DILIP	3653
GITA	4550
GOPAL	3835
KALPANA	4714
KALPANA	3203
LAKSHMI	3104
MALATI	3151
PRATIMA	3558
PRATIMA	3212
PURNIMA	3546
SANDHYA	6901
SANDHYA	3254
SANDHYA	3120
SANJAY	3252
SARASWATI	3769
SHANKAR	3279
SUKUMAR	3876
SWAPAN	3941
SWAPAN	3138
TAPAN	3607
DAS	3085
DAS	6201
DAS	3155
DAS	3604
DAS	3372
DAS	3949
DAS	3653
DAS	4550
DAS	3835
DAS	4714
DAS	3203
DAS	3104
DAS	3151
DAS	3558
DAS	3212
DAS	3546
DAS	3254
DAS	3120
DAS	3252
DAS	3279
DAS	3876
DAS	3138
DAS	3607
GHOSH	6901
GHOSH	3769
RANI	3941
DAS	3120

I am doing this field separation by means of a Macro in Excel but since the database is huge, the process is long and tedious.
Would it be possible to do the same with the help of a PERL/AWK script ? I already have written an awk tool to merge all frequencies, which I could use to merge the frequencies. Aa an example all occurencies of

DAS

would thus have a merged frequency.
I work under the Windows OS and UNIX (sigh) is not my OS. No shell scripts please.
Many thanks.

The following awk script provides two lists (with an empty line between them). The first list provides:

NAME<tab>FREQUENCY

linesfor all NAMES on an input line and the second list provides

NAME<tab>FREQUENCY[<space>FREQUENCY]...

for all FREQUENCY entries for each NAME entry found in the input.

awk '
{       for(i = 1; i < NF; i++) {
                printf("%s\t%s\n", $i, $NF)
                ($i in a) ? a[$i] = a[$i] " " $NF : a[$i] = "\t" $NF
        }
}
END {   print ""
        for(i in a) print i a
}' file

I know you're doing this on Windows, but if someone else wants to try it on a Solaris/SunOS system, they would need to use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk . With your sample input, the output produced is:

SANDHYA	6901
DAS	6901
ARATI	6201
DAS	6201
KALPANA	4714
DAS	4714
GITA	4550
DAS	4550
BISWANATH	3949
DAS	3949
SWAPAN	3941
DAS	3941
SUKUMAR	3876
DAS	3876
GOPAL	3835
DAS	3835
SARASWATI	3769
DAS	3769
DILIP	3653
DAS	3653
TAPAN	3607
DAS	3607
ASHOKE	3604
DAS	3604
PRATIMA	3558
DAS	3558
PURNIMA	3546
DAS	3546
BASANTI	3372
DAS	3372
SHANKAR	3279
DAS	3279
SANDHYA	3254
GHOSH	3254
SANJAY	3252
DAS	3252
PRATIMA	3212
DAS	3212
KALPANA	3203
DAS	3203
ARATI	3155
GHOSH	3155
MALATI	3151
DAS	3151
SWAPAN	3138
DAS	3138
SANDHYA	3120
RANI	3120
DAS	3120
LAKSHMI	3104
DAS	3104
ANJALI	3085
DAS	3085

MALATI	3151
DILIP	3653
GOPAL	3835
TAPAN	3607
GHOSH	3254 3155
PURNIMA	3546
BASANTI	3372
SANJAY	3252
LAKSHMI	3104
RANI	3120
ARATI	6201 3155
ASHOKE	3604
KALPANA	4714 3203
SUKUMAR	3876
ANJALI	3085
GITA	4550
SARASWATI	3769
BISWANATH	3949
PRATIMA	3558 3212
DAS	6901 6201 4714 4550 3949 3941 3876 3835 3769 3653 3607 3604 3558 3546 3372 3279 3252 3212 3203 3151 3138 3120 3104 3085
SHANKAR	3279
SWAPAN	3941 3138
SANDHYA	6901 3254 3120

PS Note that the output produced here seems to match your input (but even after sorting is VERY different from the output you said you wanted). As an example, your input contains the lines:

SANDHYA GHOSH	3254
ARATI GHOSH	3155

but your output lines for GHOSH are:

GHOSH	6901
    and
GHOSH	3769

instead of:

GHOSH	3254
    and
GHOSH	3155

???

Many thanks it worked fast and zipped through over 700,000 records in no time.
The only hassle:
when a word has more than one occurence and therefore frequencies, all the frequencies belonging to that word are stored on one line.
example

DAS	6901 6201 4714 4550 3949 3941 3876 3835 3769 3653 3607 3604 3558 3546 3372 3279 3252 3212 3203 3151 3138 3120 3104 3085

How do I get the script to store these on separate lines. My frequency merge script accepts

word [tab] frequency
same word [tab] frequency

and merges them.
If it is not too much of a hassle could you please comment that code. I tried to modify the script but it mangled the results.
Many thanks once more

I'm sorry I confused you.
Please look at the output again! You will see two sets of output. The 1st set only contains

NAME<tab>FREQUENCY

exactly as you requested, but the output data matches your sample input instead of your sample output. And the output provided by my script prints entries in the ouput in the order they were found in the input file. (Your sample output seemed to be in a fairly random order and had FREQUENCY values for some NAMEs that were not present in your sample input.)

You said you had a second awk program that would give you a merged list of all frequencies associated with a NAME. The second part of the output produced by my awk script did that without needing a second script.

Looking at my script again:

awk '   
{       for(i = 1; i < NF; i++) {
                printf("%s\t%s\n", $i, $NF)
                ($i in a) ? a[$i] = a[$i] " " $NF : a[$i] = "\t" $NF
        }
}
END {   print ""
        for(i in a) print i a
}' file 

If you don't want the 2nd part of the output, remove the code shown in red. That will leave you with:

awk '{for(i = 1; i < NF; i++) printf("%s\t%s\n", $i, $NF)}' file 

which prints a line for each field except the last from every input line in the file. Each output line will contain the name found in one of the 1st (NF - 1) fields ($i) from a line in the file and the frequency found in the last field ($NF) separated by a tab character.

The stuff that was in red created an array (a) where the array index was a NAME and the valie of a[NAME] is a list of the frequencies found in the input. The END clause in the awk script printed an empty line to separate the two parts of the output followed by the elements of the array giving the NAME and the list of frequencies found for that NAME in a random output order. (The frequencies for a given NAME in the output appear in the order in which the entries were found in the input file.)

Please let me know if this still is not clear.

1 Like

Many thanks. The suggestion worked and above all thanks for having the patience and taking time off to explain the script.