Getting the most common column with respect to another

teefa · August 20, 2013, 5:30am

Hi all,

I want to get the most common value of one column with respect to another.

This is my file:

Tom|london
Tom|london
Tom|Paris
Adam|Madrid
Adam|NY

The output I want is:

Tom|london
Adam|Madrid

I've tried

sort -u -t"|" -k1,1

but it only gives one line per name, the first appearance, not the most repeated one.

Skrynesaver · August 20, 2013, 5:54am

perl -lne '($name,$location)=split/\|/,$_;$locations{$name}{$location}++; END{for $name (keys %locations){$max=0;for $location (keys %{$locations{$name}}){if ($locations{$name}{$location}>$max){$most_frequent=$location;$max= $locations{$name}{$location};}} print "$name|$most_frequent"}}' file

teefa · August 20, 2013, 6:40am

Thanks a lot, it's so fast. I've tried it on a 15 million record file and got the result in nearly 3 minutes, but some numbers were wrong. Can it be a sort problem? When I tried it on 100 unsorted numbers it gave the right answer, but for the whole file it made mistakes.

I think it gave me the first appearance only!

krishmaths · August 20, 2013, 7:16am

awk '{++a[$1]} END{for(i in a){print i"|"a[i]}}' inputfile | awk -F"|" '{if($3>b[$1]){b[$1]=$3;c[$1]=$2}} END{for(i in b){print i"|"c[i]}}'

If you have two mappings for a given first field in input, say

Adam|Madrid
Adam|NY

then the output will be just Adam|Madrid. I'm not sure if this is what you wanted.

rdcwayx · August 20, 2013, 7:26am

:):):):):):)

krishmaths · August 20, 2013, 7:32am

This may not work if the input has

Tom|london
Tom|london
Tom|Paris
Adam|Madrid
Adam|NY
Tom|amsterdam
Tom|amsterdam
Tom|amsterdam
Tom|amsterdam

teefa · August 20, 2013, 8:12am

@Krish thanks. I hope it can be as fast as the Perl one, since I deal with huge files. Can't you give me similar functionality in Perl, or adjust the one above? It needs to be very fast.
@rdc you have to run sort | uniq -c | sort -nr, and it takes a lot of time while writing.
And thanks a lot.
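
For comparison, the sort | uniq -c | sort -nr route mentioned here can be sketched as a small shell function. The name most_common_sorted is made up for this illustration, and the sketch assumes the first field never contains a "|". It sorts the whole input twice, which is a plausible reason it feels slow on huge files.

```shell
# Hypothetical helper: most frequent location per name via sort | uniq -c.
# Reads name|location lines on stdin.
most_common_sorted() {
  sort |          # bring identical lines together
  uniq -c |       # prefix each distinct line with its count
  sort -nr |      # highest counts first
  awk '{
    sub(/^ *[0-9]+ /, "")      # drop the count that uniq -c added
    split($0, f, "|")          # f[1] = name, f[2] = location
    if (!(f[1] in seen)) {     # first line seen per name has the top count
      seen[f[1]] = 1
      print
    }
  }'
}
```

Usage would be something like most_common_sorted < file. When two locations are tied for a name, which one comes out depends on how sort orders the tied lines.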

Ygor · August 20, 2013, 8:36am

Try...

$ cat file1
Tom|london
Tom|london
Tom|Paris
Adam|Madrid
Adam|NY
Tom|amsterdam
Tom|amsterdam
Tom|amsterdam
Tom|amsterdam

$ awk -F'|' '{c=++a[$1,$2];if(c>b[$1]){b[$1]=c;d[$1]=$2}}END{for(i in d)print i FS d[i]}' file1 > file2

$ cat file2
Adam|Madrid
Tom|amsterdam

$
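
The one-liner above is dense, so here is the same single-pass idea spelled out with comments. This is only a sketch: the function name most_common and the array names are invented for readability, and it assumes the input file fits the name|location layout shown in the thread.

```shell
# Hypothetical wrapper around the single-pass awk approach.
# Reads name|location lines from the file given as the first argument.
most_common() {
  awk -F'|' '
    {
      count = ++seen[$1, $2]     # occurrences of this name/location pair so far
      if (count > best[$1]) {    # strictly greater, so on a tie the earlier leader stays
        best[$1] = count
        winner[$1] = $2
      }
    }
    END {
      for (name in winner)       # awk does not specify the iteration order here
        print name FS winner[name]
    }
  ' "$1"
}
```

It keeps one counter per distinct name/location pair in memory and never sorts the input. Piping the result through sort (most_common file1 | sort > file2) gives a stable output order if that matters.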

teefa · August 21, 2013, 2:51am

It gave a syntax error. Kindly, I need any script that finishes within 6 minutes at most, as the job is time-critical for huge files.

krishmaths · August 21, 2013, 3:35am

I was able to execute Ygor's solution without any issues. Could you please post the error and mention the flavor of your OS?

teefa · August 21, 2013, 4:03am

I'm using Solaris 10. Sorry, I've tried /usr/xpg4/bin/awk.

In the end, thanks all: both Perl and awk work correctly, but Perl is much faster.
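
On Solaris 10 the default /usr/bin/awk is the legacy implementation, which is the likely source of the syntax error, while /usr/xpg4/bin/awk is the POSIX one. A minimal sketch for picking the right binary in a script follows; the function name pick_awk and the variable AWK are assumptions made for this example.

```shell
# Hypothetical helper: prefer the POSIX awk where it exists (as on Solaris),
# otherwise fall back to whatever awk is on PATH.
pick_awk() {
  if [ -x /usr/xpg4/bin/awk ]; then
    echo /usr/xpg4/bin/awk
  else
    echo awk
  fi
}

AWK=$(pick_awk)
```

The awk commands from this thread can then be run as "$AWK" -F'|' '...' file1 without editing them per machine.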