Awk and duplicate lines - a little complicated

shadowww · March 11, 2012, 9:04am

So I've got a problem that continues my previous one from a few months ago
(Unix Linux Community - Technical support for all Unix and Linux users).

The good, proven, working solutions to that old problem are these:

awk '{cur=$0; gsub(/[^[:alnum:]]/, "", cur); if (!a[tolower(cur)]++) print}'

and

awk '{s=tolower($0);gsub("[^[:alnum:]]","",s);x[s]=$0} END {for(i in x) print x[i]}'

These two approaches yield the same results (but with a different final order of lines, which really doesn't matter to me).
Either of these one-liners is also what I now need modified to work a little differently, and that is the purpose of this new topic:

I no longer need awk to consider and compare whole lines when it searches the file for duplicates. It should compare only the first part of each line, up to the first '*' (asterisk). The asterisk is the separator in my file, and awk should ignore everything that comes after it, as if it had reached the end of the line. An asterisk occurs in every line of the file, but sometimes there is more than one per line. This should not confuse awk: it should still take into account only the part of the line before the first asterisk.

If someone can come up with a good solution for this, it would save me a week of work... and earn my eternal gratitude.

Scrutinizer · March 11, 2012, 9:20am

Try:

awk -F\* '{cur=$1; gsub(/[^[:alnum:]]/, "", cur); if (!a[tolower(cur)]++) print}'

or the same, a bit shorter:

awk -F\* '{s=$1; gsub(/[^[:alnum:]]/,x,s)} !a[tolower(s)]++'
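For anyone who wants to check the behaviour before running it on real data, here is a minimal sketch on made-up input (the sample lines and the temporary file are assumptions for illustration, not from the original file). The key is the text before the first asterisk, stripped of non-alphanumerics and lowercased; the shorter form works because x is never assigned, so awk treats it as an empty string. Dropping the ! from the pattern prints the lines that would be removed instead, which is handy for inspecting what gets thrown away.

```shell
# Hypothetical sample data, made up for illustration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Foo Bar*first*extra
foo-bar*second
FOO, BAR!*third
Baz*fourth
b.a.z*fifth*more
EOF

# Keep the first line seen for each key. The key is field 1 (everything
# before the first '*'), with non-alphanumerics removed and lowercased.
kept=$(awk -F'*' '{s=$1; gsub(/[^[:alnum:]]/,"",s)} !a[tolower(s)]++' "$sample")

# Same key, inverted test: print only the lines the command above drops.
dropped=$(awk -F'*' '{s=$1; gsub(/[^[:alnum:]]/,"",s)} a[tolower(s)]++' "$sample")

printf 'kept:\n%s\n\ndropped:\n%s\n' "$kept" "$dropped"
rm -f "$sample"
```

On this sample the first command keeps the "Foo Bar" and "Baz" lines, and the second lists the other three.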

shadowww · March 11, 2012, 10:30am

Yep, seems like that is exactly what I wanted, only kinda surprised it cut my file almost in half o_o
Need to test a little bit more...

edit: well, this is it 100%
tested and retested, thanks Scrutinizer, love you!