Sorting based on multiple delimiters

gimley · May 2, 2011, 10:51pm

Hello,
I have data in which words are separated by a delimiter, in this case "=".
The number of delimiters in a line can vary from 4 to 8. The norm is 4.
Is it possible to have a script that groups the lines of the file by delimiter count, starting with the highest number of delimiters and ending with the lowest?
An example is given below:

INPUT

a=b=c=d=e
a=b=c=d=e=d=g=a=b
a=b=c=d=e=f
a=b=c=d=e=f=g
a=b=c=d=e=f=g=a
a=b=c=d=e=f=g=a=c
a=b=c=d=e=f=h
a=b=n=d=e=f
a=b=p=d=e
h=b=c=d=e=f=g=a

EXPECTED OUTPUT
What I would like is the following:

8 delimiters
a=b=c=d=e=d=g=a=b
a=b=c=d=e=f=g=a=c
7 delimiters
h=b=c=d=e=f=g=a
a=b=c=d=e=f=g=a
6 delimiters
a=b=c=d=e=f=g
a=b=c=d=e=f=h
5 delimiters
a=b=n=d=e=f
a=b=c=d=e=f
4 delimiters
a=b=c=d=e
a=b=p=d=e

The file is very large, around 300,000 lines.
I know that a regex can do the job, but I don't know how to use regexes in awk or perl.

Many thanks in advance

Sorry for the multiple posting. My network collapsed just as I was submitting the file.

rdcwayx · May 2, 2011, 11:33pm

awk -F = 'NR==1{max=NF;min=NF}   # seed the field-count range from the first line
         {max=(max>NF)?max:NF;min=(min<NF)?min:NF;a[NF]=(a[NF]=="")?$0:a[NF] ORS $0}   # collect lines per field count
    END{for (i=max;i>=min;i--) {if (a[i]!="") print i-1 " delimiters" ORS a[i]}}' infile

8 delimiters
a=b=c=d=e=d=g=a=b
a=b=c=d=e=f=g=a=c
7 delimiters
a=b=c=d=e=f=g=a
h=b=c=d=e=f=g=a
6 delimiters
a=b=c=d=e=f=g
a=b=c=d=e=f=h
5 delimiters
a=b=c=d=e=f
a=b=n=d=e=f
4 delimiters
a=b=c=d=e
a=b=p=d=e

gimley · May 3, 2011, 12:10am

Hello,
I tested the script on the file and what I get is the message
0 delimiters
followed by the full set of sample test data.
I checked the script and the syntax suggests that the lines should be sorted by number of delimiters.
What has gone wrong?
I am enclosing the testdata as a zip file.
Many thanks

kevintse · May 3, 2011, 12:24am

Try this:

awk -F= '{print NF, $0}' infile | sort -k1 -nr | awk '!d||$1!=d{d=$1; print d-1 " delimiters"}{print $2}'

rdcwayx · May 3, 2011, 2:32am

I found no problem.

If you are running awk on Solaris, please replace the command with nawk or /usr/xpg4/bin/awk

awk -F = 'NR==1{max=NF;min=NF}
         {max=(max>NF)?max:NF;min=(min<NF)?min:NF;a[NF]=(a[NF]=="")?$0:a[NF] ORS $0}
    END{for (i=max;i>=min;i--) {if (a[i]!="") print i-1 " delimiters" ORS a[i]}}' test |head -10

6 delimiters
pathan=inayat=khan=rashid=khan=sahebzadi=m
shiv=ram=tandale=ganesh=laxman=hirabai=m
5 delimiters
gore=bibi=sakina=irfanali=tayeba=f
jamadar=aves=ahmed=ashfaque=sherbano=m
ram=tandale=ganesh=laxman=hirabai=m
4 delimiters
kale=amita=bhanudas=shobha=f
lande=amit=chandrabhan=asha=m

---------- Post updated at 04:32 PM ---------- Previous update was at 04:25 PM ----------

Clever way.

A small adjustment (!a[$1]++) makes it look better, and -k1 is unnecessary.

awk -F= '{print NF, $0}' infile | sort -nr |awk '!a[$1]++ {print $1-1 " delimiters" }{print $2}'

kevintse · May 3, 2011, 3:13am

rdcwayx:

I found no problem.

If you are running awk on Solaris, please replace the command with nawk or /usr/xpg4/bin/awk
awk -F = 'NR==1{max=NF;min=NF}
   {max=(max>NF)?max:NF;min=(min<NF)?min:NF;a[NF]=(a[NF]=="")?$0:a[NF] ORS $0}
   END{for (i=max;i>=min;i--) {if (a[i]!="") print i-1 " delimiters" ORS a[i]}}' test |head -10

6 delimiters
pathan=inayat=khan=rashid=khan=sahebzadi=m
shiv=ram=tandale=ganesh=laxman=hirabai=m
5 delimiters
gore=bibi=sakina=irfanali=tayeba=f
jamadar=aves=ahmed=ashfaque=sherbano=m
ram=tandale=ganesh=laxman=hirabai=m
4 delimiters
kale=amita=bhanudas=shobha=f
lande=amit=chandrabhan=asha=m
---------- Post updated at 04:32 PM ---------- Previous update was at 04:25 PM ----------

Clever way.

A small adjustment (!a[$1]++) makes it look better, and -k1 is unnecessary.
awk -F= '{print NF, $0}' infile | sort -nr |awk '!a[$1]++ {print $1-1 " delimiters" }{print $2}'

!a[$1]++ does look better, but it carries slightly more overhead than !d||$1!=d, because it has to increment a[$1] for every line.
And again, -k1 is not useless; it is there for performance. If it is left out, sort has to compare the entire line, whereas if it is present, sort only needs to compare the first field (the delimiter count).
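
A note on the sort key: in POSIX sort, a bare -k1 defines a key that starts at field 1 and runs to the end of the line, so limiting the comparison to the count alone takes -k1,1. The following is a minimal sketch of that variant, not a benchmark. It assumes a sort that supports -s (stable ordering, available in GNU and BSD sort but not required by POSIX), and sample.txt is a made-up file name used only for illustration.

```shell
# Hypothetical sample data: two lines with 7 delimiters, two with 4.
printf '%s\n' 'a=b=c=d=e' 'h=b=c=d=e=f=g=a' 'a=b=c=d=e=f=g=a' 'a=b=p=d=e' > sample.txt

# 1. Prefix each line with its delimiter count (fields minus one).
# 2. Sort numerically, descending, on the count only; -s keeps lines with
#    equal counts in their original input order.
# 3. Print a header whenever the count changes, then the line without its prefix.
awk -F= '{print NF-1, $0}' sample.txt |
  sort -s -k1,1nr |
  awk 'NR == 1 || $1 != prev {print $1 " delimiters"; prev = $1}
       {print substr($0, length($1) + 2)}'
```

Printing substr($0, ...) instead of $2 keeps any spaces inside the original line, and the NR == 1 test makes sure a header is printed even when the first count is 0.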

gimley · May 3, 2011, 4:00am

Hello,
Unfortunately I am working on Windows and have to fall back on GAWK/NAWK for Windows.
Maybe this is the reason why I get the message
0 delimiters.
I should have mentioned this at the outset. Sorry for the hassle. Is any workaround possible?

kevintse · May 3, 2011, 4:06am

I am working on Windows (Cygwin), too, and I am using gawk.
If you are not using Cygwin, I recommend you give it a shot.

gimley · May 4, 2011, 12:20am

Sorry. I tried Cygwin: I had a system crash and am back to Windows after a painful recovery. Is there any hope of an alternative solution? I have tried:
awk
nawk
gawk32
but I consistently get the standard 0 delimiters.
Guess I'll have to use the painful method of identifying the data through regexes and getting the words out.
Many thanks for all your advice.
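
A "0 delimiters" header followed by every line means awk saw a single field on each line, i.e. it never split on "=". One approach that does not depend on how -F is interpreted is to count the "=" characters directly with a regex, and to keep the program in a file so that no command-line quoting is involved. What follows is a minimal sketch under those assumptions, not tested on native Windows awk ports. The names bydelims.awk and sample.txt are made up for illustration, and the heredoc is only a convenient way to create the script file (it could equally be saved from a text editor).

```shell
# Hypothetical sample data taken from the input shown in the first post.
printf '%s\n' 'a=b=c=d=e' 'a=b=c=d=e=d=g=a=b' 'a=b=c=d=e=f' 'a=b=p=d=e' > sample.txt

# Keep the awk program in its own file; it is run with "awk -f".
cat > bydelims.awk <<'EOF'
{
    line = $0
    n = gsub(/=/, "=", line)          # gsub returns how many "=" it matched
    if (n in group) group[n] = group[n] ORS $0
    else            group[n] = $0     # first line seen with this count
    if (NR == 1 || n > max) max = n   # track the range of counts
    if (NR == 1 || n < min) min = n
}
END {
    # Walk from the highest count to the lowest, skipping unused counts.
    for (i = max; i >= min; i--)
        if (i in group)
            print i " delimiters" ORS group[i]
}
EOF

awk -f bydelims.awk sample.txt
```

Because the count comes from gsub rather than NF, the result is the same whatever field separator awk happens to be using.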