AWK Sorting with range values

Diwakar9 · October 19, 2012, 4:32pm

Hello,

I am looking for some help with a GAWK script. I have a list of phone numbers as below.
I need to group these by their first 6 digits and report the range covered by the remaining digits.

So for the above sample data, I require an output as below:

240217	0338	0744
240310	0025	0028

The pattern is:
<first six digits> <range start> <range end>

Please suggest a solution for this.

rdrtx1 · October 19, 2012, 4:59pm

 
sort -n infile | awk '
{
  n=substr($0,1,6);
  r=substr($0,7);
  if (!a[n]++) {
    p[pc++]=n;
    l[n]=9999999;
    h[n]=-9999999;
  }
  r < l[n] ? l[n]=r : 0;
  r > h[n] ? h[n]=r : 0;
}
END {
  for (i=0; i<pc; i++) print p[i], l[p[i]], h[p[i]];
}'

Don_Cragun · October 19, 2012, 8:45pm

The script provided by rdrtx1 can be simplified since we know the input is sorted numerically by the time awk sees it:

sort -n infile | awk '
{       if(last != substr($0, 1, 6)) {
                if(last) print last, low, high
                last = substr($0, 1, 6)
                low = substr($0, 7)
        }
        high = substr($0, 7)
}
END {   if(last) print last, low, high}'

If you know that your input file will never be an empty file, you can omit if(last) from the last line of the script.
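
As a quick check of the single-pass idea above, here is a minimal sketch that feeds the same logic a few made-up numbers. The sample values are an assumption (the original input list is not shown in this thread); they were chosen only so the prefixes match the requested output. The fields are separated by awk's default output separator (a space), not tabs.

```shell
# Hypothetical, deliberately unsorted sample input: ten-digit numbers, one per line.
result=$(printf '%s\n' 2403100028 2402170744 2402170338 2403100025 2402170500 |
sort -n | awk '
{       prefix = substr($0, 1, 6)          # first six digits
        if (prefix != last) {              # a new prefix starts a new range
                if (last) print last, low, high
                last = prefix
                low = substr($0, 7)        # first line of a sorted group is the lowest
        }
        high = substr($0, 7)               # latest line seen is the highest so far
}
END {   if (last) print last, low, high }')

printf '%s\n' "$result"
```

With this input the script prints one line per prefix: `240217 0338 0744` and `240310 0025 0028`.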

alister · October 20, 2012, 12:11am

I don't recommend the following over Don's suggestion. My approach is less efficient and less maintainable. I offer it only for your amusement, as my attempt at a shortest solution which restricts itself to POSIX-standard utilities.

sort -n infile | sed 's/./& /6' | awk '$1""!=n {print _; n=$1} 1' | awk '{print $1,$2,$NF}' RS=
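
Since the one-liner above is dense, here is a rough sketch of what each stage emits. The sample numbers and the `stage1`..`stage3` variable names are assumptions added for illustration; the commands themselves are the ones from the pipeline.

```shell
# Hypothetical sample input, already sorted.
input=$(printf '%s\n' 2402170338 2402170500 2402170744 2403100025 2403100028)

# Stage 1: sed puts a space after the 6th character, e.g. "240217 0338".
stage1=$(printf '%s\n' "$input" | sed 's/./& /6')

# Stage 2: awk prints an empty line (the unset variable _) whenever the
# prefix changes, then every input line, so groups become paragraphs.
stage2=$(printf '%s\n' "$stage1" | awk '$1"" != n {print _; n = $1} 1')

# Stage 3: with RS set to the empty string, awk reads one paragraph per record
# and newlines also separate fields, so $2 is the first suffix and $NF the last.
stage3=$(printf '%s\n' "$stage2" | awk '{print $1, $2, $NF}' RS=)

printf '%s\n' "$stage3"
```

The leading empty line produced by stage 2 is harmless, because paragraph mode ignores empty lines at the start of the input.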

Note: The pair of double-quotes after $1 shouldn't be necessary, but the oooold mawk that I was testing with wouldn't treat $1 as a string (which current POSIX rules require in that context).
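
To illustrate what appending "" changes, here is a small sketch with made-up input. Fields that look numeric are compared as numbers, while concatenating an empty string forces a string comparison. On a current gawk or mawk I would expect it to print `1 0`. For fixed-width six-digit prefixes both kinds of comparison give the same answer, so the distinction only matters for awk implementations that deviate.

```shell
# Hypothetical input: two fields that are numerically equal but textually different.
result=$(echo '0123 123' | awk '{
        numeric = ($1 == $2)          # both fields look numeric: compared as numbers
        textual = ($1 "" == $2 "")    # concatenating "" yields strings: compared as text
        print numeric, textual
}')

echo "$result"
```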

If anyone can conjure something shorter, I'd love to see it.

Regards,
Alister

elixir_sinari · October 20, 2012, 3:06am

Dropping the sed (and, consequently, a process) and making it just a little shorter :D

sort -n infile|awk '{sub(/.{6}/,"& ")} $1!=n{print _; n=$1}1'|awk '{print $1,$2,$NF}' RS=

I have dropped the double-quotes around $1 as the OP seems to be using gawk, which I think will use string comparison in that case.

pamu · October 20, 2012, 4:18am

My try on it.....

sort -n file | sed 's/./& /6' | awk '!X[$1]++{printf b?" "b"\n"$0:$0}{b=$2}END{printf " "b}'

alister · October 20, 2012, 11:11am

Nope. Your suggestion is actually longer. It may be more efficient without the extra process in the pipeline, but it uses more characters.

Nice try, pamu, but yours is even a bit longer than elixir's. Also, your suggestion does not output a valid text file (it's missing the final newline).
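
One way to see the missing final newline is to count newline characters with wc -l. The sketch below uses assumed sample data and a hypothetical helper function `sample`; the second awk command is my own variation that adds the "\n" in END, not something posted in this thread.

```shell
# Hypothetical sample input, already sorted.
sample() { printf '%s\n' 2402170338 2402170744 2403100025 2403100028; }

# pamu's script as posted: END uses printf without a trailing "\n".
without_newline=$(sample | sed 's/./& /6' |
        awk '!X[$1]++{printf b?" "b"\n"$0:$0}{b=$2}END{printf " "b}' |
        wc -l | tr -d '[:space:]')

# Variation (an assumption, for comparison only): END emits the newline explicitly.
with_newline=$(sample | sed 's/./& /6' |
        awk '!X[$1]++{printf b?" "b"\n"$0:$0}{b=$2}END{printf " "b"\n"}' |
        wc -l | tr -d '[:space:]')

echo "newlines: $without_newline vs $with_newline"
```

Both commands produce two lines of text, but the first reports only one newline because the last line is left unterminated.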

Removing unnecessary whitespace and using filename "f":

alister: 79
sort -n f|sed 's/./& /6'|awk '$1!=n{print _;n=$1}1'|awk '{print $1,$2,$NF}' RS=

elixir: 82
sort -n f|awk '{sub(/.{6}/,"& ")}$1!=n{print _;n=$1}1'|awk '{print $1,$2,$NF}' RS=

pamu: 85 (but not quite a correct solution)
sort -n f|sed 's/./& /6'|awk '!X[$1]++{printf b?" "b"\n"$0:$0}{b=$2}END{printf " "b}'

While I've not removed it, I believe there is no need to use sort's -n option. If the numbers are all the same length, lexicographical and numerical sorting will yield identical results, unless there exists a locale in which the digits do not sort from 0 to 9. If such a locale exists, I would truly appreciate being made aware of it.

Regards,
Alister

Don_Cragun · October 20, 2012, 12:43pm

In section 5.2.1 of the C Standard titled Character Sets it says:

In the POSIX Standards, XBD section 6.1 titled Portable Character Set places this requirement from the C Standard on all locales supported by a system conforming to the POSIX Standards or to the more stringent requirements of the Single UNIX Specification.

So, the -n isn't needed in sort when the strings of digits being sorted are all the same length.
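
A small sketch of that point, using made-up values: equal-length digit strings sort identically with and without -n, while mixed-length numbers do not. The variable names are assumptions for illustration, and LC_ALL=C is set only in the counter-example to keep its ordering predictable.

```shell
# Hypothetical equal-length input: plain sort and sort -n should agree.
input=$(printf '%s\n' 2403100028 2402170744 2402170338 2403100025 2402170500)
lexical=$(printf '%s\n' "$input" | sort)
numeric=$(printf '%s\n' "$input" | sort -n)
[ "$lexical" = "$numeric" ] && echo "equal length: same order"

# Counter-example with mixed lengths: lexical order gives 10, 100, 9,
# numeric order gives 9, 10, 100.
mixed=$(printf '%s\n' 10 9 100)
mixed_lexical=$(printf '%s\n' "$mixed" | LC_ALL=C sort)
mixed_numeric=$(printf '%s\n' "$mixed" | sort -n)
[ "$mixed_lexical" != "$mixed_numeric" ] && echo "mixed length: different order"
```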

Although it doesn't matter in this case, note that the same can't be said for uppercase or lowercase letters. Adding 0 through 25 to 'a' ('A') to get the lowercase (uppercase) letters in sequence happens to work in supersets of ASCII, but it won't work in EBCDIC. There are gaps between 'i' and 'j' ('I' and 'J') and between 'r' and 's' ('R' and 'S') in EBCDIC.

pamu · October 20, 2012, 1:27pm

Yes... But after fixing the missing newline, the length dropped by one (still behind you and elixir.. :D)
Now 84..

sort -n file | sed 's/./& /6' | awk '!X[$1]++{printf b?b"\n"$0" ":$0" "}{b=$2}END{print b}'

And with this, 83..
Actually, I'm not sure what to call this; it may be cheating or not a good way of programming, but it helps to cut one more character (k is never set, so print k,b emits just the separator followed by b).
Still a few more needed..

sort -n file | sed 's/./& /6' | awk '!X[$1]++{printf b?" "b"\n"$0:$0}{b=$2}END{print k,b}'

Diwakar9 · October 22, 2012, 10:05am

Hello Don, your solution worked very nicely. Thanks for the precise code.