Convert ip ranges to CIDR netblocks

Hi,

Recently I had to convert a file of 280K lines of IP ranges to CIDR notation and generate a file to be used by ipset (netfilter) for IP filtering.

Input file:

000.000.000.000 - 000.255.255.255 , 000 , invalid ip
001.000.064.000 - 001.000.127.255 , 000 , XXXXX
001.000.245.123 - 001.000.245.123 , 000 , YYYYYY YYYYY
001.002.002.000 - 001.002.002.255 , 000 , ZZZ ZZZ ZZ
001.002.004.000 - 001.002.004.255 , 000 , AAAA AA

Some of the ranges contain only a single IP.

Required output:

-N cidr nethash --maxelem 260000
-N single iphash --maxelem 60000
-A cidr 0.0.0.0/8
-A cidr 1.0.64.0/18
-A single 1.0.245.123
-A cidr 1.2.2.0/24
-A cidr 1.2.4.0/24
COMMIT

As I got nowhere with awk (the CIDR conversion being the culprit), I found a solution with Python and its netaddr module:

#!/usr/bin/python3

"""
Usage: ip2cidr.py input_file
"""

import sys, re, netaddr

def sanitize (ip):
	seg = ip.split('.')
	return '.'.join([ str(int(v)) for v in seg ])

# pointer to input file
fp_source = open(sys.argv[1], "r")

# pointer to outfile
fp_outfile = open('ip.ipset', "w")

ptrnSplit = re.compile(' - | , ')

# Write ipset header to outfile
fp_outfile.write('-N cidr nethash --maxelem 260000\n-N single iphash --maxelem 60000\n',)

for line in fp_source:
	
	# split on ' - ' and ' , '
	s = re.split(ptrnSplit, line)
	
	# sanitize ip: 001.004.000.107 --> 1.4.0.107 to avoid netaddr err.
	ip = [ sanitize(v) for v in s[:2] ]
	
	# conversion ip range to CIDR netblocks
	# single ip in range
	if ip[0] == ip[1]:
		fp_outfile.write('-A single %s\n' % ip[0])
		
	# multiple ip's in range
	else:
		ipCidr = netaddr.IPRange(ip[0], ip[1])
		for cidr in ipCidr.cidrs():
			fp_outfile.write('-A cidr %s\n' % cidr)

fp_outfile.write('COMMIT\n')

Time to process the 280K ip ranges: 4 minutes.
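
For readers without netaddr, the range-to-CIDR step can also be sketched with Python's standard ipaddress module. This is only a minimal sketch and not the script timed above: range_to_ipset_lines is a hypothetical helper name, and it assumes every line matches the input sample exactly ("start - end , code , label").

```python
import ipaddress


def sanitize(ip):
    # "001.000.064.000" -> "1.0.64.0"; recent Python versions reject
    # octets with leading zeros, so strip them first.
    return ".".join(str(int(octet)) for octet in ip.split("."))


def range_to_ipset_lines(line):
    """Turn one 'start - end , code , label' line into ipset lines."""
    start_text, rest = line.split(" - ", 1)
    end_text = rest.split(" , ", 1)[0]
    first = ipaddress.IPv4Address(sanitize(start_text.strip()))
    last = ipaddress.IPv4Address(sanitize(end_text.strip()))
    if first == last:
        # a range holding a single address goes to the iphash set
        return ["-A single %s" % first]
    # summarize_address_range yields the CIDR blocks covering the range
    return ["-A cidr %s" % net
            for net in ipaddress.summarize_address_range(first, last)]


if __name__ == "__main__":
    for sample in ("001.000.064.000 - 001.000.127.255 , 000 , XXXXX",
                   "001.000.245.123 - 001.000.245.123 , 000 , YYYYYY YYYYY"):
        print("\n".join(range_to_ipset_lines(sample)))
```

Run on the two sample lines, this prints "-A cidr 1.0.64.0/18" and "-A single 1.0.245.123", matching the required output shown earlier.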

As I found that time on the high side and had a couple of days off, I decided to give awk another try:

@include "lib_netaddr.awk"

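# strip leading zeros: 001.004.000.107 --> 1.4.0.107
# (slice is declared as a local so it does not leak into the global scope)
function sanitize(ip,   slice) {
	split(ip, slice, ".")
	return slice[1]/1 "." slice[2]/1 "." slice[3]/1 "." slice[4]/1
}

BEGIN{
	FS=" , | - "
	# print already appends a newline, so no trailing "\n" here
	print "-N cidr nethash --maxelem 260000\n-N single iphash --maxelem 60000"
}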

# sanitize ip's
{$1 = sanitize($1); $2 = sanitize($2)}

# range with a single IP
$1==$2 {printf "-A single %s\n", $1} 

# ranges with multiple IP's
$1!=$2{print range2cidr(ip2dec($1), ip2dec($2))}

# footer
END {print "COMMIT"}

lib_netaddr.awk

#
#    Library with various ip manipulation functions
#

# convert ip ranges to CIDR notation
# str range2cidr(ip2dec("192.168.0.15"), ip2dec("192.168.5.115"))
#
# Credit to Chubler_XL for this brilliant function. (see his post below for non GNU awk)
#
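function range2cidr(ipStart, ipEnd,  bits, mask, newip, result) {
    # bits = number of host bits in the current block, mask = 2^bits - 1
    # (result is a local, so the recursive call cannot overwrite it)
    bits = 0
    mask = 0
    # grow the block while it stays aligned on ipStart and inside the range;
    # testing the next size before accepting it keeps 0.0.0.0 - 127.255.255.255
    # from being reported as /0
    while (bits < 32) {
        newip = or(ipStart, lshift(mask,1)+1)
        if ((newip>ipEnd) || ((lshift(rshift(ipStart,bits+1),bits+1)) != ipStart))
           break
        bits++
        mask = lshift(mask,1)+1
    }
    newip = or(ipStart, mask)
    result = "-A cidr " dec2ip(ipStart) "/" (32 - bits)
    # cover whatever is left of the range with further blocks
    if (newip < ipEnd) result = result "\n" range2cidr(newip + 1, ipEnd)
    return result
}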

# convert dotted quads to long decimal ip
#	int ip2dec("192.168.0.15")
#
function ip2dec(ip,   slice) {
	split(ip, slice, ".")
	return (slice[1] * 2^24) + (slice[2] * 2^16) + (slice[3] * 2^8) + slice[4]
}

# convert decimal long ip to dotted quads
#	str dec2ip(1171259392)
#
function dec2ip(dec,    ip, quad, i) {
	for (i=3; i>=1; i--) {
		quad = 256^i
		ip = ip int(dec/quad) "."
		dec = dec%quad
	}
	return ip dec
}


# convert decimal ip to binary
#	str dec2binary(1171259392)
#
function dec2binary(dec,    bin) {
	while (dec>0) {
		bin = dec%2 bin
		dec = int(dec/2)
	}
	return bin
}

# Convert binary ip to decimal
#	int binary2dec("1000101110100000000010011001000")
#
function binary2dec(bin,   slice, l, dec, i) {
	# splitting on "" (one character per element) is a gawk/mawk extension
	split(bin, slice, "")
	l = length(bin)
	for (i=l; i>0; i--) {
		dec += slice[i] * 2^(l-i)
	}
	return dec
}

# convert dotted quad ip to binary
#	str ip2binary("192.168.0.15")
#
function ip2binary(ip) {
	return dec2binary(ip2dec(ip))
}


# distance between the two ends of a dotted quad ip range;
# add 1 to get the number of ip's in the (inclusive) range
#	int countQuadIp("192.168.0.0", "192.168.1.255") + 1
#
function countQuadIp(ipStart, ipEnd) {
	return (ip2dec(ipEnd) - ip2dec(ipStart))
}


# count the number of ip's in a CIDR block
#	int countCidrIp ("192.168.0.0/12")
#
function countCidrIp (cidr) {
	sub(/.+\//, "", cidr)
	return 2^(32-cidr)
}

Time to process: 16 sec. A whopping 15 times faster! Not bad for a 43-year-old language! And it's even faster with mawk: 7 sec.

Please note that the @include only works with gawk. If you are using the original awk or the lightning fast mawk, you will have to copy/paste the functions library into your main script.

If you find this awk library useful, or if you think it needs optimizing, let me know before I submit it to the Tips & Tutorials section.
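
The idea behind range2cidr() can be sketched in Python for readers who want to cross-check results outside awk: starting at the first address, take the largest block that is aligned on that address and still fits inside the range, then repeat from the next free address. This is an illustrative re-implementation under that assumption, not a line-by-line port of the awk function, and it returns a list of blocks without the "-A cidr " prefix.

```python
def ip2dec(ip):
    # "192.168.0.15" -> 3232235535
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def dec2ip(dec):
    # 3232235535 -> "192.168.0.15"
    return ".".join(str((dec >> shift) & 255) for shift in (24, 16, 8, 0))


def range2cidr(start, end):
    """Return the CIDR blocks that exactly cover start..end (inclusive)."""
    blocks = []
    while start <= end:
        bits = 0  # host bits of the block that begins at 'start'
        while bits < 32:
            size = 1 << (bits + 1)  # candidate block size, one bit larger
            if start % size != 0 or start + size - 1 > end:
                break  # misaligned, or it would run past the range
            bits += 1
        blocks.append("%s/%d" % (dec2ip(start), 32 - bits))
        start += 1 << bits  # continue right after the block just emitted
    return blocks


if __name__ == "__main__":
    print(range2cidr(ip2dec("192.168.0.15"), ip2dec("192.168.0.18")))
    # ['192.168.0.15/32', '192.168.0.16/31', '192.168.0.18/32']
```

The odd start address 192.168.0.15 can only form a /32, 192.168.0.16 is aligned on a pair but a block of four would overshoot the end, and the last address is left over as another /32.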


How about this for range2cidr (call it like this: range2cidr(ip2dec($1), ip2dec($2))):

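function range2cidr(ipStart, ipEnd,  bits, mask, newip, result) {
    # bits = number of host bits in the current block, mask = 2^bits - 1
    # (result is a local, so the recursive call cannot overwrite it)
    bits = 0
    mask = 0
    # grow the block while it stays aligned on ipStart and inside the range;
    # testing the next size before accepting it keeps 0.0.0.0 - 127.255.255.255
    # from being reported as /0
    while (bits < 32) {
        newip = or(ipStart, lshift(mask,1)+1)
        if ((newip>ipEnd) || ((lshift(rshift(ipStart,bits+1),bits+1)) != ipStart))
           break
        bits++
        mask = lshift(mask,1)+1
    }
    newip = or(ipStart, mask)
    result = dec2ip(ipStart) "/" (32 - bits)
    # cover whatever is left of the range with further blocks
    if (newip < ipEnd) result = result "\n" range2cidr(newip + 1, ipEnd)
    return result
}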

---------- Post updated at 10:24 AM ---------- Previous update was at 08:31 AM ----------

Of course, this requires the following gawk bitwise functions: or(), lshift() and rshift().

We could replace these with local (bit_) variants for more portability:

# Bitwise OR of a and b (32-bit values)
function bit_or(a, b, r, i, c) {
    for (r=i=0;i<32;i++) {
        c = 2 ^ i
        if ((int(a/c) % 2) || (int(b/c) % 2)) r += c
    }
    return r
}


# Shift value left x times (multiply by 2^x)
function bit_lshift(var, x) {
  while(x--) var*=2;
  return var;
}

# Shift value right x times (integer divide by 2^x)
function bit_rshift(var, x) {
  while(x--) var=int(var/2);
  return var;
}
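
To see why plain arithmetic can stand in for the bitwise built-ins, here is a small Python sketch that mirrors the three bit_ functions and compares them with Python's native operators. The function names are copied from the awk above; the check itself is only an illustration, assuming non-negative values below 2^32.

```python
def bit_or(a, b):
    # test each of the 32 bit positions with division and modulo
    result = 0
    for i in range(32):
        weight = 2 ** i
        if (a // weight) % 2 or (b // weight) % 2:
            result += weight
    return result


def bit_lshift(value, count):
    # shifting left by one bit is the same as doubling
    return value * 2 ** count


def bit_rshift(value, count):
    # shifting right by one bit is the same as integer halving
    return value // 2 ** count


if __name__ == "__main__":
    start = 16793600  # 1.0.64.0 as a decimal
    mask = 2 ** 14 - 1  # host mask of a /18
    print(bit_or(start, mask) == (start | mask))
    print(bit_lshift(bit_rshift(start, 14), 14) == start)
```

Both comparisons print True: the emulated OR agrees with the native one, and shifting 1.0.64.0 right then left by 14 bits gives the address back, which is the alignment test range2cidr relies on.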

Brilliant. Works much better than my original range2cidr() function. I just edited my post above to include your function.

Well done!

Hi, I'm trying to convert a bulk list of IPs into the nearest CIDR blocks. I came across your script and tried to run the awk script in Cygwin. Can you show me the syntax for running the script along with the library file? Thanks

How to write the code depends entirely on what you want to do with it. Show the input you have and the output you want.

Hi, I have a list of about 3K IPs, which come from subnets ranging from very small to large. I want the IPs grouped into subnets that make sense. The scenario is that several groups get IPs based on availability, and no group should touch or scan another group's IPs. We get the list of IPs from a manual inventory by each group, and the key point is that the provider doesn't track which set of IPs belongs to which group.
So I collected all the IPs manually (around 3K) and want to turn them into subnets rounded to the nearest class. For example, a single IP address should round off to /32, and 4 IPs should round off to /29 or /30. I have CIDR tools that can do this, but they need manual input each time.

I'm looking for a way to put the 3K IPs into Excel or any other format and have a script round them off to the nearest subnet class.

The trick is deciding where it should cut off. Hypothetically speaking, you can encompass 1.1.1.1 and 254.254.254.254 with the mask 0.0.0.0, but I doubt you want that. You could also make 100% perfect groups with no empty spaces, but I doubt you want that either.

Instead of the first suggestion, I would rather go with second option you mentioned.

You could try sorting the IPs in ascending order and adding IPs to the current subnet until the standard deviation exceeds a limit, then ruling the subnet off at that point.

Below I've chosen a limit of 200; you can play with different values and see how the grouping comes out (somewhere between 100 and 3000 seems fairly good).

Note: just the starting and ending IPs are output for each subnet found, it wouldn't be too complex to determine the closest mask that covers both of these, if needed.

@include "lib_netaddr.awk"

function sanitize(ip) {
    split(ip, slice, ".")
    return slice[1]/1 "." slice[2]/1 "." slice[3]/1 "." slice[4]/1
}

function grpstd(val, tot, cnt, mean, sqtot) {
    for(val in grp) {
       tot=tot + grp[val]
       cnt++
    }
    mean = tot / cnt
    for(val in grp) {
       sqtot = sqtot + (grp[val] - mean) * (grp[val] - mean)
    }
    return sqrt(sqtot / cnt)
}

BEGIN { limit=200 }

{ k[NR]=ip2dec(sanitize($1)) }

END {
    n=asort(k)

    for(idx=1; idx <= n ; idx++) {
       grp[++have]=k[idx]
       if(grpstd() > limit) {
          print "Subnet from " dec2ip(grp[1]) " to " dec2ip(grp[have-1])
          have=split(grp[have], grp)
       }
    }
    if (have)
          print "Subnet from " dec2ip(grp[1]) " to " dec2ip(grp[have])
}

Also note I'm not a statistician and there are probably much more efficient ways this sort of thing could be achieved.
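
The note about finding the closest mask that covers a subnet's starting and ending IPs can be sketched as follows. This is a minimal Python sketch under the assumption that "closest" means the smallest single CIDR block containing both addresses; covering_cidr is a hypothetical helper name. The highest bit in which the two addresses differ decides how many host bits the block needs.

```python
def ip2dec(ip):
    a, b, c, d = (int(part) for part in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def dec2ip(dec):
    return ".".join(str((dec >> shift) & 255) for shift in (24, 16, 8, 0))


def covering_cidr(first, last):
    """Smallest single CIDR block that contains both addresses."""
    low, high = ip2dec(first), ip2dec(last)
    # XOR keeps only the differing bits; its bit length is the number of
    # host bits needed to span both addresses
    host_bits = (low ^ high).bit_length()
    network = (low >> host_bits) << host_bits  # clear the host bits
    return "%s/%d" % (dec2ip(network), 32 - host_bits)


if __name__ == "__main__":
    print(covering_cidr("10.10.1.25", "10.10.3.31"))        # 10.10.0.0/22
    print(covering_cidr("192.168.1.129", "192.168.1.188"))  # 192.168.1.128/26
    print(covering_cidr("1.1.1.1", "254.254.254.254"))      # 0.0.0.0/0
```

The last call shows the extreme case mentioned earlier in the thread: two addresses at opposite ends of the address space can only be covered by 0.0.0.0/0.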

Thanks, I will try testing and update back.

Here is an update that takes into account the subnet outer bounds. This reduces the occurrence of IPs belonging to adjacent subnets being swept up.

The output now includes the subnet mask and a count of IP(s) bounded.

@include "lib_netaddr.awk"

function sanitize(ip) {
    split(ip, slice, ".")
    return slice[1]/1 "." slice[2]/1 "." slice[3]/1 "." slice[4]/1
}

function snbounds(to,i) {
    sn_min=grp[1]
    sn_max=grp[to]

    for(sn_mask=32; sn_mask && sn_min != sn_max; sn_mask--) {
        sn_min = rshift(sn_min,1)
        sn_max = rshift(sn_max,1)
    }

    for(i=32; i>sn_mask; i--) {
        sn_min = lshift(sn_min,1) 
        sn_max = lshift(sn_max,1) + 1
    }
}

function grpstd(val, tot, cnt, mean, sqtot) {
    cnt = length(grp)
    snbounds(cnt)
    tot = sn_min + sn_max
    cnt += 2
    for(val in grp) tot=tot + grp[val]
    mean = tot / cnt
    sqtot = (sn_min - mean) * (sn_min - mean) + \
            (sn_max - mean) * (sn_max - mean)
    for(val in grp) {
       sqtot = sqtot + (grp[val] - mean) * (grp[val] - mean)
    }
    return sqrt(sqtot / cnt)
}

BEGIN { limit=1000 }

{ k[NR]=ip2dec(sanitize($1)) }

END {
    n=asort(k)

    for(idx=1; idx <= n ; idx++) {
       grp[++have]=k[idx]
       # print dec2ip(grp[have]) " std: " grpstd()
       if(grpstd() > limit) {
          snbounds(length(grp)-1)
          print "\nSubnet from " dec2ip(grp[1]) " to " dec2ip(grp[have-1]) " " have - 1 " IP(s)"
          print "Mask " dec2ip(sn_min) "/" sn_mask
          have=split(grp[have], grp)
       }
    }
    if (have) {
          snbounds(length(grp))
          print "\nSubnet from " dec2ip(grp[1]) " to " dec2ip(grp[have]) " " have " IP(s)"
          print "Mask " dec2ip(sn_min) "/" sn_mask
    }
}

Test file example:

$ cat infile
255.20.19.0
10.10.1.25
10.10.2.16
10.10.1.45
192.168.1.129
192.168.1.166
192.168.1.133
10.10.3.30
192.168.1.188
10.10.3.29
10.10.2.20
220.16.53.1
10.10.3.31
10.10.3.16
$ awk -f rgopichand.awk infile

Subnet from 10.10.1.25 to 10.10.3.31 8 IP(s)
Mask 10.10.0.0/22

Subnet from 192.168.1.129 to 192.168.1.188 4 IP(s)
Mask 192.168.1.128/26

Subnet from 220.16.53.1 to 220.16.53.1 1 IP(s)
Mask 220.16.53.1/32

Subnet from 255.20.19.0 to 255.20.19.0 1 IP(s)
Mask 255.20.19.0/32