Hi,
Recently I had to convert 280K lines of IP ranges to CIDR notation and generate a file to be used by ipset (netfilter) for IP filtering.
Input file:
000.000.000.000 - 000.255.255.255 , 000 , invalid ip
001.000.064.000 - 001.000.127.255 , 000 , XXXXX
001.000.245.123 - 001.000.245.123 , 000 , YYYYYY YYYYY
001.002.002.000 - 001.002.002.255 , 000 , ZZZ ZZZ ZZ
001.002.004.000 - 001.002.004.255 , 000 , AAAA AA
Some of the ranges contain a single IP.
Required output:
-N cidr nethash --maxelem 260000
-N single iphash --maxelem 60000
-A cidr 0.0.0.0/8
-A cidr 1.0.64.0/18
-A single 1.0.245.123
-A cidr 1.2.2.0/24
-A cidr 1.2.4.0/24
COMMIT
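To see why the second sample range ends up as /18: a range that starts on a block boundary and holds exactly 2^n addresses collapses to a single /(32-n) prefix. A minimal sketch of that arithmetic, using only Python's standard library (the helper name `prefix_for_aligned_range` is made up for this illustration):

```python
import ipaddress

def prefix_for_aligned_range(first, last):
    """Return the prefix length if the range is exactly one aligned
    power-of-two block, otherwise None."""
    start = int(ipaddress.IPv4Address(first))
    end = int(ipaddress.IPv4Address(last))
    size = end - start + 1
    # size must be a power of two and start must be a multiple of size
    if size & (size - 1) or start % size:
        return None
    return 32 - (size.bit_length() - 1)

print(prefix_for_aligned_range("1.0.64.0", "1.0.127.255"))  # 16384 addresses
print(prefix_for_aligned_range("1.2.2.0", "1.2.2.255"))     # 256 addresses
```

Ranges that do not fit this pattern have to be split into several blocks, which is the hard part of the conversion.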
As I got nowhere with awk (the CIDR conversion being the culprit), I found a solution with Python and its netaddr module:
#!/usr/bin/python3
"""
Usage: ip2cidr.py input_file
"""
import sys, re, netaddr
def sanitize(ip):
    seg = ip.split('.')
    return '.'.join([str(int(v)) for v in seg])
# pointer to input file
fp_source = open(sys.argv[1], "r")
# pointer to outfile
fp_outfile = open('ip.ipset', "w")
ptrnSplit = re.compile(' - | , ')
# Write ipset header to outfile
fp_outfile.write('-N cidr nethash --maxelem 260000\n-N single iphash --maxelem 60000\n',)
for line in fp_source:
    # split on ' - ' and ' , '
    s = re.split(ptrnSplit, line)
    # sanitize ip: 001.004.000.107 --> 1.4.0.107 to avoid netaddr err.
    ip = [sanitize(v) for v in s[:2]]
    # conversion ip range to CIDR netblocks
    # single ip in range
    if ip[0] == ip[1]:
        fp_outfile.write('-A single %s\n' % ip[0])
    # multiple ip's in range
    else:
        ipCidr = netaddr.IPRange(ip[0], ip[1])
        for cidr in ipCidr.cidrs():
            fp_outfile.write('-A cidr %s\n' % cidr)
fp_outfile.write('COMMIT\n')
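For readers without netaddr installed, the same per-line conversion can probably be done with the standard library's `ipaddress.summarize_address_range`. This is an untimed sketch, not the script used for the figures in this post; the function name `range_to_ipset_lines` is an assumption for illustration.

```python
import ipaddress

def sanitize(ip):
    """Strip leading zeros: '001.000.064.000' -> '1.0.64.0'."""
    return ".".join(str(int(part)) for part in ip.split("."))

def range_to_ipset_lines(first, last):
    """Return the ipset '-A' lines for one range (hypothetical helper)."""
    start = ipaddress.IPv4Address(sanitize(first))
    end = ipaddress.IPv4Address(sanitize(last))
    if start == end:
        # a range holding one address goes into the 'single' set
        return ["-A single %s" % start]
    # summarize_address_range yields the networks covering start..end
    nets = ipaddress.summarize_address_range(start, end)
    return ["-A cidr %s" % net for net in nets]

for line in range_to_ipset_lines("001.000.064.000", "001.000.127.255"):
    print(line)
```

The sanitize step is kept because recent Python versions reject octets with leading zeros.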
Time to process the 280K ip ranges: 4 minutes.
As I found that time on the high side and had a couple of days off, I decided to give awk another try:
@include "lib_netaddr.awk"
function sanitize(ip,    slice) {
    split(ip, slice, ".")
    return slice[1]/1 "." slice[2]/1 "." slice[3]/1 "." slice[4]/1
}
BEGIN {
    FS = " , | - "
    # print already appends a newline, so none is needed at the end
    print "-N cidr nethash --maxelem 260000\n-N single iphash --maxelem 60000"
}
# sanitize ip's
{$1 = sanitize($1); $2 = sanitize($2)}
# range with a single IP
$1==$2 {printf "-A single %s\n", $1}
# ranges with multiple IP's
$1!=$2{print range2cidr(ip2dec($1), ip2dec($2))}
# footer
END {print "COMMIT"}
lib_netaddr.awk
#
# Library with various ip manipulation functions
#
# convert ip ranges to CIDR notation
# str range2cidr(ip2dec("192.168.0.15"), ip2dec("192.168.5.115"))
#
# Credit to Chubler_XL for this brilliant function. (see his post below for non GNU awk)
#
function range2cidr(ipStart, ipEnd,    bits, mask, newip, result) {
    bits = 1
    mask = 1
    result = "-A cidr "
    # grow the block while it stays aligned on ipStart and inside the range
    while (bits <= 32) {
        newip = or(ipStart, mask)
        if ((newip > ipEnd) || (lshift(rshift(ipStart, bits), bits) != ipStart))
            break
        bits++
        mask = lshift(mask, 1) + 1
    }
    # step back to the last block size that fitted
    bits--
    mask = rshift(mask, 1)
    newip = or(ipStart, mask)
    bits = 32 - bits
    result = result dec2ip(ipStart) "/" bits
    # anything left after this block is converted recursively
    if (newip < ipEnd) result = result "\n" range2cidr(newip + 1, ipEnd)
    return result
}
# convert dotted quads to long decimal ip
# int ip2dec("192.168.0.15")
#
function ip2dec(ip,    slice) {
    split(ip, slice, ".")
    return (slice[1] * 2^24) + (slice[2] * 2^16) + (slice[3] * 2^8) + slice[4]
}
# convert decimal long ip to dotted quads
# str dec2ip(1171259392)
#
function dec2ip(dec,    ip, quad, i) {
    for (i = 3; i >= 1; i--) {
        quad = 256^i
        ip = ip int(dec/quad) "."
        dec = dec % quad
    }
    return ip dec
}
# convert decimal ip to binary
# str dec2binary(1171259392)
#
function dec2binary(dec,    bin) {
    while (dec > 0) {
        bin = dec%2 bin
        dec = int(dec/2)
    }
    return bin
}
# Convert binary ip to decimal
# int binary2dec("1000101110100000000010011001000")
#
function binary2dec(bin,    slice, l, dec, i) {
    # split into single characters (an empty separator does this in gawk)
    split(bin, slice, "")
    l = length(bin)
    for (i = l; i > 0; i--) {
        dec += slice[i] * 2^(l-i)
    }
    return dec
}
# convert dotted quad ip to binary
# str ip2binary("192.168.0.15")
#
function ip2binary(ip) {
    return dec2binary(ip2dec(ip))
}
# count the number of ip's in a dotted quad ip range
# (returns end - start; add 1 to include both ends)
# int countQuadIp("192.168.0.0", "192.168.1.255") + 1
#
function countQuadIp(ipStart, ipEnd) {
    return (ip2dec(ipEnd) - ip2dec(ipStart))
}
# count the number of ip's in a CIDR block
# int countCidrIp ("192.168.0.0/12")
#
function countCidrIp(cidr) {
    sub(/.+\//, "", cidr)
    return 2^(32-cidr)
}
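For anyone who reads Python more easily than awk, the idea behind range2cidr can be sketched as below: starting at the first address, take the largest block that is both aligned on that address and still inside the range, emit it, then continue right after it. This is an iterative illustration of the same greedy approach, not a line-by-line port, and it returns bare CIDR strings without the "-A cidr " prefix.

```python
def ip2dec(ip):
    """'192.168.0.15' -> integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def dec2ip(dec):
    """Integer -> dotted quad."""
    return ".".join(str((dec >> shift) & 255) for shift in (24, 16, 8, 0))

def range2cidr(start, end):
    """Greedy split of the integer range [start, end] into CIDR blocks."""
    blocks = []
    while start <= end:
        bits = 0  # host bits of the current block
        # grow the block while it stays aligned on start and inside the range
        while bits < 32:
            size = 1 << (bits + 1)
            if start % size or start + size - 1 > end:
                break
            bits += 1
        blocks.append("%s/%d" % (dec2ip(start), 32 - bits))
        start += 1 << bits  # continue right after the emitted block
    return blocks

for block in range2cidr(ip2dec("192.168.0.15"), ip2dec("192.168.5.115")):
    print(block)
```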
Time to process: 16 sec. A whopping 15 times faster! Not bad for a 43-year-old language! And it's even faster with mawk: 7 sec.
Please note that @include only works with gawk. If you are using the original awk or the lightning-fast mawk, you will have to copy/paste the function library into your main script.
If you find this awk library useful or if it needs to be optimized, let me know before I submit it to the Tips & Tutorials section.