awk eating too much memory?

Hi all,

Using awk I am sorting unique data from a file; the file size is 8GB. While running that script, the overall CPU usage will be nearly 8.
How do I avoid this? Is any alternative to awk available?

Thanks in Advance
Anish kumar.V

You didn't post your awk statement. Without seeing it, we cannot tell what's going wrong.

Nearly 8 what? Are you talking about CPU usage or memory now?

I can't tell why your script's doing either without seeing what it is.

If possible, you could use the sort utility to sort; most versions use temp files in an intelligent fashion so as not to overburden the machine with too much memory use.
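For example, something like this (a sketch; the file name is a placeholder, and -S/-T are GNU sort extensions you'd tune to your machine):

LC_ALL=C sort -u -T /var/tmp -S 512M bigfile.txt > sorted.txt   # cap the memory buffer, spill to /var/tmp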

There's no straightforward way to reduce its CPU use; a CPU-intensive operation like sorting is always going to use as much as it can get. But you can easily make it yield to more important things, which is usually just as good: just nice scriptname instead of scriptname.
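For example (the script name and PID here are just placeholders):

nice ./domaincount.sh        # start at reduced priority
renice -n 10 -p 12345        # or lower the priority of one that's already running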

You should probably use 'sleep' somewhere in your code. That would certainly help. But without your code it's hard to say where, or for how long you should sleep.

Too much swapping.

I would consider that bad advice. If a sleep is not mandatory (such as when polling a resource at an interval), don't do it. What if the machine is otherwise idle when the script is run? Or very busy? What if the script is migrated to a different machine with more muscle? Or less? As circumstances change, you'd need to keep tinkering with the sleep time in a futile attempt to achieve optimal system utilization.

It's much better to let the operating system's scheduler handle prioritization. That's what it's there for. nice/renice, as Corona688 suggested, is a much better alternative.

Regards,
Alister

1 Like

Hi all,

Sorry for the late reply. This is my script; we use it for the domain count.

#!/bin/bash

current_date=$(date +%d-%m-%Y_%H.%M.%S)
today=$(date +%d%m%Y)
yesterday=$(date -d 'yesterday' '+%d%m%Y')
RootPath=/var/domaincount/biz/
LOG=/var/tmp/Intelliscanlog/biz/bizcount$current_date.log

cd $RootPath
echo "Intelliscan process started for .BIZ TLD $current_date" >> $LOG

#################################################################################################
## Using wget, download the zone file; it will try only one time
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.URL/zone.gz
then
    echo "Download not successful; domain count failed with error" >> $LOG
    exit 1
fi
### The downloaded file is gzipped; unzip it and start the domain count process ####

gunzip zone.gz
mv zone $RootPath/$today.biz

###### It will start the count #####
awk '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

### Calculation part
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
# c = number of domains present in both today's and yesterday's lists
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.biz $RootPath/zonefile/$yesterday.biz)

echo "$current_date Today Count For BIZ TLD $a" >> $LOG
echo "$current_date New Registration Domain Counts $((a - c))" >> $LOG
echo "$current_date Deleted Domain Counts $((b - c))" >> $LOG
mail -s "BIZ Tld Count log" 07anis@gmail.com < $LOG

Using this script we remove the duplicates from the file and count the unique domains.

It was a typo: my overall load average becomes more than 8, and awk uses the maximum amount of CPU during that time.

I'm assuming the main contenders are these lines:

awk '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

This script can't cause a load average of 8 unless you're running 8 at once. Do you really want this to run slower? Even more of them might pile up.

The second awk is pointless and doubles the workload of your script, since it scans the entire output of the first awk just to find one line. The first awk can easily create that file by itself, on the fly.

awk -v TOTALFILE="$RootPath/$today.count" '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
# no longer needed, the first awk generates it for us
# awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

---------- Post updated at 10:28 AM ---------- Previous update was at 10:19 AM ----------

Depending on how many cores you have, this might complete even faster:

gunzip < zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

Do you even need $RootPath/$today.biz at all anymore? If not, this could be simplified further.
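For instance, it might look like this (untested sketch, same logic minus the extra copy):

gunzip < zone.gz | awk -v TOTALFILE="$RootPath/$today.count" '
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz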

Thanks a lot for your effort, brother.

gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz"
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
} $RootPath/$today.biz > $RootPath/zonefile/$today.biz



###### It will start the Count #####


### Calculation Part
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.biz $RootPath/zonefile/$yesterday.biz)


echo "$current_date Today Count For BIZ TlD $a" >> $LOG
echo "$current_date New Registration Domain Counts $((c - a))" >> $LOG
echo "$current_date Deleted Domain Counts $((c - b))" >> $LOG
cat $LOG | mail -s "BIZ Tld Count log" 07anis@gmail.com

This is exactly the code I am using, but I get an error when I execute the script. :frowning:

Take a closer look at Corona's post. What you posted is missing single-quotes around the awk script.

Regards,
Alister

To make it clearer:

gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

I also corrected a mistake in it -- awk needs no input filename when fed by a stream!

1 Like

Three ways of computing the number of deletions, additions, and unchanged entries; experiment with your data and OS to see which is best:

# generate raw data: each line goes to old.raw or new.raw (~55% to new.raw)
awk -v n=1e6 '
BEGIN {
  srand()
  while (--n > 0)
    printf("abc%dzzz\n", n*rand()) > ARGV[1 + (rand() < 0.55)]
  exit
}
' old.raw new.raw

printf "method:\tdeleted\tadded\tunchanged\n"

# method 1: one awk pass, tracking each line's state in a[]
awk '
NR == FNR {
  # first file (old): o counts unique old lines; -1 marks "seen in old"
  if (!($0 in a)) { ++o; a[$0] = -1 }
  next
}
{
  # second file (new): x is the value after incrementing this line's entry
  if ((x = ++a[$0]) > 1) next    # duplicate within new; already counted
  if (x < 1) { ++c; a[$0] = 1 }  # was -1: in old too -> unchanged
  else if (x < 2) ++e            # was unset: only in new -> added
  #print
}
END { printf("awk:\t%d\t%d\t%d\n", o - c, e, c) }  # deleted = o - c
' old.raw new.raw #> n.awku

# method 2: set arithmetic on sorted, deduplicated files
sort -u old.raw > o.sortu
oc=$(wc -l < o.sortu)                     # unique old count
sort -u new.raw > n.sortu
nc=$(wc -l < n.sortu)                     # unique new count
all=$(sort -mu o.sortu n.sortu | wc -l)   # union size (-m merges already-sorted input)
# deleted = union - new, added = union - old, unchanged = old + new - union
printf "sort:\t%d\t%d\t%d\n" $((all-nc)) $((all-oc)) $((oc+nc-all))

# method 3: comm on the sorted unique files from method 2
comm o.sortu n.sortu | awk -F'\t' '
 { if ($1) ++a; else if ($2) ++b; else ++c }  # field 1: only in old; field 2: only in new; else: in both
 END { printf("comm:\t%d\t%d\t%d\n", a, b, c) }'
1 Like

Thanks all for your prompt replies.

Yes, it's working fine, but my problem is the file size.

The file contains records like that; from them, awk picks out only the unique domain names. So even when I used your code (Corona688), it still takes time and load.

First you said it's memory, then CPU time, now file size -- which is your goal here?

Of course it takes time and load. 8 gigabytes of data isn't going to be sorted in a nanosecond.

I asked questions which could be used to further improve the code. Is BIZFILE actually needed for anything, now that you don't need to recalculate the database count? If not, leaving out { print $0 > BIZFILE } will avoid a lot of disk-writing and give some more boost.

I'm not quite following the logic in this awk script:

/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}

Absolutely nothing in that domain file snippet of yours contains 'IN NS', so that ought to never match. It doesn't look like the first field is what you're actually interested in anyway. How does this work?
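For reference, that pattern only matches records shaped something like this (a made-up line, not from your file):

EXAMPLE.BIZ. IN NS NS1.SOMEHOST.COM.

where $1, the domain name, is what gets deduplicated and printed.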

---------- Post updated at 10:06 AM ---------- Previous update was at 09:25 AM ----------

I've been trying to think of an awkless way for you; so far, I'm stumped.

Building it in pure C means needing an associative array, i.e. I'm ending up just building a hardcoded implementation of awk. It'd have to be a really good associative array to get the necessary speed -- I bet awk's would be faster.

Building it with other shell commands means piping it through grep and cut before feeding it into a sort -u, and then reprocessing the output to get the record count -- either that, or doing a tee and wc -l. That's a 5-long pipe chain for 8GB of data -- in effect processing 40 gigs of data, not 8... That's not going to be more efficient.
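Spelled out, that chain would be something like this (untested; the file names are placeholders, and the grep pattern mirrors the awk one):

grep '^[^ ]* IN NS' $today.biz | cut -d' ' -f1 | sort -u | tee $today.uniq | wc -l > $today.count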

I could build a C program that does the grep | cut for you, which would let you pipe it directly into sort -u | tee | wc -l. That's only a 4-long pipe chain... Unless you've got 4 cores, that's probably still not better than the script you have now.

awk's flexible enough to do everything in one shot, which is pretty tough to beat.