awk eating too much memory?

Hi all,

Using awk I am sorting unique data from a file; the file size is 8GB. While running that script, the overall CPU usage will be nearly 8.
How do I avoid this? Is any alternative to awk available?

Thanks in Advance
Anish kumar.V

You didn't post your awk statement. Without seeing it, we cannot tell what's going wrong.

Nearly 8 what? Are you talking about CPU usage or memory now?

I can't tell why your script's doing either without seeing what it is.

If possible, you could use the sort utility to sort; most versions use temp files in an intelligent fashion so as not to overburden the machine with too much memory use.
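For example, something like this (a sketch; the file name is a placeholder, and -S/-T are GNU sort extensions you'd tune to your machine):

LC_ALL=C sort -u -T /var/tmp -S 512M bigfile.txt > sorted.txt   # cap the memory buffer, spill to /var/tmp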

There's no straightforward way to reduce its CPU use; a CPU-intensive operation like sorting is always going to use as much as it can get. But you can easily make it yield to more important things, which is usually just as good: just nice scriptname instead of scriptname.
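For example (the script name and PID here are just placeholders):

nice ./domaincount.sh        # start at reduced priority
renice -n 10 -p 12345        # or lower the priority of one that's already running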

You should probably use 'sleep' somewhere in your code. That would certainly help. But without your code it's hard to say where, or for how long you should sleep.

Too much swapping.

I would consider that bad advice. If a sleep is not mandatory (such as when polling a resource at an interval), don't do it. What if the machine is otherwise idle when the script is run? Or very busy? What if the script is migrated to a different machine with more muscle? Or less? As circumstances change, you'd need to keep tinkering with the sleep time in a futile attempt to achieve optimal system utilization.

It's much better to let the operating system's scheduler handle prioritization. That's what it's there for. nice/renice, as Corona688 suggested, is a much better alternative.

Regards,
Alister

1 Like

Hi all,

Sorry for the late reply. This is my script; we use it for the domain count.

#!/bin/bash

current_date=$(date +%d-%m-%Y_%H.%M.%S)
today=$(date +%d%m%Y)
yesterday=$(date -d 'yesterday' '+%d%m%Y')
RootPath=/var/domaincount/biz/
LOG=/var/tmp/Intelliscanlog/biz/bizcount$current_date.log

cd $RootPath
echo "Intelliscan process started for .BIZ TLD $current_date" >> $LOG

#################################################################################################
## Using wget, download the zone file; it will try only one time
if ! wget --tries=1 --ftp-user=USERNAME --ftp-password=PASSWORD ftp://ftp.URL/zone.gz
then
    echo "Download not successful; domain count failed with error" >> $LOG
    exit 1
fi
### The downloaded file is gzipped; unzip it and start the domain count process ####

gunzip zone.gz
mv zone $RootPath/$today.biz

###### It will start the count #####
awk '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

### Calculation part
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
# c = number of domains present in both today's and yesterday's lists
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.biz $RootPath/zonefile/$yesterday.biz)

echo "$current_date Today Count For BIZ TLD $a" >> $LOG
echo "$current_date New Registration Domain Counts $((a - c))" >> $LOG
echo "$current_date Deleted Domain Counts $((b - c))" >> $LOG
mail -s "BIZ Tld Count log" 07anis@gmail.com < $LOG

Using this script we remove the duplicates from the file and count the unique domains.

It was a typo: my overall load average becomes more than 8, and awk uses the maximum amount of CPU during that time.

I'm assuming the main contenders are these lines:

awk '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}END{print "\nTotal",tot,"Domains"}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

This script can't cause a load average of 8 unless you're running 8 at once. Do you really want this to run slower? Even more of them might pile up.

The second awk is pointless and doubles the workload of your script, since it scans the entire output of the first awk just to find one line. The first awk can easily create that file by itself, on the fly.

awk -v TOTALFILE="$RootPath/$today.count" '/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' $RootPath/$today.biz > $RootPath/zonefile/$today.biz
# no longer needed, the first awk generates it for us
# awk '/Total/ {print $2}' $RootPath/zonefile/$today.biz > $RootPath/$today.count

---------- Post updated at 10:28 AM ---------- Previous update was at 10:19 AM ----------

Depending on how many cores you have, this might complete even faster:

gunzip < zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

Do you even need $RootPath/$today.biz at all anymore? If not, this could be simplified further.
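For instance, it might look like this (untested sketch, same logic minus the extra copy):

gunzip < zone.gz | awk -v TOTALFILE="$RootPath/$today.count" '
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz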

Thanks a lot for your effort, brother.

gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz"
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
} $RootPath/$today.biz > $RootPath/zonefile/$today.biz



###### It will start the Count #####


### Calculation Part
a=$(< $RootPath/$today.count)
b=$(< $RootPath/$yesterday.count)
c=$(awk 'NR==FNR{a[$0];next} $0 in a{tot++}END{print tot}' $RootPath/zonefile/$today.biz $RootPath/zonefile/$yesterday.biz)


echo "$current_date Today Count For BIZ TlD $a" >> $LOG
echo "$current_date New Registration Domain Counts $((c - a))" >> $LOG
echo "$current_date Deleted Domain Counts $((c - b))" >> $LOG
cat $LOG | mail -s "BIZ Tld Count log" 07anis@gmail.com

This is exactly the code I am using, but I get an error when I execute the script. :frowning:

Take a closer look at Corona's post. What you posted is missing single-quotes around the awk script.

Regards,
Alister

To make it clearer:

gunzip < biz.zone.gz | awk -v TOTALFILE="$RootPath/$today.count" -v BIZFILE="$RootPath/$today.biz" '
# print ALL lines into BIZFILE
{ print $0 > BIZFILE }
/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}
END{
        print "\nTotal",tot,"Domains";
        print tot > TOTALFILE
}' > $RootPath/zonefile/$today.biz

I also corrected a mistake in it -- awk needs no input filename when fed by a stream!

1 Like

Three ways of computing the number of deletions, additions, and unchanged entries; experiment with your data and OS to see which is best:

# generate raw data: each line goes to old.raw or new.raw (~55% to new.raw)
awk -v n=1e6 '
BEGIN {
  srand()
  while (--n > 0)
    printf("abc%dzzz\n", n*rand()) > ARGV[1 + (rand() < 0.55)]
  exit
}
' old.raw new.raw

printf "method:\tdeleted\tadded\tunchanged\n"

# method 1: one awk pass, tracking each line's state in a[]
awk '
NR == FNR {
  # first file (old): o counts unique old lines; -1 marks "seen in old"
  if (!($0 in a)) { ++o; a[$0] = -1 }
  next
}
{
  # second file (new): x is the value after incrementing this line's entry
  if ((x = ++a[$0]) > 1) next    # duplicate within new; already counted
  if (x < 1) { ++c; a[$0] = 1 }  # was -1: in old too -> unchanged
  else if (x < 2) ++e            # was unset: only in new -> added
  #print
}
END { printf("awk:\t%d\t%d\t%d\n", o - c, e, c) }  # deleted = o - c
' old.raw new.raw #> n.awku

# method 2: set arithmetic on sorted, deduplicated files
sort -u old.raw > o.sortu
oc=$(wc -l < o.sortu)                     # unique old count
sort -u new.raw > n.sortu
nc=$(wc -l < n.sortu)                     # unique new count
all=$(sort -mu o.sortu n.sortu | wc -l)   # union size (-m merges already-sorted input)
# deleted = union - new, added = union - old, unchanged = old + new - union
printf "sort:\t%d\t%d\t%d\n" $((all-nc)) $((all-oc)) $((oc+nc-all))

# method 3: comm on the sorted unique files from method 2
comm o.sortu n.sortu | awk -F'\t' '
 { if ($1) ++a; else if ($2) ++b; else ++c }  # field 1: only in old; field 2: only in new; else: in both
 END { printf("comm:\t%d\t%d\t%d\n", a, b, c) }'
1 Like

Thanks all for your prompt replies.

Yes, it's working fine, but my problem is the file size.

The file contains records like that; from them, awk picks out only the unique domain names. So even when I used your code (Corona688), it still takes time and load.

First you said it's memory, then CPU time, now file size -- which is your goal here?

Of course it takes time and load. 8 gigabytes of data isn't going to be sorted in a nanosecond.

I asked questions which could be used to further improve the code. Is BIZFILE actually needed for anything, now that you don't need to recalculate the database count? If not, leaving out { print $0 > BIZFILE } will avoid a lot of disk-writing and give some more boost.

I'm not quite following the logic in this awk script:

/^[^ ]+ IN NS/ && !_[$1]++{print $1; tot++}

Absolutely nothing in that domain file snippet of yours contains 'IN NS', so that ought to never match. It doesn't look like the first field is what you're actually interested in anyway. How does this work?
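For reference, that pattern only matches records shaped something like this (a made-up line, not from your file):

EXAMPLE.BIZ. IN NS NS1.SOMEHOST.COM.

where $1, the domain name, is what gets deduplicated and printed.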

---------- Post updated at 10:06 AM ---------- Previous update was at 09:25 AM ----------

I've been trying to think of an awkless way for you; so far, I'm stumped.

Building it in pure C means needing an associative array, i.e. I'm ending up just building a hardcoded implementation of awk. It'd have to be a really good associative array to get the necessary speed -- I bet awk's would be faster.

Building it with other shell commands means piping it through grep and cut before feeding it into a sort -u, and then reprocessing the output to get the record count -- either that, or doing a tee and wc -l. That's a 5-long pipe chain for 8GB of data -- in effect processing 40 gigs of data, not 8... That's not going to be more efficient.
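Spelled out, that chain would be something like this (untested; the file names are placeholders, and the grep pattern mirrors the awk one):

grep '^[^ ]* IN NS' $today.biz | cut -d' ' -f1 | sort -u | tee $today.uniq | wc -l > $today.count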

I could build a C program that does the grep | cut for you, which would let you pipe it directly into sort -u | tee | wc -l. That's only a 4-long pipe chain... Unless you've got 4 cores, that's probably still not better than the script you have now.

awk's flexible enough to do everything in one shot, which is pretty tough to beat.