Determining Word Frequency of Specific Terms

richsark · March 5, 2009, 1:58pm

Hello,
I need a Perl script that reads a .txt file containing lines like:

224.199.207.IN-ADDR.ARPA. IN NS NS1.internet.com.
4.200.162.207.in-addr.arpa. IN PTR beeriftw.internet.com.
arroyoeinternet.com. IN A 200.199.227.49

I want to focus on these record types:
IN NS
IN PTR
IN A
IN CNAME

I would like to get output that looks like:

Total number of NS records =
Total number of PTR records =
Total number of A records =
Total number of CNAME records =

Thanks in advance

jim_mcnamara · March 5, 2009, 2:41pm

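#!/usr/bin/perl
# cnt.pl - count NS, PTR, A and CNAME records
use strict;
use warnings;

# $a is reserved for sort(), so the A counter gets a different name
my ($ns, $ptr, $a_rec, $cname) = (0, 0, 0, 0);

while (<>) {
    # \s+ allows tabs or several spaces; \b stops "IN A" matching "IN AAAA"
    $ns++    if /\bIN\s+NS\b/;
    $ptr++   if /\bIN\s+PTR\b/;
    $a_rec++ if /\bIN\s+A\b/;
    $cname++ if /\bIN\s+CNAME\b/;
}

print "NS    records = $ns\n";
print "PTR   records = $ptr\n";
print "A     records = $a_rec\n";
print "CNAME records = $cname\n";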

usage: cnt.pl < logfile

FWIW this is really not a perl type thing - awk is probably better IMO.

radoulov · March 5, 2009, 3:06pm

Or:

perl -ane'
  $_{$F[2]}++;
  print map "Total number of $_ records:\t$_{$_}\n", 
    keys %_ if eof
  ' infile

With AWK:

(use nawk or /usr/xpg4/bin/awk on Solaris)

awk 'END {
  for (k in _)
    printf "Total number of %s:\t%d\n", k, _[k]
}
{ _[$3]++ }' infile

richsark · March 5, 2009, 6:25pm

I have many zone files or dns zones that contain various record types.

Is it too much to ask to add some finesse to my request?

Example: I could have

db.208.199.11.0

That would contain the below information

224.199.207.IN-ADDR.ARPA. IN NS AIM1.internet.com.
4.200.162.207.in-addr.arpa. IN PTR beeriftw.internet.com.
arroyoeinternet.com. IN A 200.199.227.49

Then another file
db.explorer.com would contain

224.162.207.IN-ADDR.ARPA.       IN NS   pwedns1.internet.com.
224.162.207.IN-ADDR.ARPA.       IN NS   pmedns1.internet.com.
224.162.207.IN-ADDR.ARPA.       IN NS   phedns1.internet.com.
224.162.207.IN-ADDR.ARPA.       IN NS   auth100.ns.aut.net.

So what I am requesting is a master input file that lists these file names, which your script would then count against.

For each file name listed in my input file, the output might look like:

db.208.199.11.0:
Total number of A records = 684
Total number of PTR records = 306
Total number of CNAME records = 58
Total number of NS records = 1352

db.explorer.com:
Total number of A records = 6
Total number of PTR records = 30
Total number of CNAME records = 88
Total number of NS records = 55

So rather than having it look at each .txt file as I originally thought, the script would reference a master input file.

Thanks in advance !

radoulov · March 6, 2009, 3:48am

No need to create a master input file, AWK (or Perl, whichever you prefer) could process multiple input files. So assuming all your files reside in the same directory and all filenames begin with the string db:

awk 'END {
  print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
    f = FILENAME
  }    
{ z[$3]++ }' db*

richsark · March 6, 2009, 7:18am

Hi, thanks for your reply. I ran your code, and the output looks like:

db.255.0.0.0:
Total number of SOA records = 17
Total number of records = 187
Total number of Serial records = 17
Total number of NS records = 17
Total number of Retry records = 17
Total number of OF records = 17
Total number of PTR records = 166
Total number of ; records = 17
Total number of Refresh records = 17
Total number of from: records = 17
Total number of FILE records = 68
Total number of Expire records = 17

It's spitting out a lot of stuff, and I'm not sure what some of it means, like:
Total number of ; records = 17
Total number of Expire records = 17

Where is it getting that from, and can we tweak it?

radoulov · March 6, 2009, 7:20am

Yes,
it seems that not all records have the same format. Could you post a bigger sample of your data that includes records containing the offending patterns (Serial, Retry, Expire etc.)?

Perhaps something like this will be sufficient:

awk 'END {
  print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
    f = FILENAME
  }    
$2 == "IN" { z[$3]++ }' db*

richsark · March 6, 2009, 7:43am

Hi, Can we take out:

Total number of SOA records = 30

I only need counts for the record types below in each db.x file:

PTR
MX
NS
CNAME
A

I guess the code must be smart enough to handle tabs/spaces?

A sample db.x file looks like this:

;
; THIS FILE IS AUTOMATICALLY GENERATED. DO NOT EDIT IT.
; THIS FILE IS AUTOMATICALLY GENERATED. DO NOT EDIT IT.
; THIS FILE IS AUTOMATICALLY GENERATED. DO NOT EDIT IT.
; THIS FILE IS AUTOMATICALLY GENERATED. DO NOT EDIT IT.
;
; generated from: $Id: master.txt,v 2.1230 2009/01/05 22:29:21 root Exp $
;

$TTL 3600

beerprime.com. IN SOA iqedns1.internet.com. hostmaster.beer.com. (
2009010501 ; Serial
900 ; Refresh
300 ; Retry
1209600 ; Expire
3600 ) ; Minimum
beerprime.com. IN NS iqdns1.internet.com.

integ4 IN A 192.168.205.156
beerprime.com. IN A 192.168.205.175
www IN CNAME intg4.beerprime.com.
86.96.168.192.in-addr.arpa. IN PTR sepapp.beerprime.com

;
; END OF beerprime.com
;

Thanks

radoulov · March 6, 2009, 7:45am

It's smart enough: awk splits fields on any run of spaces or tabs by default.
Try this and let me know if the output is OK:

awk 'END {
  print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
    f = FILENAME
  }    
$3 ~ /^(PTR|MX|NS|CNAME|A)$/ { z[$3]++ }' db*
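
For readers following along in Python, here is a minimal sketch of the same idea: split each line on any run of whitespace and keep the third field only when it exactly equals one of the wanted types. The function name `record_type` and the sample list are made up for illustration; the sample lines are taken from the posts above.

```python
# Record types of interest; an exact set lookup behaves like the anchored
# awk regex ^(PTR|MX|NS|CNAME|A)$, so "A" never matches "AAAA" or "SOA".
WANTED = {"PTR", "MX", "NS", "CNAME", "A"}

def record_type(line):
    """Return the third whitespace-separated field if it is a wanted type, else None."""
    fields = line.split()  # like awk's default splitting: any run of spaces or tabs
    if len(fields) >= 3 and fields[2] in WANTED:
        return fields[2]
    return None

samples = [
    "beerprime.com. IN NS iqdns1.internet.com.",
    "224.162.207.IN-ADDR.ARPA.\tIN NS   pwedns1.internet.com.",
    "2009010501 ; Serial",   # third field is "Serial", so it is skipped
    "beerprime.com. IN SOA iqedns1.internet.com. hostmaster.beer.com. (",
]
for s in samples:
    print(record_type(s))    # prints NS, NS, None, None (one per line)
```

The third sample shows where the stray "Serial", "Refresh" and similar entries in the earlier output came from: they are simply the third field of the SOA continuation lines.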

richsark · March 6, 2009, 8:12am

Hi !

OK, it looks like we have a count issue. See db.beerstearns.com:

beerstearns.com. IN SOA iqedns1.internet.com. hostmaster.beer.com. (
2009010501 ; Serial
900 ; Refresh
300 ; Retry
1209600 ; Expire
3600 ) ; Minimum
bearstearns.com. IN NS iqedns1.internet.com.

fbhp IN A 192.168.205.124
futures IN A 192.168.205.165
bigdog IN A 192.168.205.195
bigdog2 IN A 192.168.205.196

; SPECIALS
;
situnifiedportal.bearstearns.com. IN NS whdgss1cnis-pri1.clearco.com.
situnifiedportal.bearstearns.com. IN NS metgss1cnis-sec1.clearco.com.
qa.bearstearns.com. IN NS whdgss1cnis-pri1.clearco.com.
qa.bearstearns.com. IN NS metgss1cnis-sec1.clearco.com.

The output came out as:

db.bearstearns.com:

Total number of CNAME records = 1
Total number of A records = 6
Total number of NS records = 26
Total number of PTR records = 166

There are 4 A records, and I don't see any CNAME records.

I also need a count if it detects the word "Special". So maybe:
Total number of Special records = 4
Sorry, I just noticed that.

radoulov · March 6, 2009, 8:50am

You're right, I have to empty the array at the beginning of every file. Try this one:

awk 'END {
  print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    if (sc) printf "Total number of Special records = %d\n", \
    sc    
    print RS
    split(x, z)
    s = sc = 0
    }
    f = FILENAME
  }    
$3 ~ /^(PTR|MX|NS|CNAME|A)$/ { z[$3]++; s && sc++ }
/SPECIALS/ { s = 1 }' db*

Do you want the special records included in the totals, or do you want a separate count for them?
For db.beerstearns.com, do you want this:

db.beerstearns.com:
Total number of A records = 4
Total number of NS records = 5
Total number of Special records = 4

Or this:

db.beerstearns.com:
Total number of A records = 4
Total number of NS records = 1
Total number of Special records = 4
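
The per-file reset described at the top of this post can also be sketched in Python. This is only an illustration under assumptions, not a translation of the awk above: the function name `per_file_counts` is hypothetical, and it leans on the standard `fileinput` module, whose `isfirstline()` and `filename()` methods play roughly the roles of `FNR == 1` and `FILENAME`.

```python
import fileinput
from collections import Counter

WANTED = {"PTR", "MX", "NS", "CNAME", "A"}

def per_file_counts(paths):
    """Return {filename: Counter}, starting a fresh Counter for each input file."""
    results = {}
    current = None
    with fileinput.input(files=paths) as stream:
        for line in stream:
            if stream.isfirstline():
                # New file: begin an empty tally so counts from the
                # previous file cannot leak into this one.
                current = results[stream.filename()] = Counter()
            fields = line.split()
            if len(fields) >= 3 and fields[2] in WANTED:
                current[fields[2]] += 1
    return results
```

Calling `per_file_counts(["db.one", "db.two"])` (file names invented here) would give one independent tally per file. Note that an empty file never yields a first line, so it would not appear in the result.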

richsark · March 6, 2009, 9:11am

Hi, I would like to have it like your second option:

db.beerstearns.com:

Total number of A records = 4
Total number of NS records = 1
Total number of Special records = 4

radoulov · March 6, 2009, 9:26am

Is this OK?
Do you also want the IN strings (I don't know the exact term :)) for the special records, or is the count sufficient?

awk 'END {
  print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    print RS
    }
FNR == 1 {
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", \
      Z, z[Z]
    if (sc) printf "Total number of Special records = %d\n", \
    sc    
    print RS
    split(x, z)
    s = sc = 0
    }
    f = FILENAME
  }    
$3 ~ /^(PTR|MX|NS|CNAME|A)$/ {  
  if (s) sc++ 
  else z[$3]++
  }
/SPECIALS/ { s = 1 }' db*

richsark · March 6, 2009, 9:30am

Awesome !

Thanks a whole bunch !!

richsark · March 6, 2009, 10:29am

Hi, I just noticed your comment on Specials. I found out that "SPECIALS" may contain MX records, TXT records or any other record type. So if we leave it alone, it won't care what type of record it is and will just give the count, right?

radoulov · March 6, 2009, 10:42am

As it stands, it counts only the PTR, MX, NS, CNAME and A records.
If you want it to count all types of special records containing the IN string, use this code (I also fixed one more bug in the END block):

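awk 'END {
  # Report the last file (no FNR == 1 follows it)
  print f ":"
  for (Z in z)
    printf "Total number of %s records = %d\n", Z, z[Z]
  if (sc) printf "Total number of Special records = %d\n", sc
  print RS
}
FNR == 1 {
  # First line of a new file: report the previous file, then reset
  if (f) {
    print f ":"
    for (Z in z)
      printf "Total number of %s records = %d\n", Z, z[Z]
    if (sc) printf "Total number of Special records = %d\n", sc
    print RS
    split(x, z)   # x is unset, so this empties z
    s = sc = 0
  }
  f = FILENAME
}
# Before SPECIALS: count only the listed record types
$3 ~ /^(PTR|MX|NS|CNAME|A)$/ && !s { z[$3]++ }
# After SPECIALS: count every IN record, whatever its type
s && $2 == "IN" { sc++ }
/SPECIALS/ { s = 1 }' db*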

ShawnMilo · March 6, 2009, 10:54am

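#!/usr/bin/env python3

import re
from collections import Counter

# Capture the record type that follows the IN class field, e.g. "IN NS" -> "NS".
# The pattern is not anchored at the line start, because each record begins
# with a name rather than with IN.
p_types = re.compile(r'\sIN\s+(\w+)')

type_counts = Counter()

# Path kept from the original post; point it at your own file.
with open(r'c:\temp\temp.txt', 'r') as log_lines:
    for line in log_lines:
        match = p_types.search(line)
        if match:
            type_counts[match.group(1)] += 1

for log_type, count in type_counts.items():
    print("Total number of %s records: %d" % (log_type, count))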

richsark · March 6, 2009, 11:06am

OK, So from my previous observation, the count for this particular zone came out like so:

db.local.internet.com:
Total number of CNAME records = 23
Total number of A records = 444
Total number of NS records = 6
Total number of Special records = 162

The counts of A records and Special records for this zone seem off.

The SPECIALS section has mixed record types: MX, NS and A.

So perhaps the count from SPECIALS should remain separate from the master count.

I also noticed that there are MX records under SPECIALS. Do you think they were also included in the overall SPECIALS count?

I guess it's hard to tell from this script what is counted where, unless you can think of a better way to report on this. I am just throwing ideas out.

What do you think?

radoulov · March 6, 2009, 11:25am

Let me clarify,
the last version of the code implements the following logic:

For every input file:

1.1. If the third field matches the following regular expression:

^(PTR|MX|NS|CNAME|A)$

Which means that the third field exactly matches one of the following strings:

PTR or MX or NS or CNAME or A

1.2. AND we've not yet reached the SPECIALS section:

!s

1.3. We count each occurrence of the value of the third field (building the associative array z).

{ z[$3]++ }

Once we have reached the SPECIALS section (s) AND the second field exactly matches the string IN, we count every record, whatever its type:

s && $2 == "IN" { sc++ }

2.1. The following code marks the beginning of the SPECIALS section; the flag is reset at the beginning of every file:

/SPECIALS/ { s = 1 }

Is the above logic clear and correct?
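
The logic spelled out above can be sketched in Python for a single file. This is an illustrative restatement, not the awk script itself; the name `tally_zone` is hypothetical, and the comments map each variable back to its awk counterpart (`z`, `sc`, `s`).

```python
from collections import Counter

WANTED = {"PTR", "MX", "NS", "CNAME", "A"}

def tally_zone(lines):
    """Return (counts, special_count) for the lines of one zone file."""
    counts = Counter()     # plays the role of the awk array z
    special_count = 0      # sc
    in_specials = False    # s, false at the start of every file
    for line in lines:
        fields = line.split()
        if not in_specials:
            # Steps 1.1 to 1.3: wanted type in field 3, SPECIALS not yet seen.
            if len(fields) >= 3 and fields[2] in WANTED:
                counts[fields[2]] += 1
        elif len(fields) >= 2 and fields[1] == "IN":
            # Inside SPECIALS: every IN record counts, whatever its type.
            special_count += 1
        # Checked last, as in the awk rule order, so the marker line
        # itself is never counted.
        if "SPECIALS" in line:
            in_specials = True
    return counts, special_count
```

Fed the db.beerstearns.com excerpt posted earlier (one NS and four A records before `; SPECIALS`, four NS records after it), this returns counts of NS = 1 and A = 4 plus a special count of 4, which matches the output that was agreed on.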

richsark · March 6, 2009, 11:37am

Yes, it's clear now. I just wasn't sure. Thanks for the details, radoulov.

I guess I have deviated from my original request, since this has been learn-as-you-go for me.

If it's not too much to ask, could we add a per-type breakdown under the SPECIALS section so we know what it contains? What I have in mind looks like:

db.xyz
Total number of CNAME records = 23
Total number of A records = 23
Total number of NS records = 6
-----------------------------------------------
Total number of Special records = 8 <<<<<<< total from below
Total number of A records in Special = 3
Total number of NS records in Special = 2
Total number of MX records in Special = 2
Total number of PTR records in Special = 1

Again, thanks for all your efforts !!
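
The thread ends before this last request is answered. One way the requested breakdown could be approached, shown here only as a hedged Python sketch and not as anything posted by the participants, is to keep a second tally keyed by record type for everything after the SPECIALS marker. The names `tally_with_breakdown` and `format_report` are invented for illustration, and the report layout simply imitates the sample above.

```python
from collections import Counter

WANTED = {"PTR", "MX", "NS", "CNAME", "A"}

def tally_with_breakdown(lines):
    """Return (main, special): per-type Counters before and after SPECIALS."""
    main, special = Counter(), Counter()
    in_specials = False
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            if in_specials:
                if fields[1] == "IN":
                    special[fields[2]] += 1   # any record type is tallied here
            elif fields[2] in WANTED:
                main[fields[2]] += 1
        if "SPECIALS" in line:                # marker line itself is not counted
            in_specials = True
    return main, special

def format_report(name, main, special):
    """Build the report text for one file in the layout sketched in the post."""
    out = [name + ":"]
    out += ["Total number of %s records = %d" % kv for kv in main.items()]
    if special:
        out.append("-" * 47)
        out.append("Total number of Special records = %d" % sum(special.values()))
        out += ["Total number of %s records in Special = %d" % kv
                for kv in special.items()]
    return "\n".join(out)
```

The Special total is derived from the breakdown with `sum(special.values())`, so the two can never disagree.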