awk script (complex)

slashbash · March 11, 2012, 3:33pm

picked this up from another thread.

echo 1st_file.csv; nawk -F, 'NR==FNR{a[$1OFS$2OFS$3]++;next} a[$1OFS$2OFS$3]{b[$1OFS$2OFS$3]++}
END{for(i in b){if(b[i]-1&&a[i]!=b[i]){print i";\t\t"b[i]}else{print "NEW:"i";\t\t"b[i]} } }' OFS=, 1st_file.csv *.csv | sort -r

I need to use the above, but with a slight modification:

1. Compare against 3 months of records, i.e. the 1st file (08_03_2012) is compared to all files from the previous months back to 08_12_2011; any line repeating more than 20 times is printed off with its incremented total.

2. Continue as above, comparing 08_03_2012 with all file records; anything not matching is printed off as new.

shamrock · March 11, 2012, 3:35pm

Post a sample of the input and output...

slashbash · March 11, 2012, 3:42pm

above script is the input

NEW:NREE_CISCO3750,10,2          2
NEW:NPGG_CISCO3750,10,4	         1
AREE_CISCO3750,11,7		62
HHGE_CISCO3750,18,7		11
IIUE_CISCO3750,20,2		27
MRTAN_CISCO3750,17,4		84

Chubler_XL · March 11, 2012, 5:21pm

You should take a deep breath and start again. Pretend we can't see the files you're talking about and know nothing about what they contain or represent.

For a start, how is the awk script supposed to determine the date for a file (e.g. you said 1st file = 08_03_2012)?

When you say "comparing 08_03_2012 with all file records" and "any lines repeating", are you just talking about the contents of fields 1, 2 and 3?

slashbash · March 11, 2012, 5:58pm

ok, the cron produces the files with the latest date stamp i.e

ciscostats_11032012
ciscostats_10032012
ciscostats_09032012

can i ask awk to compare i.e

ciscostats_11032012; ciscostats_11012012

Chubler_XL · March 11, 2012, 7:44pm

awk can do this but it would be easier to use the OS/shell tools to cut down the filelist to those that need to be processed first.

What OS and shell are you using? Does your date command support the -d option? e.g.:

$ date -d "20120320 - 90 days" +%Y%m%d
20111221

slashbash · March 11, 2012, 7:51pm

Solaris 8

yes

# date -d
date: illegal option -- d
usage:  date [-u] mmddHHMM[[cc]yy][.SS]
        date [-u] [+format]
        date -a [-]sss[.fff]

Chubler_XL · March 11, 2012, 9:45pm

Since Solaris 8 doesn't offer many options for date calculations, you will probably end up having to use perl (or C) to do this one.

If you have any control of the cron jobs, I'd suggest swapping the date format around to YYYYMMDD. This would make things a lot easier.

slashbash · March 11, 2012, 9:50pm

thanks for your help but it will not be possible for me to do this as I do not have control of the cron.

Chubler_XL · March 11, 2012, 11:13pm

This might be close to what you need.

It compares the newest file against the 50 most recent and any record that appears in more than 20 files is printed:

cd /path/to/cisco/logs
files=`ls ciscostats_* | sort -t_ -k2.5 -k2.3,2.4 -k2.1,2.2 | tail -50`
first=`echo "$files" | tail -1`
 
nawk -F, 'NR==FNR{a[$1","$2","$3]++;next}
a[$1","$2","$3]{b[$1","$2","$3]++}
END{for(i in b)if(b[i]>20)print i";\t\t"b[i]}' $first $files
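
A small sketch of what the sort keys above do, using three made-up names in the ciscostats_DDMMYYYY pattern. A plain alphabetical sort would put 31122011 last; sorting on year, then month, then day puts the files in date order:

```shell
# Hypothetical file names, deliberately out of date order.
names="ciscostats_02012012
ciscostats_31122011
ciscostats_01012012"

# Field 2 (after the "_") is DDMMYYYY: sort on the year (chars 5 to end),
# then the month (chars 3-4), then the day (chars 1-2).
sorted=`echo "$names" | LC_ALL=C sort -t_ -k2.5 -k2.3,2.4 -k2.1,2.2`
oldest=`echo "$sorted" | head -1`
newest=`echo "$sorted" | tail -1`

echo "$sorted"
echo "oldest: $oldest  newest: $newest"
```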

slashbash · March 11, 2012, 11:15pm

Thanks Chubler but I still need it to compare against all other files and any lines not matching it prints as new, see my second point.

Chubler_XL · March 11, 2012, 11:40pm

OK give this a try, you can change the value LOOK to control how many files are searched and MATCH for how many must match.

cd /path/to/cisco/logs
files=`ls ciscostats_* | sort -t_ -k2.5 -k2.3,2.4 -k2.1,2.2`
first=`echo "$files" | tail -1`
 
awk -F, -vLOOK=50 -vMATCH=20 '
  FNR==1{F++}F==1{a[$1","$2","$3]++;next}
  a[$1","$2","$3]&&F<LOOK{b[$1","$2","$3]++}
  a[$1","$2","$3]{c[$1","$2","$3]++}
  END{for(i in c)if(b[i]>MATCH)print i";\t\t"b[i];else if(c[i]-1)print "NEW:"i";\t\t"b[i]}' $first $files

slashbash · March 11, 2012, 11:50pm

im just getting the new output with that code, no >20

Chubler_XL · March 12, 2012, 12:07am

Are the numbers you're getting on the NEW: lines bigger than 20? I'm still a bit confused about what NEW lines should be. If it's just records that only appear in the most recent file, then this might work better:

cd /path/to/cisco/logs
files=`ls ciscostats_* | sort -t_ -k2.5 -k2.3,2.4 -k2.1,2.2`
first=`echo "$files" | tail -1`
 
awk -F, -vLOOK=50 -vMATCH=20 '
  FNR==1{F++}F==1{a[$1","$2","$3]++;next}
  a[$1","$2","$3]&&F<LOOK{b[$1","$2","$3]++}
  a[$1","$2","$3]{c[$1","$2","$3]++}
  END{for(i in c)if(b[i]>MATCH)print i";\t\t"b[i];else if(c[i]==a[i])print "NEW:"i";\t\t"c[i]}' $first $files

slashbash · March 12, 2012, 1:34am

yes and no.

8 of the 10 were >20

however

2 out of the 10 were <20 but were genuinely new

so it looks like it's printing everything off as new.

also comparing against the initial code in my first post on this thread it has missed off one genuine new line, I would say this is down to the -vLOOK=50 variable

having amended the LOOK to 6000 it has still missed the line out.

---------- Post updated at 05:34 AM ---------- Previous update was at 04:26 AM ----------

interestingly the "missing" new line appears with your new code, but now only this line appears?

by a new line, I mean

if(b[i]-1&&a[i]!=b[i])

if(b[i]-1)
## true if the count held in b[i], minus 1, is non-zero
## so there must be more than one record
a[i]!=b[i]
## true if the count in b[i] and the count in a[i] are not equal
## this checks whether there is a record in the other files
## if they are not equal then there is a record in the other files
## so it is an OLD line
## else it will be a NEW line

Chubler_XL · March 12, 2012, 3:54am

OK think I have it now:

cd /path/to/cisco/logs
files=`ls ciscostats_* | sort -t_ -k2.5r -k2.3,2.4r -k2.1,2.2r`
awk -F, -vLOOK=50 -vMATCH=20 '
   FNR==1{F++}F==1{a[$1","$2","$3]++;next}
   {i=$1","$2","$3;if(!(i in a))next}
   F<=LOOK{b[i]++}
   {c[$1","$2","$3]++}
   END{for(i in a)if(b[i]>0&&a[i]+b[i]>=MATCH){print i";\t\t"a[i]+b[i]}else if(c[i]+0==0)print "NEW:"i";\t\t"a[i]}' $files

slashbash · March 12, 2012, 4:13pm

no.

I am getting all new lines printed off as 1; some of these could be more than 1, for example 2+, and still be new. Plus, in the code we are not comparing against all records for new lines to be printed off, just against 50 (I know this is variable, but could we not incorporate this check?).

The first script has it to a T, i.e. it compares the current file against everything and then prints off new lines OK. The only problem is that I also need it to check the current file against 3 months' worth of files and then print off anything >20.

Chubler_XL · March 12, 2012, 5:31pm

Perhaps I'm misunderstanding your requirement.

I used LOOK=2 and MATCH=3 for these files:

*** ciscostats_08032012 ***
B,2,1
C,1,1
D,5,5
*** ciscostats_09032012 ***
B,2,1
B,2,1
*** ciscostats_10032012 ***
A,1,1
A,1,1
B,2,1
D,5,5

and this is the output I get/expect:

B,2,1;          3
NEW:A,1,1;              2

If this is wrong, perhaps you could supply a sample file set with low MATCH/LOOK counts that demonstrates what you want.

slashbash · March 12, 2012, 6:56pm

nawk -F, 'NR==FNR{a[$1OFS$2OFS$3]++;next} a[$1OFS$2OFS$3]{b[$1OFS$2OFS$3]++}
END{for(i in b){if(b[i]-1&&a[i]!=b[i]){print i";\t\t"b[i]}else{print "NEW:"i";\t\t"b[i]} } }' OFS=, ciscostats_10032012 *.csv | sort -r

above code compares all file lines with NR==FNR

old repeat lines are dumped into array b where indexed lines are incremented.

It also prints off any new indexed lines in array a with an increment after comparing to array b, where no match is found then it must be new.
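
To make the NR==FNR part concrete, here is a stripped-down sketch with two throwaway files (the names new.csv and old.csv and their contents are invented for the example). While NR==FNR, awk is still on the first file, so those lines fill a[]; every later file only bumps b[] for keys already in a[]. Unlike the command above, the first file is not read a second time through a glob:

```shell
# Scratch directory with two hypothetical input files.
dir=/tmp/nrfnr_demo.$$
mkdir "$dir" || exit 1
printf 'A,1,1\nB,2,1\n' > "$dir/new.csv"
printf 'B,2,1\nB,2,1\nC,3,3\n' > "$dir/old.csv"

result=`awk -F, '
    { k = $1 "," $2 "," $3 }      # key built from the first three fields
    NR == FNR { a[k]++; next }    # first file only: count each key
    k in a    { b[k]++ }          # later files: count keys seen in the first file
    END { for (i in a) print i ": " a[i] " in newest, " (b[i] + 0) " elsewhere" }
' "$dir/new.csv" "$dir/old.csv" | sort`

echo "$result"
rm -rf "$dir"
```

A key with 0 matches elsewhere (A,1,1 here) is what the thread calls a new line.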

I think we can modify both these scripts to serve the purpose. My only question is: can we run the scripts simultaneously? That is what I want.

i.e. the script above can be modified to produce only the new lines, and we can remove some of the unnecessary bits, i.e. the repeat incremental lines from array b (but we probably still need to keep this array in order to do the new line comparison with array a, if you understand the logic).

We can use your script with the LOOK and MATCH variables to compare the last 3 months of records for anything >20.

I think this is possible.

Chubler_XL · March 12, 2012, 9:50pm

The code you supplied produces the following output for my 3 test files:

D,5,5;          2
B,2,1;          4
NEW:A,1,1;              2

We have simplified your requirement (1.) to "only look at the first 2 files" (ie with a LOOK value of 2) and this will change the output to:

NEW:D,5,5;              1
B,2,1;          3
NEW:A,1,1;              2

Requirement (2.) that NEW should check all available files (i.e. ciscostats_08032012 is checked as well) will produce:

B,2,1;          3
NEW:A,1,1;              2

This is because "D,5,5" is in ciscostats_08032012, so it's not new.

This matches the output of the script I supplied in post #16. You have said that #16 is wrong, but I still can't see what it's doing that you don't like.

---------- Post updated at 11:50 AM ---------- Previous update was at 09:27 AM ----------

Looking back over this thread, I suspect you are reading the code I have supplied and deciding it's not doing what you want, rather than trying it out with actual data, so it's probably time for me to explain what it does:

$files is populated with a list of data files with the most recent first eg:
ciscostats_02012012
ciscostats_01012012
ciscostats_31122011

a[] contains a count of how many times each ID appears in the first (most recent) file.

b[] contains a count of how many times an ID from a[] appears in files 2 thru LOOK

c[] contains a count of how many times an ID from a[] appears in any other file

At the end we print any ID that appears in both a[] and b[] and has an a[]+b[] count >= MATCH;
otherwise, a "NEW" record is output if the ID appears in a[] but not in c[].
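
The same logic spelled out with comments, as a sketch run against the three small test files shown earlier in the thread with LOOK=2 and MATCH=3. The scratch directory name and the variable name id are only for this example, and plain awk is used where the thread uses nawk:

```shell
# Recreate the three test files in a scratch directory.
dir=/tmp/ciscodemo.$$
mkdir "$dir" && cd "$dir" || exit 1
printf 'B,2,1\nC,1,1\nD,5,5\n' > ciscostats_08032012
printf 'B,2,1\nB,2,1\n' > ciscostats_09032012
printf 'A,1,1\nA,1,1\nB,2,1\nD,5,5\n' > ciscostats_10032012

# Most recent file first (year, then month, then day, all descending).
files=`ls ciscostats_* | sort -t_ -k2.5r -k2.3,2.4r -k2.1,2.2r`

out=`awk -F, -v LOOK=2 -v MATCH=3 '
    FNR == 1   { F++ }                  # F = position of the current file in the list
               { id = $1 "," $2 "," $3 }
    F == 1     { a[id]++; next }        # a[]: counts from the newest file
    !(id in a) { next }                 # skip IDs that are not in the newest file
    F <= LOOK  { b[id]++ }              # b[]: counts in files 2 thru LOOK
               { c[id]++ }              # c[]: counts in every other file
    END {
        for (id in a)
            if (b[id] > 0 && a[id] + b[id] >= MATCH)
                print id ";\t\t" (a[id] + b[id])
            else if (c[id] + 0 == 0)
                print "NEW:" id ";\t\t" a[id]
    }
' $files | sort`

echo "$out"
cd / && rm -rf "$dir"
```

With these files B,2,1 reaches the MATCH count, A,1,1 is reported as NEW, and D,5,5 is printed nowhere because it also appears in ciscostats_08032012.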