Appending ErrorCodes to the corresponding error record

shree11 · June 13, 2014, 3:15am

Hi,

Here i'm using a awk inside bash script to validate the datafile by referring to the configuration file(schema file).
Here the validation check is done for datatype, field length and null values.
Once the validation is done on data file the error records are moved to the bad file. So till here everything is working fine .
Now i wanted to add a error code to each bad record in the badfile along with the error field.

Below i have shown the details of datafile, confile, code tried, output getting and output expected:
configurationfile:

id,Integer(3),NOT NULL
name,String(20)
state,String(5),NOT NULL
phone_no,Integer(4)
gender,Char(1)

datafile:

201,John,MI,4589,M
202,Lilly,FL,589,F
20w,Taylor,,5888,M
210,8888,OK,456
215,Madav,,4454,M
2165,ram,MI,4589,M
21734,Leena,,589,F
218,Rohan,CA,2212,M

Script/Code:

#!/bin/bash
awk -F "," -vDT="$(date +%m%d%Y%H%M)" 'BEGIN { 
GOOD = "good_" DT; #Adding timestamp into a GOOD file
BAD = "bad_" DT; #Adding timestamp into a BAD file
putB = "hadoop fs -put /home/user/data/" BAD " /user/user/bad/"} 
NR == FNR{
gsub("[)(]", "-", $2); 
split($2, a, "-"); 
split($1, c,",");
hh[NR] = c[1]; d[NR] = a[1]; l[NR] = a[2]; n[NR] = ($3 == "NOT NULL") ? 1 : 0; next} 
{
for(i = 1; i <= NF; i++)
{

if(((d == "Integer" && (($i + 0) == $i || $i == "")) || (d == "String" && ($i + 0) != $i) || (d == "Char" && ($i + 0) != $i)) && (length($i) <= l) && (length($i) >= n))
{f = 1} else {f = 0};
if(f == 0) {print $0 > BAD; b++; next} 
}
print $0 > GOOD; g++
}
END {
print "Count of Bad Records : " b;
#system(putB);
}' configfile.txt datafile2.txt

Output getting without errorcode:

20w,Taylor,,5888,M
210,8888,OK,456
215,Madav,,4454,M
2165,ram,MI,4589,M
21734,Leena,,589,F

Expected output along with the error detail along with the error field name:

20w,Taylor,,5888,M,datatypeerror|id
210,8888,OK,456,datatyprerror|name
215,Madav,,4454,M,nullerror|state
2165,ram,MI,4589,M,columnwidthError|id
21734,Leena,,589,F,columnwidthError|id,nullerror|state

In the above expected result, at the end of each record the error details along with the eoorr field. So this can be achieved?

Thanks,
Shree

MadeInGermany · June 13, 2014, 4:02pm

You must split your long all-in-one condition into several smaller conditions, and each time add an individual error message to an error -message variable.
For example:

f=0
error=""
if (d == "Integer" && (($i + 0) == $i || $i == "")) {f=1} else {error = error ",datatypeerror|" hh}
if (d == "String" && ($i + 0) != $i) {f=1} else {error = error ",nullerror|" hh}
if (d == "Char" && ($i + 0) != $i) {f=1} else {error = error ",datatyperror|" hh}
if ((length($i) <= l) && (length($i) >= n)) {f=1} else {error = error ",columnwidthError|" hh}

shree11 · June 16, 2014, 1:11am

Hi,
I tried the similar thing in my code. May be because some syantax checks it's giving me an error.
I got the logic you suggeseted but while implementing the same in the entire code somewhere it's going wrong.
Also i dint get how to insert the piece of code

f=0
error=""

I'm sure somewhere here its going wrong! How this can be acheived?

Thanks,
Shree

shree11 · June 20, 2014, 6:36am

Hi, Is there any updates on the above post ?

CarloM · June 20, 2014, 7:36am

Post your updated script, and the error you are getting.

RudiC · June 20, 2014, 7:52am

Try this

awk -F, 'NR==FNR        {NM[NR]=$1
                           split ($2,X,"[)(]")
                         TP[NR]=toupper(X[1])
                         MX[NR]=X[2]
                         MN[NR]=$3!=""
                         FLDS  =NR
                         next}
                        {OUT=$0
                         for (i=1; i<=FLDS; i++) {
                           L=length($i)
                           N=$i==$i+0
                           if (TP == "INTEGER" && !N) OUT = OUT ",dattyperr|" NM
                           if (TP == "STRING" &&   N) OUT = OUT ",dattyperr|" NM
                           if (L > MX)                OUT = OUT ",colwiderr|" NM
                           if (L < MN)                OUT = OUT ",nullerr|" NM
                          }
                         print OUT
                        }
        ' file1 file2
201,John,MI,4589,M
202,Lilly,FL,589,F
20w,Taylor,,5888,M,dattyperr|id,nullerr|state
210,8888,OK,456,dattyperr|name
215,Madav,,4454,M,nullerr|state
2165,ram,MI,4589,M,colwiderr|id
21734,Leena,,589,F,colwiderr|id,nullerr|state
218,Rohan,CA,2212,M

Checking for CHAR or FLOAT has not yet been implemented. Redirection shouldn't be a problem to you.

shree11 · June 23, 2014, 4:38am

Hi RudiC ,

Your code alone is working fine but i'm facing problem in redirection.When i implement the similar thing into my code its going wrong. Also in your code, you are printing the results on screen but i wanted to print those good and bad records on different files.

Thanks

RudiC · June 23, 2014, 5:13am

Try e.g.

print OUT > (OUT==$0)?GOOD:BAD

shree11 · June 23, 2014, 5:30am

Hi RudiC ,

My code is not giving the proper output. I have 20 records in my file .Out of 20 ,4 records are error records and rest 16 are good records. When i run the script which aontains the errorCode appending mechanism, all my 20 records are written to the good file and bad record file is not even getting created.
Below is the script:

#!/bin/bash
awk -F "," -vDT="$(date +%m%d%Y%H%M)" 'BEGIN { 
GOOD = "good_" DT; #Adding timestamp into a GOOD file
BAD = "bad_" DT; #Adding timestamp into a BAD file
putB = "hadoop fs -put /home/user/data/" BAD " /user/user/bad/"} 
NR == FNR{
gsub("[)(]", "-", $2); 
split($2, a, "-"); 
split($1, c,",");
hh[NR] = c[1]; d[NR] = a[1]; l[NR] = a[2]; n[NR] = ($3 == "NOT NULL") ? 1 : 0; next} 
{
 
for(i = 1; i <= NF; i++)
{
OUT=$0
if(d == "Integer" && (($i + 0) == $i || $i == "")) {f=1} else{OUT = OUT ",dattyperr|" hh}
if(d == "String" && ($i + 0) != $i){f=1} else{OUT = OUT ",dattyperr|" hh}
if(d == "Char" && ($i + 0) != $i) {f=1} else{OUT = OUT ",dattyperr|" hh}
if((length($i) <= l) && (length($i) >= n)){f=1} else{OUT = OUT ",colwiderr|" hh}
if(f == 0) {print OUT > BAD; b++; next} 
}
print OUT > GOOD; g++
}
END {
print "Count of Bad Records : " b;
print "Count of Good Records : " g;
#system(putB);
}' a.txt b.txt

My a.txt file is

id,Integer(3),NOT NULL
name,String(20)
state,String(5),NOT NULL
phone_no,Integer(4)
gender,String(1)

b.txt is:

201,John,MI,4589,M
202,Lilly,FL,589,F
203,Richard,CA,2212,M
204,Cruse,VA,2222,M
20w,Taylor,,5888,M
206,Merry,TX,6969,F
207,,CO,5656,F
208,Tom,AL,5555,M
209,Sam,FL,2586,M
210,8888,OK,456
211,George11,MI,5555,M
212,Reet,MI,4589,M
213,8888a,FL,5899,F
214,Meera,NY,2546,F
215,Madav,,4454,M
216,ram,MI,4589,M
217,Leena,FL,589,F
218,Rohan,CA,2212,M
219,Raj,VA,2222,M
TaylorSmith,NY,5888,M

The output written on the good file is :

201,John,MI,4589,M,dattyperr|gender,dattyperr|gender
202,Lilly,FL,589,F,dattyperr|gender,dattyperr|gender
203,Richard,CA,2212,M,dattyperr|gender,dattyperr|gender
204,Cruse,VA,2222,M,dattyperr|gender,dattyperr|gender
20w,Taylor,,5888,M,dattyperr|gender,dattyperr|gender
206,Merry,TX,6969,F,dattyperr|gender,dattyperr|gender
207,,CO,5656,F,dattyperr|gender,dattyperr|gender
208,Tom,AL,5555,M,dattyperr|gender,dattyperr|gender
209,Sam,FL,2586,M,dattyperr|gender,dattyperr|gender
210,8888,OK,456,dattyperr|phone_no,dattyperr|phone_no
211,George11,MI,5555,M,dattyperr|gender,dattyperr|gender
212,Reet,MI,4589,M,dattyperr|gender,dattyperr|gender
213,8888a,FL,5899,F,dattyperr|gender,dattyperr|gender
214,Meera,NY,2546,F,dattyperr|gender,dattyperr|gender
215,Madav,,4454,M,dattyperr|gender,dattyperr|gender
216,ram,MI,4589,M,dattyperr|gender,dattyperr|gender
217,Leena,FL,589,F,dattyperr|gender,dattyperr|gender
218,Rohan,CA,2212,M,dattyperr|gender,dattyperr|gender
219,Raj,VA,2222,M,dattyperr|gender,dattyperr|gender
TaylorSmith,NY,5888,M,dattyperr|phone_no,dattyperr|phone_no,dattyperr|phone_no

Expected output in bad file is :

20w,Taylor,,5888,M,dattyperr|id,nullerror|state
210,8888,OK,456,dattyperr|name
215,Madav,,4454,M,nullerror|state
TaylorSmith,NY,5888,M,dattyperr|id,dattyperr|state,dattyperr|phone_no

Also in my case, how can i append the nullerror.

Thanks,

RudiC · June 23, 2014, 6:05am

Applied to your latest files, my proposal gives:

GOOD:
201,John,MI,4589,M
202,Lilly,FL,589,F
203,Richard,CA,2212,M
204,Cruse,VA,2222,M
206,Merry,TX,6969,F
207,,CO,5656,F
208,Tom,AL,5555,M
209,Sam,FL,2586,M
211,George11,MI,5555,M
212,Reet,MI,4589,M
213,8888a,FL,5899,F
214,Meera,NY,2546,F
216,ram,MI,4589,M
217,Leena,FL,589,F
218,Rohan,CA,2212,M
219,Raj,VA,2222,M
BAD:
20w,Taylor,,5888,M,dattyperr|id,nullerr|state
210,8888,OK,456,dattyperr|name
215,Madav,,4454,M,nullerr|state
TaylorSmith,NY,5888,M ,dattyperr|id,colwiderr|id,dattyperr|state,dattyperr|phone_no

So, what does not work?

shree11 · June 23, 2014, 6:28am

Hi, your proposal is working fine.
But how can i add the good and bad records on the different files along with the timestamp attached to the good & bad filename. Alos i wanted to count the good and bad records and show it on the console. If i add code related the above requirements i'm getting the error.

Script used for this is:

#!/bin/bash
awk -F "," -vDT="$(date +%m%d%Y%H%M)" 'BEGIN {    
GOOD = "good_" DT;   #Adding timestamp into a GOOD file
BAD = "bad_" DT;     #Adding timestamp into a BAD file
putB = "hadoop fs -put /home/user/data/" BAD " /user/user/bad/"}     
NR == FNR{
gsub("[)(]", "-", $2);    
split($2, a, "-");    
split($1, c,",");
hh[NR] = c[1]; d[NR] = a[1]; l[NR] = a[2];MN[NR]=$3!=""; next}   
{
  
  for(i = 1; i <= NF; i++)
  {
     OUT=$0
     L=length($i)
     N=$i==$i+0
     if (d == "INTEGER" && !N) OUT = OUT ",dattyperr|" hh
     if (d == "STRING" &&   N) OUT = OUT ",dattyperr|" hh
     if (L > l)                OUT = OUT ",colwiderr|" hh
     if (L < MN)                OUT = OUT ",nullerr|" hh
 }
      print OUT > (OUT==$0)?GOOD:BAD
}
END {
print "Count of Bad Records : " b;
print "Count of Good Records : " g;
#system(putB);
    }' configfile.txt datafile2.txt

Also how can i count the good and bad records?

RudiC · June 23, 2014, 6:37am

Try

         OUT==$0        {print OUT > GOOD
                         g++
                         next}
                        {print OUT > BAD
                         b++}
         END            {print "Good: ", g, ", bad: ", b}

shree11 · June 23, 2014, 7:43am

Hi,

Still i'm getting some syntax error in the below code:

#!/bin/bash
awk -F "," -vDT="$(date +%m%d%Y%H%M)" 'BEGIN {    
GOOD = "good_" DT;   #Adding timestamp into a GOOD file
BAD = "bad_" DT;     #Adding timestamp into a BAD file
putB = "hadoop fs -put /home/user/data/" BAD " /user/user/bad/"}     
NR == FNR{
gsub("[)(]", "-", $2);    
split($2, a, "-");    
split($1, c,",");
hh[NR] = c[1]; d[NR] = a[1]; l[NR] = a[2];MN[NR]=$3!=""; next}   
{
  for(i = 1; i <= NF; i++)
  {
     OUT=$0
     L=length($i)
     N=$i==$i+0
     if (d == "INTEGER" && !N) OUT = OUT ",dattyperr|" hh
     if (d == "STRING" &&   N) OUT = OUT ",dattyperr|" hh
     if (L > l)                OUT = OUT ",colwiderr|" hh
     if (L < MN)                OUT = OUT ",nullerr|" hh
     OUT==$0 {print OUT > GOOD; g++;next}
  }
     {print OUT > BAD: b++}
}
END {      
print "Good: ", g, ", bad: ", b
print "parsing sucess!!";
print "Count of Bad Records : " b;
print "Count of Good Records : " g;
#system(putB);
    }' a.txt b.txt

CarloM · June 23, 2014, 8:06am

It's much easier for people to help you if you provide the actual error message.

shree11 · June 23, 2014, 8:09am

Error after excecuting the above code is :

awk: cmd. line:19:      OUT==$0 {print OUT > GOOD; g++;next}
awk: cmd. line:19:              ^ syntax error
awk: cmd. line:21:      {print OUT > BAD: b++}
awk: cmd. line:21:                      ^ syntax error

RudiC · June 23, 2014, 8:18am

Try to copy this as is into your script. The OUT==$0 is a pattern so it needs to be outside the braces, and you need a semicolon, not a colon after BAD:

                         if (L < MN)                OUT = OUT ",nullerr|" hh
                        }
        OUT==$0         {print OUT > GOOD; g++;next}
                        {print OUT > BAD; b++}

shree11 · June 23, 2014, 8:46am

Hi,

I don't know what is going wrong here. Code is working fine but everything is written on to the good file.No bad file is getting created.
The script is :

#!/bin/bash
awk -F, -vDT="$(date +%m%d%Y%H%M)" 'BEGIN {    
#Adding timestamp into a GOOD and BAD file
GOOD = "good_" DT;   
BAD = "bad_" DT;     
#Loading validated good and bad records into HDFS
putG = "hadoop fs -put /home/user/sapthashree/galileo/validation/" GOOD " /user/user/dataparse/good/";       
putG1 = "hadoop fs -put /home/user/sapthashree/galileo/validation/" GOOD " /user/user/dataparse/hdfs_good/";  
putB = "hadoop fs -put /home/user/sapthashree/galileo/validation/" BAD " /user/user/dataparse/bad/"}         

NR==FNR {h[NR]=$1
                         
split ($2,a,"[)(]")
d[NR]=toupper(a[1])
l[NR]=a[2]
n[NR]=$3!=""
FLDS  =NR
next}
{
OUT=$0
for (i=1; i<=FLDS; i++)
{
L=length($i)
N=$i==$i+0
if (d == "INTEGER" && !N) OUT = OUT ",dattyperr|" h
if (d == "STRING" &&   N) OUT = OUT ",dattyperr|" h
if (L > l)                OUT = OUT ",colwiderr|" h
if (L < n)                OUT = OUT ",nullerr|" h
}
OUT==$0 
{print OUT > GOOD; g++;next}
{print OUT > BAD; b++}
}
END {      
print "parsing sucess!!";
print "Count of Bad Records : " b;
print "Count of Good Records : " g;
#system(putB);
    }' a.txt b.txt

On console i'm seeing:

parsing sucess!!
Count of Bad Records :
Count of Good Records : 20

Contents of good_062320142336 is :

201,,MI,4589,M
202,Lilly,FL,589,F
203,Richard,CA,2212,M
204,Cruse,VA,2222,M
20w,Taylor,,5888,M,dattyperr|id,nullerr|state
206,Merry,TX,6969,F
207,,CO,5656,F
208,Tom,AL,5555,M
209,Sam,FL,2586,M
210,8888,OK,456,dattyperr|name
211,George11,MI,5555,M
212,Reet,MI,4589,M
213,8888a,FL,5899,F
214,Meera,NY,2546,F
215,Madav,,4454,M,nullerr|state
216,ram,MI,4589,M
217,Leena,FL,589,F
218,Rohan,CA,2212,M
219,Raj,VA,2222,M
TaylorSmith,NY,5888,M,dattyperr|id,colwiderr|id,dattyperr|state,dattyperr|phone_no

Hope you are alos referring to the same code.I know something silly mistake i'm doing here but not getting where it's going wrong!

Thanks

RudiC · June 23, 2014, 9:31am

}
OUT==$0 
{print OUT > GOOD; g++;next}
{print OUT > BAD; b++}
}                                        # put this brace up three lines; before OUT==$0
END {

shree11 · June 23, 2014, 11:52pm

Hi, I did the way you suggested but still it's not working fine. It's displaying good records on console and displaying all good and bad records with errorcode attached to the bad records into a single good file. Bad file is not getting created.

Code i used :

if (L < n) OUT = OUT ",nullerr|" h
}
}
OUT==$0 
{print OUT > GOOD; g++;next}
{print OUT > BAD; b++}
END {

Don_Cragun · June 24, 2014, 1:16am

shree11:

Hi, I did the way you suggested but still it's not working fine. It's displaying good records on console and displaying all good and bad records with errorcode attached to the bad records into a single good file. Bad file is not getting created.

Code i used :
if (L < n) OUT = OUT ",nullerr|" h
}
}
OUT==$0 
{print OUT > GOOD; g++;next}
{print OUT > BAD; b++}
END { 

There is a HUGE difference between the code RudiC suggested:

        OUT==$0         {print OUT > GOOD; g++;next}
                        {print OUT > BAD; b++}

and the code you're using:

OUT==$0 
{print OUT > GOOD; g++;next}
{print OUT > BAD; b++}

The condition OUT==$0 must be placed on the same line as the opening brace for the actions to be performed when that condition is true.

Otherwise, as you have seen, you get the default action for the good lines (print to standard output) and write every line (the default when no condition is specified) to the file named by GOOD and skip remaining lines in the script ( next ) (so nothing is written to the file named by BAD on the next line of your script.