CSV file parsing and validation

shree11 · April 15, 2014, 1:41am

Hi,
I have basic knowledge of Unix shell scripting (I am not an expert).

My requirement is to read a CSV file using the schema defined in a configuration file. If a record does not match the schema, it should be moved to an error file; records that match should go into another file.

In brief:
I'm defining the schema of the input file in a plain text file, which I'm calling the configuration file. My configuration file looks like this:
col1 integer NN
col2 string
col3 string NN

I also have a data file (say, a .csv file) that looks like this:

id,name,location
1,John,FL
2,Merry,,
3,Taylor,CA
a,George,MI

So I need to write a Unix script such that, while reading the CSV data file, it refers to the configuration file for each field's datatype and for whether that field may be NULL. If a record does not satisfy the conditions, the script should drop the whole record and move on to the next row.

It has to read each value in a row of the CSV file and check it against the datatype of the corresponding column in the configuration file. If the values match the datatypes, the record has to go into a new file; if not, the record has to go into an error file.
How can this be done using Unix shell scripting?

My expected result is:

id,name,location
1,John,FL
3,Taylor,CA

The rest of the records are dropped because:

2,Merry,, ---> here the 3rd field is NULL, but the configuration file marks it as NOT NULL (NN).

a,George,MI --> here the 1st field value is a STRING, whereas the conf file defines it as INTEGER.

So how could this be achieved in a Unix script, and which commands need to be used?

Please let me know if the scenario is not clear.

Thanks,
Shree11

keelba · April 15, 2014, 10:14am

You only listed two conditionals but hopefully you can take the syntax of this script and modify it to your needs.

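#!/bin/bash

# Start with empty output files.
> BADFILE
> GOODFILE

awk -F, '
  BEGIN { badfile = "BADFILE"; goodfile = "GOODFILE" }
  NR == 1 { print $0 >> goodfile; next }        # keep the header row
  {
    if ($1 !~ /^[0-9]+$/) print $0 >> badfile   # first field must be an integer
    else if ($3 == "")    print $0 >> badfile   # third field must not be empty
    else                  print $0 >> goodfile
  }' sourcefile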
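
The conditions above are hard-coded. As a rough sketch of how they could instead be read from the configuration file in the original question, the following assumes the schema is whitespace-separated (name, type, optional NN flag) and that "integer" means digits only. The file names schema.conf, data.csv, good.csv and bad.csv are placeholders, not names from the thread, and the files are written into the current directory.

```shell
#!/bin/sh
# Sketch only: file names and the digits-only rule for "integer" are assumptions.
cat > schema.conf <<'EOF'
col1 integer NN
col2 string
col3 string NN
EOF

cat > data.csv <<'EOF'
id,name,location
1,John,FL
2,Merry,,
3,Taylor,CA
a,George,MI
EOF

rm -f good.csv bad.csv

awk '
  # First file (schema): remember each column type and whether it is marked NN.
  NR == FNR { type[NR] = $2; notnull[NR] = ($3 == "NN"); ncols = NR; next }
  # Second file (data): the header row is copied to the good file unchanged.
  FNR == 1 { print > "good.csv"; next }
  {
    for (i = 1; i <= ncols; i++) {
      if ($i == "") {
        if (notnull[i]) { print > "bad.csv"; next }    # empty but declared NN
      } else if (type[i] == "integer" && $i !~ /^[0-9]+$/) {
        print > "bad.csv"; next                        # not a whole number
      }
    }
    print > "good.csv"
  }
' schema.conf FS=, data.csv

cat good.csv
```

The FS=, operand between the two file names switches the field separator for the data file only, so the schema can stay whitespace-separated. Fields beyond the columns listed in the schema (such as the extra empty field in 2,Merry,,) are ignored in this sketch.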

rdrtx1 · April 15, 2014, 11:21am

try also:

awk -F, '{x=$1; if (((x+=0) && ($3)) || NR==1) {of="good"} else {of="bad"} ; print > of}' in

shree11 · April 16, 2014, 12:26am

Here is what I have tried.
I created a script file called validate.sh:

#!/bin/bash

awk -F"|" '
BEGIN {
        # Read two lines so that var ends up holding the second line of configFile.
        getline var < "configFile"
        getline var < "configFile"
        n = split( var , arr , "|" )
        for ( i = 1 ; i <= n ; ++i )
        {
                if( arr[i] == "Numeric" )
                        regexp[i] = "^[0-9][0-9]*$"
                if( arr[i] == "String" )
                        regexp[i] = "^[a-zA-Z0-9][a-zA-Z0-9]*$"
        }
}
{
        for( i = 1 ; i <= NF ; ++i )
        {
                if ( ! match ( $i , regexp[i] ) )
                        break;
                if ( i == NF )
                        print $0
        }
}' dataFile

The code above validates the datatype. However, even if my configuration file has only 4 columns, a data-file row with 5 columns still appears in the output, which is not expected.

And the data file is:
col1|col2|col3|col4
510|abc|xxx|450
510|abc11|yyy|350
510|pqr|zz11z|670
312|22a2|439|110
312|qqq|rrr|11a0
3$33|sss|bbbb|110
333|A11ss|aAa|110

I also have to check for NULL values and field lengths.
How can that be done?
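
One possible way to add NULL and length checks is to carry them in the schema itself. The sketch below assumes a hypothetical pipe-separated schema layout of name|type|maxlen|NN (an empty fourth field meaning the column may be empty); that layout, the len_* file names, and the last four data rows are made up for illustration and do not come from the thread. A row is also rejected when its field count differs from the schema, which addresses the extra fifth column.

```shell
#!/bin/sh
# Sketch only: the schema layout name|type|maxlen|NN is an assumption.
cat > len_schema.txt <<'EOF'
col1|Numeric|4|NN
col2|String|5|NN
col3|String|5|
col4|Numeric|4|NN
EOF

cat > len_data.txt <<'EOF'
col1|col2|col3|col4
510|abc|xxx|450
312|qqq|rrr|11a0
3$33|sss|bbbb|110
510||xxx|450
51000|abc|xxx|450
510|abc||450
510|abc|xxx|450|extra
EOF

rm -f len_good.txt len_bad.txt

awk -F'|' '
  # Schema lines: name|type|maxlen|NN-flag (hypothetical layout).
  NR == FNR { type[FNR] = $2; maxlen[FNR] = $3 + 0; notnull[FNR] = ($4 == "NN"); ncols = FNR; next }
  FNR == 1  { print > "len_good.txt"; next }     # header row
  {
    bad = (NF != ncols)                          # wrong number of fields
    for (i = 1; i <= ncols && !bad; i++) {
      if ($i == "")                       bad = notnull[i]
      else if (length($i) > maxlen[i])    bad = 1
      else if (type[i] == "Numeric")      bad = ($i !~ /^[0-9]+$/)
      else if (type[i] == "String")       bad = ($i !~ /^[a-zA-Z0-9]+$/)
    }
    print > (bad ? "len_bad.txt" : "len_good.txt")
  }
' len_schema.txt len_data.txt

cat len_bad.txt
```

Checking emptiness first keeps the type patterns simple: an empty field is judged only by the NN flag, and a non-empty field is judged by length and then by type.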

shree11 · April 18, 2014, 7:08am

How would my code above change for the configuration file (configFile) below?

col1,Numeric
col2,String
col3,String
col4,Numeric

Please help me on this.

SriniShoo · April 18, 2014, 7:31am

If both the schema file (conf) and the data file (file) are comma-separated,
the code below will do the job: it sends eligible records to a file named "goodrec" and bad records to "badrec".

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "integer" && ($i + 0) == $i) || (a[i] == "string" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf file

shree11 · April 18, 2014, 7:46am

Hi SriniShoo,

In the code you suggested, where am I passing the config file?

Does the code take both the data file and the config file into account?

When I run the command below, I don't get the proper output in the good and bad files:

awk -F "," 'NR == FNR {a[NR] = $2; next}   FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}   FNR > 1 {for(i = 1; i <= NF; i++)   {if((a[i] == "integer" && ($i + 0) == $i) || (a[i] == "string" && $i != "")) {f = 1} else {f = 0};   if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' dataFile

Thanks,
Shree

SriniShoo · April 18, 2014, 8:36am

You need to pass both files on the command line:
the schema file first and the data file next.
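
The order matters because of the NR == FNR test: NR counts records across all input files, while FNR restarts at 1 for each file, so the two are equal only while the first file is being read. A small illustration, using made-up file names (demo_schema.csv, demo_data.csv, demo_out.txt) that are not from the thread:

```shell
#!/bin/sh
# Two tiny illustrative files (names and contents are assumptions for the demo).
printf 'id,Numeric\nname,String\n' > demo_schema.csv
printf '22,Lilly\n33,Taylor\n' > demo_data.csv

awk -F, '
  NR == FNR { type[FNR] = $2; next }     # first file only: store the type of column FNR
  { print FILENAME ": " $1 " is checked as " type[1] ", " $2 " as " type[2] }
' demo_schema.csv demo_data.csv > demo_out.txt

cat demo_out.txt
```

If the data file is given first, its rows are stored as if they were schema lines and the schema lines are then validated as data, which is why swapping or omitting a file gives confusing output.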

shree11 · April 18, 2014, 8:38am

But where can I specify the files in the code? I'm confused.

SriniShoo · April 18, 2014, 9:06am

In the code below, conf.csv is the name of the schema file and file.csv is the name of the data file:

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "integer" && ($i + 0) == $i) || (a[i] == "string" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv

shree11 · April 21, 2014, 1:08am

Hi, when I run this, the records are not filtered according to the conditions. The bad-records file contains all the records, and the good-records file does not contain all the good records. Am I missing a regex here to match all the conditions? Please advise.

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' cf df

The cf and df files are given below:
cf:

id,Numeric
name,String
state,String
street_No,Numeric

df:

abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q

My expected results in goodrec and badrec are given below.
goodrec:

22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258

badrec:

abc,john,MI,201
55,Richard,CA,21Q

SriniShoo · April 21, 2014, 1:26am

In your first example, you used "integer" and "string".
The code was written to check for "integer" and "string", not "Numeric" and "String". Case matters too.
You can change the code if you want to use these terms.
Below is the code for Numeric and String. By default, I am validating the null condition for all String fields.

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv
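
The numeric test ($i + 0) == $i relies on awk's comparison rules: a field that looks like a number is compared numerically, while anything else is compared as a string, so "21Q" is compared with "21" and fails. It is looser than a digits-only pattern, as the following small comparison suggests. The sample values and the file name numeric_demo.txt are illustrative, and the "strict" regex is one possible alternative rather than what the thread uses.

```shell
#!/bin/sh
# Compare the arithmetic test with a digits-only regex on a few sample values.
printf '405\n21Q\nabc\n3.5\n-7\n' |
awk '{
  loose  = (($1 + 0) == $1)      # true for anything that looks numeric
  strict = ($1 ~ /^[0-9]+$/)     # true for unsigned whole numbers only
  print $1, (loose ? "loose:yes" : "loose:no"), (strict ? "strict:yes" : "strict:no")
}' > numeric_demo.txt

cat numeric_demo.txt
```

If a column such as street_No must contain digits only, the regex form is probably the safer choice; if decimals and signs are acceptable, the arithmetic test is enough.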

shree11 · April 21, 2014, 1:33am

Hi,
Here I want to check the conditions based on my configuration file, i.e., cf.
cf:

id,Numeric
name,String
state,String
street_No,Numeric

In the cf file the datatypes are Numeric and String, so I changed the code to use Numeric and String. But how can I check the datatype validation in the above code?

SriniShoo · April 21, 2014, 1:53am

The code given above is already updated for Numeric and String.
Tell me if it doesn't work.

shree11 · April 21, 2014, 5:09am

No, the above code is not working according to the requirement.
The result I'm getting is:
goodrec:

abc,john,MI,201

badrec:

abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q

---------- Post updated at 04:09 AM ---------- Previous update was at 12:57 AM ----------

Hi,
Any updates on the above problem?
I also found a similar post at the link below, but there they do not pass the structure of the file directly. They create a file called pat and specify the patterns in it. I don't want to do that; I want to pass the definition (structure of the file) directly to validate length and datatype.

Thanks,
Shree

SriniShoo · April 21, 2014, 5:32am

Hey Shree, you can't keep changing the requirement.
Your first example had a header, so the code was written to meet that requirement.

With the data you gave most recently, I ran the code below; it ran fine and produced the expected results.

awk -F "," 'NR == FNR {a[NR] = $2; next}
  {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv

$ cat conf.csv
id,Numeric
name,String
state,String
street_No,Numeric
$ cat file.csv
abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q
$ awk -F "," 'NR == FNR {a[NR] = $2; next}
  {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv
$ cat goodrec
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
$ cat badrec
abc,john,MI,201
55,Richard,CA,21Q
$

shree11 · April 21, 2014, 5:46am

Sorry for changing the requirement.

When I run the script below, it puts all the records (good and bad) in the badrec file and does not create the goodrec file at all.

awk -F "," 'NR == FNR {a[NR] = $2; next}{for(i = 1; i <= NF; i++){if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' cf df

I don't know what's going wrong here.

SriniShoo · April 21, 2014, 6:08am

Are you sure cf and df have the same content as provided in the previous post?

shree11 · April 21, 2014, 6:23am

Yes. My cf and df are given below:
cf

id,Numeric
name,String
state,String
street_No,Numeric

df

abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q

---------- Post updated at 05:23 AM ---------- Previous update was at 05:13 AM ----------

Hi, now it's working fine. I deleted the existing cf and df files and re-created them. I also specified cf.txt and df.txt at the end of the script.
I want to add some more requirements to the working solution. Shall I open a new thread or continue in this one?
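
The thread does not say why re-creating the files helped. One common cause of this exact symptom (every record landing in badrec) is Windows-style CRLF line endings or other invisible trailing characters in the schema file, so that the stored type is "Numeric" plus a carriage return and never equals "Numeric". This is only a guess about what happened here. Running od -c on the file would show any \r characters. A sketch of a more tolerant version, using made-up crlf_* file names:

```shell
#!/bin/sh
# Illustrative files; crlf_schema.csv deliberately has CRLF line endings.
printf 'id,Numeric\r\nname,String\r\n' > crlf_schema.csv
printf '22,Lilly\n' > crlf_data.csv
rm -f crlf_good.txt crlf_bad.txt

awk -F, '
  { sub(/\r$/, "") }                     # drop a trailing carriage return; fields are re-split
  NR == FNR { type[FNR] = $2; next }     # schema file: store the column types
  {
    for (i = 1; i <= NF; i++)
      if (!((type[i] == "Numeric" && ($i + 0) == $i) || (type[i] == "String" && $i != ""))) {
        print > "crlf_bad.txt"; next
      }
    print > "crlf_good.txt"
  }
' crlf_schema.csv crlf_data.csv

cat crlf_good.txt
```

Without the sub() line, the same input would send 22,Lilly to the bad file, because neither type comparison can succeed.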

Thanks for your guidance and help.

Regards,
Shree

SriniShoo · April 21, 2014, 6:38am

Post your questions here.