CSV file parsing and validation

Hi,
I have basic knowledge of Unix shell scripting (I'm not an expert).

My requirement is to read a CSV file against a schema defined in a configuration file. If a record does not match the schema, it should be moved to an error file; the good records that do match should go into another file.

In brief:
Here I define the schema of the input file in a plain text file, which I call the configuration file. My configuration file looks like this:
col1 integer NN
col2 string
col3 string NN

I also have a data file (say, a .csv file), which looks like this:

id,name,location
1,John,FL
2,Merry,,
3,Taylor,CA
a,George,MI

So I need to write a Unix script that, while reading the CSV data file, refers to the configuration file for each field's datatype and for whether that field may be NULL. If a record does not satisfy these conditions, the script should drop the whole record and move on to validate the next row.

For each row of the CSV file, the script has to check every value against the datatype given for its column in the configuration file. If the values match, the record goes into a new file; if they do not, the record goes into an error file.
How can this be done using Unix shell scripting?

My expected result is:

id,name,location
1,John,FL
3,Taylor,CA

The rest of the records are dropped because:

2,Merry,, ---> here the 3rd field is NULL, but the configuration file declares it NOT NULL (NN)

a,George,MI ---> here the 1st field is a STRING, whereas the configuration file declares it INTEGER.

So how can this be achieved in a Unix script, and which commands need to be used?

Please let me know if the scenario is not clear.

Thanks,
Shree11
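
The requirement above can be sketched in awk roughly as follows. This is only a minimal sketch under some assumptions: the file names schema.conf, data.csv, good.csv and bad.csv are made up for illustration, the schema is whitespace-separated exactly as shown in the post, "integer" means one or more digits, and any other type accepts any text. An empty field is rejected only when the column is marked NN.

```shell
#!/bin/sh
# Recreate the sample files from the post (the file names are assumptions).
cat > schema.conf <<'EOF'
col1 integer NN
col2 string
col3 string NN
EOF

cat > data.csv <<'EOF'
id,name,location
1,John,FL
2,Merry,,
3,Taylor,CA
a,George,MI
EOF

awk '
  # First file (schema): split on whitespace, remember type and NN flag per column.
  NR == FNR { type[FNR] = $2; notnull[FNR] = ($3 == "NN"); ncols = FNR; next }

  # Second file (data): copy the header line to the good file unchanged.
  FNR == 1 { print > "good.csv"; next }

  {
    ok = 1
    for (i = 1; i <= ncols; i++) {
      if ($i == "") { if (notnull[i]) ok = 0 }                   # empty is fine unless NN
      else if (type[i] == "integer" && $i !~ /^[0-9]+$/) ok = 0  # digits only
    }
    print > (ok ? "good.csv" : "bad.csv")
  }
' schema.conf FS=, data.csv
```

The `FS=,` between the two file names switches the field separator to a comma only once awk starts reading the data file, so the schema can stay whitespace-separated.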

You only listed two conditions, but hopefully you can take the syntax of this script and modify it to suit your needs.

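#!/bin/bash

# Start with empty output files.
> BADFILE
> GOODFILE

awk -F, '
  BEGIN { badfile = "BADFILE"; goodfile = "GOODFILE" }
  NR == 1 { print >> goodfile; next }          # keep the header row
  {
    if ($1 !~ /^[0-9]+$/) print >> badfile     # 1st field must be a non-empty integer
    else if ($3 == "")    print >> badfile     # 3rd field must not be empty
    else                  print >> goodfile
  }' sourcefile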

Try also:

awk -F, '{of = (NR == 1 || ($1 ~ /^[0-9]+$/ && $3 != "")) ? "good" : "bad"; print > of}' in

Here is what I have tried.
I created a script file called validate.sh:

#!/bin/bash

awk -F"|" '
BEGIN {
# Skip the header line of configFile, then read the line of column types.
getline var < "configFile"
getline var < "configFile"
n = split( var , arr , "|" )
for ( i = 1 ; i <= n ; ++i )
{
        if( arr[i] == "Numeric" )
                regexp[i] = "^[0-9][0-9]*$"
        if( arr[i] == "String" )
                regexp[i] = "^[a-zA-Z0-9][a-zA-Z0-9]*$"
}
}
{
for( i = 1 ; i <= NF ; ++i )
{
        if ( ! match ( $i , regexp[i] ) )
                break;
        if ( i == NF )
                print $0
}
}' dataFile

The above code validates the datatype. However, if my configuration file has only 4 columns and the data file has 5, rows with a 5th column still appear in the output, which is not expected.

My configuration file is:
col1|col2|col3|col4
Numeric|String|String|Numeric

And the data file is:
col1|col2|col3|col4
510|abc|xxx|450
510|abc11|yyy|350
510|pqr|zz11z|670
312|22a2|439|110
312|qqq|rrr|11a0
3$33|sss|bbbb|110
333|A11ss|aAa|110

I also have to check for NULL values and field lengths.
How can that be done?

How would my code above change for the configuration file (configFile) below?

col1,Numeric
col2,String
col3,String
col4,Numeric

Please help me on this.
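
One way the NULL, length and column-count checks could be combined is sketched below. Several things here are assumptions rather than part of the thread: the third schema column (a maximum length) is a hypothetical extension of the `name,type` layout, the output names good.txt and bad.txt are made up, every field is treated as NOT NULL, and the last three data rows were added only to exercise the new checks.

```shell
#!/bin/sh
# Schema: name,type[,maxlen]. The third column is a hypothetical extension.
cat > configFile <<'EOF'
col1,Numeric,3
col2,String,5
col3,String,5
col4,Numeric,3
EOF

# Some rows from the post, plus three rows added here for illustration.
cat > dataFile <<'EOF'
col1|col2|col3|col4
510|abc|xxx|450
312|qqq|rrr|11a0
3$33|sss|bbbb|110
333|A11ss|aAa|110
510|abc|xxx|450|extra
510||xxx|450
510|abcdefgh|xxx|450
EOF

awk '
  # Schema lines (comma-separated): remember type and optional maximum length.
  NR == FNR { type[FNR] = $2; maxlen[FNR] = $3; ncols = FNR; next }

  FNR == 1 { print > "good.txt"; next }        # pass the header through

  {
    ok = (NF == ncols)                         # reject extra or missing columns
    for (i = 1; ok && i <= ncols; i++) {
      if ($i == "")                                             ok = 0  # NULL check
      else if (maxlen[i] != "" && length($i) > maxlen[i] + 0)   ok = 0  # length check
      else if (type[i] == "Numeric" && $i !~ /^[0-9]+$/)        ok = 0
      else if (type[i] == "String"  && $i !~ /^[a-zA-Z0-9]+$/)  ok = 0
    }
    print > (ok ? "good.txt" : "bad.txt")
  }
' FS=, configFile 'FS=|' dataFile
```

Comparing NF with the number of schema lines is what keeps a row with a 5th column out of the good file; looping only up to NF, as in validate.sh, lets the extra field through because no pattern is defined for it.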

If both the schema file (conf) and the data file (file) are comma-separated,
the code below will do the job: it sends eligible records to a file "goodrec" and bad data to "badrec".

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "integer" && ($i + 0) == $i) || (a[i] == "string" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf file

Hi SriniSho,

In the code you suggested, where am I passing the config file?

Does the code take both the data file and the config file into account?

When I run the command below, I don't get the proper output in the good and bad files:

awk -F "," 'NR == FNR {a[NR] = $2; next}   FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}   FNR > 1 {for(i = 1; i <= NF; i++)   {if((a[i] == "integer" && ($i + 0) == $i) || (a[i] == "string" && $i != "")) {f = 1} else {f = 0};   if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' dataFile

Thanks,
Shree

You need to pass both files:
the schema file first and the data file next.

But where do I specify the files in the code? I'm confused.

In the code below, conf.csv is the name of the schema file and file.csv is the name of the data file:

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "integer" && ($i + 0) == $i) || (a[i] == "string" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv
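
To see why both files are named on the command line, and why the schema comes first, a small trace like the one below may help. The two sample files are cut-down illustrations and nr_fnr.out is a made-up output name. NR counts records across all input files, while FNR restarts at 1 for each file, so `NR == FNR` is true only while awk is reading the first file named.

```shell
#!/bin/sh
# Tiny sample inputs, for illustration only.
cat > conf.csv <<'EOF'
id,integer
name,string
EOF

cat > file.csv <<'EOF'
id,name
1,John
EOF

awk -F, '
  NR == FNR { print "schema line " FNR ": column " $1 " is " $2; next }
            { print "data line " FNR ": " $0 }
' conf.csv file.csv > nr_fnr.out

cat nr_fnr.out
```

If only the data file is passed, every line satisfies `NR == FNR`, so the data is loaded as if it were the schema and nothing is ever validated.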

Hi, when I do this, the records are not filtered according to the conditions. The bad-records file contains all the records, and the good-records file does not contain all the good records. Am I missing a regex here to match all the conditions? Please advise.

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' cf df

The cf and df files are given below:
cf:

id,Numeric
name,String
state,String
street_No,Numeric

df:

abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q

My expected results in goodrec and badrec are given below:
goodrec:

22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258

badrec:

abc,john,MI,201
55,Richard,CA,21Q

In your first example, you used "integer" and "string".
The code was written to check for "integer" and "string", not "Numeric" and "String". Even the case matters.
You can change the code if you want to use these terms.
Below is the code for Numeric and String. By default, I am validating the null condition for all String fields.

awk -F "," 'NR == FNR {a[NR] = $2; next}
  FNR == 1 {print $0 > "goodrec"; print $0 > "badrec"}
  FNR > 1 {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv
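
Since the comparison of type names is case-sensitive, one option is to normalise the names when the schema is loaded. The following is a sketch only, not the code from the thread: it assumes header-less files, uses a digits-only regular expression for numeric columns instead of the `$i + 0` comparison, and treats any other type name as "non-empty text". The mixed-case schema is made up to show the effect.

```shell
#!/bin/sh
# Schema with deliberately mixed-case type names (made up for illustration).
cat > conf.csv <<'EOF'
id,Numeric
name,STRING
state,string
street_No,numeric
EOF

cat > file.csv <<'EOF'
abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q
EOF

awk -F, '
  # Lower-case the type once, so "Numeric", "NUMERIC" and "numeric" all match.
  NR == FNR { type[FNR] = tolower($2); next }
  {
    for (i = 1; i <= NF; i++) {
      if (type[i] == "numeric") ok = ($i ~ /^[0-9]+$/)   # digits only
      else                      ok = ($i != "")          # any non-empty text
      if (!ok) { print > "badrec"; next }
    }
    print > "goodrec"
  }
' conf.csv file.csv
```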

Hi,
Here I wanted to check the conditions based on my configuration file, i.e., cf
cf:

id,Numeric
name,String
state,String
street_No,Numeric

In the cf file the datatypes are Numeric and String, so I changed the code to use Numeric and String. But how can I check the datatype validation in the above code?

The code given above is updated for Numeric and String.
Tell me if that doesn't work.

No, the above code is not working according to the requirement.
The result I'm getting is:
goodrec:

abc,john,MI,201

badrec:

abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q

---------- Post updated at 04:09 AM ---------- Previous update was at 12:57 AM ----------

Hi,
Any updates on the above problem?
I also found a similar post at the link below. But there they are not passing the structure of the file directly. They create a file called pat and specify the patterns there. I don't want to do that; I want to pass the definition (the structure of the file) directly to validate length and datatype.

Thanks,
Shree

Hey Shree, you can't keep changing the requirement.
In your first example the file had a header, so the code was written to meet that requirement.

With the data you gave most recently, I ran the code below; it ran fine and produced the expected results:

awk -F "," 'NR == FNR {a[NR] = $2; next}
  {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv
$ cat conf.csv
id,Numeric
name,String
state,String
street_No,Numeric
$ cat file.csv
abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q
$ awk -F "," 'NR == FNR {a[NR] = $2; next}
  {for(i = 1; i <= NF; i++)
  {if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};
  if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' conf.csv file.csv
$ cat goodrec
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
$ cat badrec
abc,john,MI,201
55,Richard,CA,21Q
$

Sorry for changing the requirement.

When I run the script below, it puts all the records (good and bad) in the badrec file and does not create the goodrec file at all.

awk -F "," 'NR == FNR {a[NR] = $2; next}{for(i = 1; i <= NF; i++){if((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != "")) {f = 1} else {f = 0};if(f == 0) {print $0 > "badrec"; next}} print $0 > "goodrec"}' cf df

Don't know what's going wrong here.

Are you sure cf and df have the same content as provided in the previous post?

Yes. My cf and df are given below:
cf

id,Numeric
name,String
state,String
street_No,Numeric

df

abc,john,MI,201
22,Lilly,CA,405
33,Taylor,NY,415
44,Reet5,NY,258
55,Richard,CA,21Q

---------- Post updated at 05:23 AM ---------- Previous update was at 05:13 AM ----------

Hi, now it's working fine. I deleted the existing cf and df files and re-created them. I also specified cf.txt and df.txt at the end of the script.
I would like to add some more requirements to the working solution. Shall I open a new thread or continue in this one?

Thanks for your guidance and help.

Regards,
Shree
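
The thread does not say why re-creating the files helped. One common cause of this exact symptom (every record landing in badrec) is Windows-style CRLF line endings: the type is then read as "Numeric" followed by a carriage return and never equals "Numeric". That is only a guess about this case, but the sketch below reproduces the symptom with CRLF copies of cf and df and shows one way to guard against it.

```shell
#!/bin/sh
# Build cf and df with CRLF line endings to reproduce the symptom (an assumption
# about the cause, not something confirmed in the thread).
printf 'id,Numeric\r\nname,String\r\nstate,String\r\nstreet_No,Numeric\r\n' > cf
printf 'abc,john,MI,201\r\n22,Lilly,CA,405\r\n55,Richard,CA,21Q\r\n' > df

awk -F, '
  { sub(/\r$/, "") }                  # drop a trailing carriage return, if any
  NR == FNR { a[NR] = $2; next }
  {
    for (i = 1; i <= NF; i++) {
      if ((a[i] == "Numeric" && ($i + 0) == $i) || (a[i] == "String" && $i != ""))
        continue                      # this field is valid, check the next one
      print > "badrec"; next
    }
    print > "goodrec"
  }
' cf df
```

Without the `sub()` line, the same command sends all three records to badrec. Running `od -c cf | head` is a quick way to check whether a file really contains `\r` characters.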

Post your questions here.