What is the best way to validate records in a file?

hi all,

We are validating the records of a delimited file.

The delimited file will have data like this:

Aaaa|sdfhxfgh|sdgjhxfgjh|sdgjsdg|sgdjsg|
Aaaa|sdfhxfgh|sdgjhxfgjh|sdgjsdg|sgdjsg|
Aaaa|sdfhxfgh|sdgjhxfgjh|sdgjsdg|sgdjsg|
Aaaa|sdfhxfgh|sdgjhxfgjh|sdgjsdg|sgdjsg|

So we are checking whether each record in the file has valid column lengths or not.

The structure of the file/table is configured in Teradata; we fetch the column lengths from that definition.
ex:

col1 varchar(5),
col2 varchar(5),
col3 varchar(5),
col4 varchar(5)

We have to check that no column's field length is greater than 5; if it is, we write the whole error record to a bad file.

In the script: col_nm col_order_num col_len
col_nm = column name
col_order_num = order number, i.e. the position of the column in that table (1, 2, 3, ...)
col_len = length of the column
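For illustration, $RPT_FILE would contain one line per column, in col_nm col_order_num col_len order (these names and widths are made up; the real ones come from the Teradata definition):

```
col1 1 5
col2 2 5
col3 3 5
col4 4 5
```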

#------------------------------------------
#  Reading through the file and checking for the column length
#----------------------------------------------------
                logNote "Reading through the temp file and checking for the column length"
 
                while read col_nm col_order_num col_len
                do
                                typeset -i col_len
                                typeset -i col_len_good
 
                                col_len_good=`expr $col_len + 1`
 
                                logNote "col_nm : $col_nm"
                                logNote "col_order_num : $col_order_num"
                                logNote "col_len : $col_len"
                                logNote "col_len_good : $col_len_good"
 
                                # -F sets FS before the first record is split (a {FS="|"} action block
                                # would leave the first record split on the default FS); -v passes the
                                # shell values in without quote-juggling
                                awk -F'|' -v col_ord="$col_order_num" -v col_l="$col_len" 'length($col_ord) > col_l' "$Src_File" >> "$Src_File.bad"
 
                                awk -F'|' -v col_ord="$col_order_num" -v col_l="$col_len_good" 'length($col_ord) < col_l' "$Src_File" > "$Src_File.temp"
 
                                rm -f $Src_File
                                mv $Src_File.temp $Src_File
 
                done <$RPT_FILE

================================
We are using this script but the validation is very slow: it reads and rewrites the entire data file once per column, so the runtime grows with the number of columns times the file size.
Can anyone come up with a better way, please?

You could try to do it all in awk; that would speed things up. For example:

awk '
  NR==FNR{
    W[$2]=$3
    next
  }
  {
    for(i in W)
      if(length($i)>W[i]){
        print > "file.bad"
        next
      }
  }
  1
' FS='[^0-9]*' colwidthfile FS=\| file
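To see it run end to end, here is a self-contained sketch with made-up file names and sample data (the width file uses the same "colN varchar(M)" layout as the Teradata definition above, and the comparison is written as W[i], i.e. against the stored width for column i):

```shell
#!/bin/sh
# Sample column-width file in the DDL-like layout; FS='[^0-9]*' below pulls
# out the column number from "colN" and the width from "varchar(M)".
cat > colwidthfile <<'EOF'
col1 varchar(4),
col2 varchar(8),
col3 varchar(8),
col4 varchar(7),
col5 varchar(6)
EOF

# Sample data: the first record's third field (10 chars) exceeds varchar(8).
cat > file <<'EOF'
Aaaa|sdfhxfgh|sdgjhxfgjh|sdgjsdg|sgdjsg|
Aaaa|sdfhxfgh|short|sdgjsdg|sgdjsg|
EOF

awk '
  NR==FNR{
    W[$2]=$3                 # W[column number] = max width
    next
  }
  {
    for(i in W)
      if(length($i)>W[i]){   # field longer than its column width -> bad record
        print > "file.bad"
        next
      }
  }
  1                          # no oversized field -> print the good record
' FS='[^0-9]*' colwidthfile FS='|' file > file.good
```

After running this, file.bad holds the record with the oversized third field and file.good holds the record that fits every width, all in a single pass over the data file.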

Try like...

awk -F\| 'length($1)<=4 && length($2)<=4 && length($3)<=4 && length($4)<=4' test.txt

I am getting the table definition (column sizes) from the database, so I have to fetch the values; I can't hard-code them like that. @bmk


@Scrutinizer, can you explain the code please?

Sure:

awk '
  NR==FNR{                                 # When the first file is being read (only then are FNR and NR equal)
    W[$2]=$3                               # create an (associative) array element for the column widths with the second 
                                           # field as the index using the Field separator (FS) (see below)
    next                                   # Proceed to the next record
  }
  {
    for(i in W)                            # for every line in the second file, for every column in array W
      if(length($i)>W[i]){              # if the length of the corresponding field is more than the max column width then
        print > "file.bad"                 # print that record of the second file to "file.bad"
        next                               # Proceed to the next record
      }
  }
  1                                        # If there are no fields with more characters than the max column width then print the record..
' FS='[^0-9]*' colwidthfile FS=\| file     # Set FS to any sequence of non-digits for the first file. Set it to "|" for the second file.