Check whether a given file is in ASCII format and data is tab-delimited

Hi All,

Please help me out with a script that checks whether a given file, say abc.txt, is in ASCII format and its data is tab-delimited. If either condition is not satisfied, it should generate error code "100" for a file not in ASCII format and "105" for a file not in tab-delimited format.
If both conditions are satisfied, it should check whether field 1 matches its datatype and length (numeric(9)); if not, error "101". It should then check whether field 2, field 3 and field 5 (which are of date data type) hold data in date format (yyyymmdd). It should generate error code 112 if field 2 is not in date format or is null, 113 if field 3 is not in date format or is null, and so on. If the field is null, it should generate an error code, say 150.

Data starts on the 2nd line, as the first line contains the filename, file size and record count.

sample file: abc.txt
row 1 : abc.txt0824673850572854
row 2 : 545689512<tab>20070424<tab>20070414<tab>456.25<tab>20061121<tab>pqr
row 3 : 602584561<tab>20060726<tab>20060524<tab>800.12<tab><tab>abc
row 4 : 24<tab><tab>05242006<tab>22.15<tab>20050815<tab>xyz
.
.
.
row n : 57<tab>20040425<tab>20041214<tab>486.75<tab>20040628<tab>stv

There is a command in Unix:

file <file_name>

# it shows file type...

What do you mean by "ASCII format"? Do you mean a file in which no bytes have the top bit set (i.e., all are values less than 128)?

Or do you mean it only contains printable characters?
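
Either reading can be tested with standard tools. The following is only a sketch: the function names is_seven_bit and is_printable_ascii are made up for illustration, and the second one assumes that tabs and newlines are acceptable alongside the printable characters.

```shell
# Hypothetical helper: succeeds if FILE contains no byte with the top bit set.
is_seven_bit() {
    # Delete every byte in the range 0-127; whatever is left is 128 or above.
    high=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
    [ "$high" -eq 0 ]
}

# Hypothetical helper: succeeds if FILE contains only printable ASCII,
# tabs and newlines (an assumption about what "printable" should allow).
is_printable_ascii() {
    # Delete tab, newline and the range space..tilde; count what remains.
    other=$(LC_ALL=C tr -d '\011\012\040-\176' < "$1" | wc -c)
    [ "$other" -eq 0 ]
}
```

A script could then call is_seven_bit abc.txt and exit with 100 when it fails, which is the error code named in the original question.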

That is not a date format; that is an integer, and if it happens to contain a date, how are you supposed to tell? You should use the standard date format, YYYY-MM-DD.

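awk 'BEGIN { FS = "\t" }  ## FS (not IFS) is the awk field separator
 NR == 1 { next }  ## ignore first line
 !/\t/ { exit 105 }  ## line doesn't contain a tab
 length($1) != 9 || $1 ~ /[^0-9]/ { exit 101 }
  {
     n = 2
     while ( n <= NF ) {
        ## as written, every field from 2 onwards must be non-empty digits
        if ( length($n) == 0 || $n ~ /[^0-9]/ ) exit 110 + n
        ## add other tests if desired
        n++  ## move on to the next field
     }
  }' abc.txt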


Thank you cfajohnson for your quick response. I'll confirm what ASCII format means. For now I only know that my script should check whether a file is in ASCII format or not. Regarding the date format, my requirement is to match the data type and the length. The value I'll be getting is 20070425 and the data type is date, so how do I check it? Is it not possible to check for the date data type if the data comes as yyyymmdd?
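
A yyyymmdd value can at least be checked for plausibility: eight digits, a month from 1 to 12, and a day that exists in that month. The sketch below assumes an awk that supports user-defined functions (nawk, gawk or any POSIX awk, not the oldest awk); the names check_yyyymmdd and valid_date are invented for illustration.

```shell
# Hypothetical helper: prints "ok" for a plausible yyyymmdd string, "bad" otherwise.
check_yyyymmdd() {
    printf '%s\n' "$1" | awk '
    function valid_date(s,    y, m, d, dim) {
        # Exactly eight digits, nothing else
        if (s !~ /^[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]$/) return 0
        y = substr(s, 1, 4) + 0
        m = substr(s, 5, 2) + 0
        d = substr(s, 7, 2) + 0
        if (m < 1 || m > 12 || d < 1) return 0
        # Days in each month; February is adjusted for leap years
        split("31 28 31 30 31 30 31 31 30 31 30 31", dim, " ")
        if (y % 4 == 0 && (y % 100 != 0 || y % 400 == 0)) dim[2] = 29
        return d <= dim[m] + 0
    }
    { print (valid_date($0) ? "ok" : "bad") }'
}
```

With the sample data, 20070424 passes, while 05242006 from row 4 is rejected because its month would be 20. An empty field is rejected as well, since it is not eight digits.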

Hi,
I am trying to execute the following script but I am getting error:
My requirement is to check whether the data in the file is tab-delimited; if it is not, pass the error code, say "105", to var1 and the description "not tab delimited" to var2. The data to check starts from the 3rd line of the file. If the file is tab-delimited, the script should check whether field 1 matches its datatype and length (numeric(9)) and is not null; if not, var1 = "101" and var2 = "Missing/wrong field1". It should then check field 2 against its datatype and length (char(9)) and for null; if that fails, var1 = "102" and var2 = "Missing/wrong field2", and so on. Any help would be appreciated.

Here is the code:
#!/bin/ksh
eval $(awk 'BEGIN { IFS = "\t" }
NR>=3 {print $1}
!/\t/ ## check whether lines contain tab else var1="105" and var2="No Tabs"
{
if ( length($1) == 0 || $1 !~ /[^0-9]/ ) ## check for null and numeric value and length(9)
then
var1="101"
var2="Missing or wrong First Field"
elif ( length($2) == 0 || $2 !~ /[a-zA-Z]/ ) ## check for null and char value and length(9)
then
var1="102"
var2="Missing or Wrong Second Field"
fi
}
}' $1)

echo "$var1"
echo "$var2"

What is the error?

If it's code, please put it inside [CODE] tags so that it is properly formatted.

What is the output of the awk script that you expect to eval?

In order to use eval, you need to output valid shell code.
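
As a rough illustration of what "valid shell code" means here, awk can print assignment statements that the shell then evaluates. This is only a sketch: the function name check_first_field is hypothetical, and the error code and message are taken from the script in the question.

```shell
# Hypothetical helper: prints shell assignments for the first line
# (from line 3 onwards) whose first field is not exactly nine digits.
check_first_field() {
    awk -F '\t' '
    NR >= 3 && (length($1) != 9 || $1 ~ /[^0-9]/) {
        print "var1=101"
        print "var2=\"Missing or wrong First Field\""
        exit
    }' "$1"
}

# Typical use inside a script (file name passed as the first argument):
#   var1=0 var2=Success
#   eval "$(check_first_field "$1")"
```

If no line fails the test, awk prints nothing, eval has nothing to run, and var1 and var2 keep whatever values they had before.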

You haven't checked that the length is 9. You have checked that it is not empty and that it doesn't contain any numbers.

That is not awk syntax.

There is no 'then', 'elif', or 'fi' keyword in awk.
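
For comparison, awk uses C-style if / else if / else with braces. A minimal sketch, assuming field 1 must be nine digits and field 2 must be non-empty letters (the function name classify_line is made up for the example):

```shell
# Hypothetical helper: prints an error code for one tab-delimited line.
classify_line() {
    printf '%s\n' "$1" | awk -F '\t' '{
        if (length($1) != 9 || $1 ~ /[^0-9]/) {
            print 101        # field 1 is not exactly nine digits
        } else if (length($2) == 0 || $2 ~ /[^a-zA-Z]/) {
            print 102        # field 2 is empty or contains a non-letter
        } else {
            print 0          # both fields passed
        }
    }'
}
```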

You still haven't (even after the syntax is fixed) checked that the length is 9. You have checked that it is not empty and that it doesn't contain any letters.

I suggest that you start with the code I posted, and tell us what it lacks. (Reply directly to that post, quoting relevant segments.)

I am totally confused now.
I am a newbie and wrote the above script with the help of this forum.
I'll get a file which is tab-delimited, with data from the 3rd line onwards. The first field is numeric(9) not null, the second field is char(8) not null, the third field is numeric(9) null and the fourth field is (13) not null. My requirement is first to check whether the file is in tab-delimited format. If it is not, generate an error and set var1 to "101" and var2 to "Not in tab-delimited format". If it is, check the first field's datatype and length and that it is not null; if it doesn't match, set var1 to "110" and var2 to "Mismatch/Wrong Field one". If it matches, check the second field and, on a mismatch, set var1 to "120" and var2 to "Mismatch/Wrong Field two", and so on. I want to use var1 and var2 for other computation. The comments you have written above have gone over my head. Please help me.

You haven't answered the questions I asked, so let's start from the beginning.

(I have reformatted your post so that it is easier to understand.)

Are there only four fields?
If there are more, what conditions must they meet?
If a field has a length greater than 0, then it is not null; or do you mean something else by "not null"?

This script checks the first four fields per line.

It also gives the line number where the error occurred.

errline=$( awk 'BEGIN { FS = "\t" }
   NR <= 2 { next } ## skip the first two lines
   !/\t/ { exit 101 }  ## line does not contain a tab

   ## Fields 1 and 3 must be 9 characters and contain only digits
   length($1) != 9 || $1 ~ /[^0-9]/ { exit 110 }
   length($3) != 9 || $3 ~ /[^0-9]/ { exit 130 }

   ## Fields 2 and 4 must be 8 and 13 characters respectively
   length($2) != 8                  { exit 120 }
   length($4) != 13                 { exit 140 }

END { print NR }
' "$FILE"
)

var1=$?  ## Set variable to the exit code of the awk script

## Assign var2 based on awk's return code
case $var1 in
   101) var2="Not in tab-delimited format" ;;
   110) var2="Mismatch/Wrong Field one" ;;
   120) var2="Mismatch/Wrong Field two" ;;
   130) var2="Mismatch/Wrong Field three" ;;
   140) var2="Mismatch/Wrong Field four" ;;
esac

printf "Error number: %d, line %d\n" "$var1" "$errline"
printf "Error message: %s\n" "$var2"


Marvellous !!
It's working perfectly.

Thank you so much. I appreciate it.

It is not checking for tab-delimited format. The following is the error I am getting. Can you please help?

Script:
#!/bin/ksh

awk -F\t 'NR>=3 { ## to start from 3rd line
!/\t/ {exit 101} ## not working
if ( $2 ~ /^ *$/ || $2 ~ /[^0-9]/ || length($1)!=7 ) {exit 102}
if ( $6 ~ /^ *$/ || $6 ~ /[^0-9]/ || length($6) !=5 ) {exit 106}
}' $1
var1=$? ## set variable to the exit code of awk script

## Assign var2 based on awk's return code
case $var1 in
101) var2="File not in tab-delimited format" ;;
102) var2="Mismatch/Wrong Field2" ;;
106) var2="Mismatch/Wrong Field6" ;;
*) var2="Success" ;;
esac
print "$var1"
print "$var2"

I am getting error :
$ test7 sample.txt
awk: syntax error near line 2
awk: illegal statement near line 2
awk: bailing out near line 3

The sample file is:
xxyyzz20070503100717001
abcd.txt000000027600000002
1234567 3809363 175268 849036 94425 284437
2271208 3809365 175268 849036 94425 284437
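
Two things in the script above look like plausible causes of the syntax error, though this is a reading of the code rather than a confirmed diagnosis. A "pattern { action }" pair such as !/\t/ {exit 101} is only valid at the top level of an awk program, not inside another action block, and an unquoted -F\t reaches awk as -Ft, which awk implementations do not all treat the same way. The sketch below moves each test to the top level and quotes the separator. The function name check_file is hypothetical, the field lengths are copied from the post, and it assumes the length test was meant for field 2 rather than field 1.

```shell
# Hypothetical helper: returns 0 on success or one of the error codes
# 101, 102, 106 used in the post above.
check_file() {
    awk -F '\t' '
    NR < 3 { next }        # data starts on the third line
    !/\t/  { exit 101 }    # no tab anywhere on the line
    $2 !~ /^[0-9]+$/ || length($2) != 7 { exit 102 }
    $6 !~ /^[0-9]+$/ || length($6) != 5 { exit 106 }
    ' "$1"
}

# Typical use inside a script:
#   check_file "$1"
#   var1=$?
```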