The bash script below is a file validation check that verifies the header count (10 columns) and the data type of each field in a tab-delimited file. The key lists the expected data type of each field. My real data has 58 headers, but only the header and the next row need to be checked, because the data type of each line after the header is the same as the line above it. The example files below contain all possible data types. Every field will have some sort of data in it: a number, alpha characters, or a . (dot) for a null value. If the file is validated, a message indicating this is written to the output; otherwise the missing header or bad data type is written to the output.
I'm not sure if the below is the best way to do this, but hopefully it is close. Each line is commented as to what I think is happening. Thank you :).
There are 3 example files, one for each of the possible outcomes.
file1 --- is a good file: both the header and the data type in every field are validated
file2 --- is a bad file: the header line is good, but the data type expected in Qual is alpha and it is a . (dot)
file3 --- is a bad file: the header line is not good (10 columns are expected and Score is missing), though the data types are as expected
key
Index Chr Start End Ref Alt Freq Qual Score Input ---- defined 10 column headers ----
Integer Integer Integer Integer Alpha Alpha Integer Alpha Integer Integer --- data type of each line after header ----
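The header row of the key can be turned into a check that names the missing column rather than only counting fields. This is a minimal sketch, not a tested solution for the real 58-column data: the function name `check_header` and the `expected` array are hypothetical, and it assumes the file is tab-delimited with the header on the first line.

```shell
#!/bin/bash
# Hypothetical helper: compare the first line of a tab-delimited file
# against the 10 expected header names from the key.
expected=(Index Chr Start End Ref Alt Freq Qual Score Input)

check_header() {
    local file=$1 name h found
    local -a actual missing=()
    # Read only the header row and split it on tabs.
    IFS=$'\t' read -r -a actual < "$file"
    for name in "${expected[@]}"; do
        found=0
        for h in "${actual[@]}"; do
            [[ $h == "$name" ]] && { found=1; break; }
        done
        (( found )) || missing+=("$name")
    done
    if (( ${#missing[@]} > 0 )); then
        echo "file is missing header for: ${missing[*]}"
        return 1
    elif (( ${#actual[@]} != ${#expected[@]} )); then
        echo "file has ${#actual[@]} fields, expected ${#expected[@]}"
        return 1
    fi
    echo "file has expected number of fields"
}
```

It could be called as `check_header file >> output`; for the real data the `expected` array would need all 58 header names.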
file1
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 . 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file3
Index Chr Start End Ref Alt Freq Qual Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
#!/bin/bash
# call bash script
awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}' file >> output # detect header row in file and store in output
nf=$(awk -F'\t' '{print NF; exit}' file) # NF only exists inside awk, so capture the header field count in a bash variable
if [[ $nf -eq 10 ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file does not have the 10 expected header fields, found:" # header count is wrong ...file not-validated
echo "$nf"
fi # close if.... else
isnumeric() # numeric function: prints 1 if the field is numeric, else 0
{ # start block
local re='^([0-9]+\.?[0-9]*|\.[0-9]*)$' # assumes integers, decimals such as .002, or a lone . (null) are allowed
if [[ $1 =~ $re ]]; then echo 1; else echo 0; fi # display result
} # end block
isalpha() # character function: prints 1 if the field is alpha, else 0
{ # start block
local re='^[[:alpha:] -]+$' # assumes letters, spaces and - are allowed, as in Ref, Alt and Qual of file1
if [[ $1 =~ $re ]]; then echo 1; else echo 0; fi # display result
} # end block
col1="" # define col to search
col2="" # define col to search
col3="" # define col to search
col4="" # define col to search
col5="" # define col to search
col6="" # define col to search
col7="" # define col to search
col8="" # define col to search
col9="" # define col to search
col10="" # define col to search
let retval=1 # data to check in this row
line=0 # number of lines read so far
while IFS=$'\t' read -r col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 # start loop and split each tab-delimited line into col1..col10
do
line=$((line + 1))
if [[ $line -eq 1 ]]; then continue; fi # skip the header row, only data rows are type-checked here
# check numeric columns (1-4, 7, 9, 10) and alpha columns (5, 6, 8) together
if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 ]] &&
[[ $(isalpha "$col5") -eq 1 && $(isalpha "$col6") -eq 1 && $(isalpha "$col8") -eq 1 ]]; then
retval=1 # every field in this row has the expected data type
else
retval=0 # at least one field has the wrong data type, no break so the result is still reported
fi # close if.... else
if [[ $retval -eq 1 ]]; then # display results
echo "file is correct data type in each field" # file is validated
else
echo "file is not the correct data type for:" # columns ...in file not-validated
echo "$col1 $col2 $col3 $col4 $col5 $col6 $col7 $col8 $col9 $col10"
fi # close if.... else
# NF is an awk variable, not a bash one, so count the header fields with awk here
if [[ $(awk -F'\t' '{print NF; exit}' file) -eq 10 && $retval -eq 1 ]]; then # execute and display file validated
echo "file is validated"
else
echo "file is not validated"
fi
break # only the row after the header needs to be checked
done < file >> output # end loop and define file to check and add to output
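Since only the header and the row after it need checking, the whole validation could also be done in one awk pass instead of a bash loop. The sketch below is one possible approach, not a definitive answer: the wrapper name `validate_file` is hypothetical, and the type rules are assumptions read off the example files (numeric fields may be integers, decimals, or a lone dot; alpha fields may hold letters, spaces, or `-`, but not a lone dot).

```shell
#!/bin/bash
# Hypothetical wrapper: validate the header count and the first data row
# of a tab-delimited file in a single awk pass.
validate_file() {
    awk -F'\t' '
    BEGIN {
        # Expected type per column, from the key: N = numeric, A = alpha.
        n = split("N N N N A A N A N N", type, " ")
    }
    NR == 1 {
        if (NF != n) {
            printf "file is not validated: %d header fields found, %d expected\n", NF, n
            bad = 1
            exit
        }
        for (i = 1; i <= NF; i++) name[i] = $i
        next
    }
    NR == 2 {
        for (i = 1; i <= n; i++) {
            # Assumption: integers, decimals such as .002, or a lone dot (null).
            if (type[i] == "N" && $i !~ /^([0-9]+\.?[0-9]*|\.[0-9]*)$/) {
                printf "file is not the correct data type for: %s (%s)\n", name[i], $i
                bad = 1
            }
            # Assumption: letters, spaces, or a hyphen, but not a lone dot.
            if (type[i] == "A" && $i !~ /^[A-Za-z -]+$/) {
                printf "file is not the correct data type for: %s (%s)\n", name[i], $i
                bad = 1
            }
        }
        exit
    }
    END {
        if (!bad && NR < 2) {
            print "file is not validated: no data row found"
            bad = 1
        }
        if (!bad) print "file is validated"
        exit (bad ? 1 : 0)
    }' "$1"
}
```

It could be run as `validate_file file >> output`. For the real 58-column data, the string passed to `split` would list 58 type codes in column order.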