Bash to verify and validate file header and data type

cmccabe · April 16, 2017, 9:54am

The bash below is a file validation check that verifies the correct header count of 10 and the correct data type in each field of a tab-delimited file. The key holds the data type of each field. My real data has 58 headers, but only the header and the next row need to be checked. The files below are examples that cover all possible data types; that is, the data type of each line after the header is the same as that of the line above it. Every field will have some sort of data in it: a numeric value, an alpha character, or a . (dot) for a null value. If the file is validated, a message indicating this is written to the output; otherwise the missing header or bad data type is written to the output.
I'm not sure if the below is the best way to do this, but hopefully it is close. Each line is commented with what I think is happening. Thank you :).

There are 3 example files that represent the only possibilities.

file1  --- is a good file, validated for both header and data type in all fields
file2  --- is a bad file, not validated: the header line is good, but the data type expected in QUAL is alpha and it is a . (dot) in the first data row
file3  --- is a bad file, not validated: the header line is not good (10 columns are expected and Score is missing), though the data types are as expected

key

Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input    ---- defined 10 column headers ----
Integar     Integar    Integar    Integar    Alpha    Alpha    Integar    Alpha    Integar    Integar   --- data type of each line after header  ----

file1

Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2

Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input
1    1    1    100    C    -    1    .    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file3

Index    Chr    Start    End    Ref    Alt    Freq    Qual    Input
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

#!/bin/bash
# call bash script
awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}' file >> output  # detect header row in file and store in output
   if [[ $NF -eq 1 ]]; then   # display results
      echo "file has expected number of fields"   # file is validated for headers
    else
      echo "file is missing header for:"  # missing header field ...in file not-validated
      echo "$NF"
    fi  # close if.... else    
    
isnumeric()   # numeric function
{   # start block
    result=$(echo "$1" | tr -d '[[:digit:]]')  # check each field in file for numeric and store result
    echo ${#result}   # display result
}  # end block

isalpha()   # character function
{  # start block
    result=$(echo "$1" | tr -d '[[:alpha:]]')  # check each field in file for character and store result
    echo ${#result}   # display result
}  # end block
col1=""   # define col to search
col2=""   # define col to search
col3=""   # define col to search
col4=""   # define col to search
col5=""   # define col to search
col6=""   # define col to search
col7=""   # define col to search
col8=""   # define col to search
col9=""    # define col to search
col10=""  # define col to search
let retval=1  # data to check in this row

while read record  # start loop to read each column in file
do
    echo "$record" | awk -F'\t' '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 }' | read col1 col2 col3 col4 col5 col6 col7 col8 col col10  # store in col name in record
    
    # check  if numeric in col
    if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if.... else
    
    # check if alpha in col
    if [[ $(isalpha "$col5") -eq 1 && $(isalpha "$col6") -eq 1 && $(isalpha "$col8") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if....else
    
    if [[ $retval -eq 1 ]]; then   # display results
      echo "file is correct data type in each field"   # file is validated
    else
      echo "file is not the correct data type for:"  # columns ... in file not validated
      echo "$col1 $col2 $col3 $col4 $col5 $col6 $col7 $col8 $col9 $col10"
    fi  # close if.... else    
    
    if [[ NF == 10 && $retval -eq 1 ]]; then   # execute and display file validated
      echo "file is validated"
    else
      echo "file is not validated"
    fi
done  < file >> output  # end loop and define file to check and add to output

rovf · April 17, 2017, 3:13am

if [[ NF == 10 && $retval -eq 1 ]]

will always evaluate to false, because the constant string NF is not equal to the constant string 10.
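
To illustrate the point, here is a minimal sketch. The function names and the idea of passing the count in as an argument are made up for the example; how the count is obtained is a separate matter. Inside [[ ]], a bare NF is just the two-character string "NF", so the comparison can only succeed once the count is held in a shell variable.

```bash
#!/bin/bash

# Compare the literal word NF with 10: always false, as described above.
literal_test() {
    if [[ NF == 10 ]]; then echo "true"; else echo "false"; fi
}

# Compare a shell variable holding the count with 10.
# $1 is a hypothetical field count obtained elsewhere in the script.
variable_test() {
    local nf=$1
    if [[ $nf -eq 10 ]]; then echo "true"; else echo "false"; fi
}

literal_test       # prints: false
variable_test 10   # prints: true
variable_test 9    # prints: false
```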

drl · April 17, 2017, 7:49am

Hi.

After fixing the syntax error in:

isnumeric()   3 numeric function

shellcheck then provided messages (see end of long line):

In z2 line 38:
    if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 &&]]; then
    ^-- SC1009: The mentioned parser error was in this if expression.
       ^-- SC1073: Couldn't parse this test expression.
                                                                                                                                                                                                                             ^-- SC1072: Unexpected keyword/token. Fix any mentioned problems and try again.

Details for shellcheck:

shellcheck      analyse shell scripts (man)
Path    : /usr/bin/shellcheck
Version : ShellCheck - shell script analysis tool
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h
Repo    : Debian 8.7 (jessie) 
Home    : http://hackage.haskell.org/package/ShellCheck

Best wishes ... cheers, drl

cmccabe · April 17, 2017, 8:17am

I fixed the syntax errors and the script now executes, but I get the result below. I also updated the post with the changes.

displayed in terminal:

file is missing header for:

then the script ends. Thank you :).

output created in directory

1 fields detected in file and they are:
Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input

drl · April 17, 2017, 8:45am

Hi.

Aside from the fact that the script has many lines, some of the lines are long. I tend to look at code if it fits within the width of a page, without me needing to scroll horizontally. Shells are very good at being able to continue pipelines. Other code can have lines terminated with \ to escape the newline. I think that aids comprehension and maintainability.
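
As a rough illustration of the two continuation styles (the sample data and function names are made up for the example): a pipeline can break after each | with no backslash, and other long commands can be continued with a trailing \.

```bash
#!/bin/bash

# A pipeline may be split after each | without any backslash.
count_header_fields() {
    printf 'Index\tChr\tStart\n1\t1\t1\n' |
        head -n 1 |
        awk -F'\t' '{ print NF }'
}

# Other long commands can be continued with a trailing backslash.
join_words() {
    printf '%s %s %s\n' \
        "one" \
        "two" \
        "three"
}

count_header_fields   # prints: 3
join_words            # prints: one two three
```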

So without looking at your script in any detail, the next thing I would try is placing set -x in the script. You could also place intermediate printf/echo statements at crucial spots in your script. I use functions to turn on/off debugging output.

You could place set -x at the beginning to see everything. You could place it near the middle of the code, and then bisect the placement depending on whether you see something wrong or not.

Keep in mind that there could be more than one error.
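
A minimal sketch of the kind of debugging switch described above. The DEBUG variable and the function names are hypothetical, not something from the original script: debug prints to stderr only when the switch is on, and set -x is limited to one suspect region so the trace stays readable.

```bash
#!/bin/bash

DEBUG=${DEBUG:-0}   # hypothetical switch: run as DEBUG=1 ./script to enable

# Print a message to stderr only when debugging is switched on.
debug() {
    if [[ $DEBUG -eq 1 ]]; then
        printf 'DEBUG: %s\n' "$*" >&2
    fi
    return 0
}

# Trace only a suspect region instead of the whole script.
check_region() {
    local value=$1
    if [[ $DEBUG -eq 1 ]]; then
        set -x                      # start tracing here
    fi
    local doubled=$(( value * 2 ))
    { set +x; } 2>/dev/null         # stop tracing quietly
    debug "doubled=$doubled"
    echo "$doubled"
}

check_region 21   # prints: 42
```

Moving the set -x / set +x pair up or down the script is one way to do the bisection.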

Best wishes ... cheers, drl

Don_Cragun · April 17, 2017, 7:22pm

You also have a one line awk script followed by an if statement that is using awk variables instead of bash variables. Since NF has not been defined in your bash code, the test in that if statement will also always evaluate to false.
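
One possible way to hand the count from awk to bash, as a sketch only: have awk print NF and capture it with command substitution. The function name, the nf and expected variables, and the temporary sample file are assumptions made for the example.

```bash
#!/bin/bash

# Print the number of tab-separated fields on the first line of a file.
# awk's NF only exists inside awk, so it is printed and captured by the shell.
header_field_count() {
    awk -F'\t' 'NR == 1 { print NF; exit }' "$1"
}

# Hypothetical sample file standing in for the real input.
sample=$(mktemp)
printf 'Index\tChr\tStart\tEnd\tRef\tAlt\tFreq\tQual\tScore\tInput\n' > "$sample"
printf '1\t1\t1\t100\tC\t-\t1\tGOOD\t10\t.\n' >> "$sample"

nf=$(header_field_count "$sample")   # now a bash variable
expected=10

if [[ $nf -eq $expected ]]; then
    echo "file has expected number of fields"
else
    echo "file has $nf fields, expected $expected"
fi

rm -f "$sample"
```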

cmccabe · April 17, 2017, 8:32pm

The first portion of the bash (the for loop around the awk script) verifies the headers in each text file in dir and creates 2 out files, one for each file. That seems to be working perfectly.

The second portion of the bash is to test and verify each data type. The script executes but the data type in each field is not verified, only the headers are verified.

The key, which is also tab-delimited, has the defined headers and the data type of each field.

Only the header line and the line under it need to be verified, as all files in the dir will have the same format. Thank you :).

file1 tab-delimited

Index   Chr Start   End Ref Alt Freq    Qual    Score   Input   ---- this file is verified with 10 headers and the data type in each field is good
1    1    1    100    C    -    1    GOOD    10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

file2 tab-delimited

Index   Chr Start   End Ref Alt Freq    Qual    Score    Input --- this file is verified with 10 headers but not verified for data type, as the . in QUAL on the first data row should be "GOOD" or alpha
1    1    1    100    C    -    1    .   10    .
2    2    20    200    A    C    .002    STRAND BIAS    2    .
3    2    270    400    -    GG    .036    GOOD    6    .

key

Index    Chr    Start    End    Ref    Alt    Freq    Qual    Score    Input    ---- defined 10 column headers ----
Integar     Integar    Integar    Integar    Alpha    Alpha    Integar    Alpha    Integar    Integar   --- data type of each line after header  ----

the ---- are not part of each file, they are only there to help in the description

Bash

#!/bin/bash

dir="/home/cmccabe/bash"   # directory to search for files
for f in "$dir"/*.txt; do   # start for loop
bname=$(basename "$f")    # strip off path
pref=${bname%%.txt}    # strip the .txt extension for the output name
awk '
FNR==NR {  # process all columns and rows in file
    for(n=1;n<=NF;n++)   # iterate through  each
        a[$n]  # store inarray n
    nextfile   # next file
}
NF==(n-1) {  # define NF
    print FILENAME " file has expected number of fields"   # Good file
    nextfile   # next file
}
{
    for(i=1;i<=NF;i++)  # iterate through headers
        b[$i]   # header lines
    print FILENAME " is missing header for: "   # Bad file
    for(i in a)   # read headers into i
    if(i in b==0)  # if can not find header in key
        print i    # print missing header
    nextfile  
}' /home/cmccabe/bash/key $f > /home/cmccabe/bash/${pref}_out # use key as headers to look for in files and create out for each
done

isnumeric()   # numeric function
{   # start block
    result=$(echo "$1" | tr -d '[[:digit:]]')  # check each field in file for numeric and store result
    echo ${#result}   # display result
}  # end block

isalpha()   # character function
{  # start block
    result=$(echo "$1" | tr -d '[[:alpha:]]')  # check each field in file for character and store result
    echo ${#result}   # display result
}  # end block
col1=""   # define col to search
col2=""   # define col to search
col3=""   # define col to search
col4=""   # define col to search
col5=""   # define col to search
col6=""   # define col to search
col7=""   # define col to search
col8=""   # define col to search
col9=""    # define col to search
col10=""  # define col to search
let retval=1  # data to check in this row

while read record  # start loop to read each column in file
do
    echo "$record" | awk -F'\t' '{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10 }' | read col1 col2 col3 col4 col5 col6 col7 col8 col col10  # store in col name in record
    
    # check  if numeric in col
    if [[ $(isnumeric "$col1") -eq 1 && $(isnumeric "$col2") -eq 1 && $(isnumeric "$col3") -eq 1 && $(isnumeric "$col4") -eq 1 && $(isnumeric "$col7") -eq 1 && $(isnumeric "$col9") -eq 1 && $(isnumeric "$col10") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if.... else
    
    # check if alpha in col
    if [[ $(isalpha "$col5") -eq 1 && $(isalpha "$col6") -eq 1 && $(isalpha "$col8") -eq 1 ]]; then
         retval=1  # check data in this row
    else
         retval=0  # go back to header row
         break
    fi  # close if....else
    
    if [[ $retval -eq 1 ]]; then   # display results
      echo "file is correct data type in each field"   # file is validated
    else
      echo "file is not the correct data type for:"  # columns ... in file not validated
      echo "$col1 $col2 $col3 $col4 $col5 $col6 $col7 $col8 $col9 $col10"
    fi  # close if.... else    
    
    if [[ NF == 10 && $retval -eq 1 ]]; then   # execute and display file validated
      echo "$f is validated"
    else
      echo "$f is not validated"
    fi
done  < $f >> /home/cmccabe/bash/${pref}_out  # end loop and define file to check and add to output

desired out ---- one for each file

/home/cmccabe/bash/file1.txt file has expected number of fields
/home/cmccabe/bash/file1.txt is validated
/home/cmccabe/bash/file1.txt is correct data type in each field

/home/cmccabe/bash/file2.txt has the expected number of fields
/home/cmccabe/bash/file2.txt is not the correct data type for: QUAL
/home/cmccabe/bash/file2.txt is not validated
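
A minimal sketch of how the data-type part of the desired output might be produced, not a definitive implementation. The type rules are assumptions that the thread does not confirm: "numeric" is taken to mean digits with an optional decimal point or a lone . for null, and "alpha" to mean letters, spaces or -, but not a lone dot. The function names, the types and names arrays (copied by hand from the key) and the temporary file are also made up for the example.

```bash
#!/bin/bash

# Assumed rules (not confirmed by the thread):
#   numeric: digits with an optional decimal point, or a lone "." for null
#   alpha:   letters, spaces or "-", but not a lone "."
re_num='^([0-9]+(\.[0-9]*)?|\.[0-9]+|\.)$'
re_alpha='^[[:alpha:] -]+$'
is_numeric() { [[ $1 =~ $re_num ]]; }
is_alpha()   { [[ $1 =~ $re_alpha ]]; }

# Types for the 10 columns, copied from the key: N = numeric, A = alpha.
types=(N N N N A A N A N N)
names=(Index Chr Start End Ref Alt Freq Qual Score Input)

# Check the first data row (line 2) of a tab-delimited file.
# Prints the names of columns with the wrong type and returns 1 if any.
# Assumes no field is empty, since read collapses consecutive tabs.
check_first_row() {
    local file=$1 row i
    local -a cols=() bad=()
    row=$(sed -n '2p' "$file")
    IFS=$'\t' read -r -a cols <<< "$row"
    for i in "${!types[@]}"; do
        case ${types[i]} in
            N) is_numeric "${cols[i]}" || bad+=("${names[i]}") ;;
            A) is_alpha   "${cols[i]}" || bad+=("${names[i]}") ;;
        esac
    done
    if (( ${#bad[@]} )); then
        echo "$file is not the correct data type for: ${bad[*]}"
        return 1
    fi
    echo "$file is correct data type in each field"
}

# Hypothetical sample mirroring file2 (a . in Qual on the first data row).
tmp=$(mktemp)
printf 'Index\tChr\tStart\tEnd\tRef\tAlt\tFreq\tQual\tScore\tInput\n' > "$tmp"
printf '1\t1\t1\t100\tC\t-\t1\t.\t10\t.\n' >> "$tmp"
check_first_row "$tmp" || echo "$tmp is not validated"
rm -f "$tmp"
```

The row is read with a here-string rather than a pipe. In bash, piping into read normally runs read in a subshell, so the col variables would be empty once the pipeline finishes. To cover every file, the call would need to sit inside the for loop, while $f and $pref still refer to the current file.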