Greetings Experts,
We are migrating from AIX to RHEL Linux. I have created a script to verify and report the NULLs and SPACEs in the key columns and duplicates on key combination of "|" delimited set of big files. Following is the code that was successfully running in AIX.
awk -F "|" 'BEGIN { OFS="|" ; null_blank_flag="NO"; duplicate_flag="NO"} {
if (NR == 1) # Header record, get the key column names
{if (substr(FILENAME,length(FILENAME)-6,3) == "abc") { a_key_cols[$1 OFS $2]=$1 "," $2 }
{if (substr(FILENAME,length(FILENAME)-6,3) == "def") { a_key_cols[$1 OFS $2 OFS $3]=$1 "," $2 "," $3}
{if (substr(FILENAME,length(FILENAME)-6,3) == "ghi") { a_key_cols[$1 OFS $2 OFS $3 OFS $4]=$1 "," $2 "," $3 "," $4}
}
if (NR >= 2)
{ if (substr(FILENAME,length(FILENAME)-6,3) == "abc") { a[$1 OFS $2]++ }
{if (substr(FILENAME,length(FILENAME)-6,3) == "def") { a[$1 OFS $2 OFS $3]++ }
{if (substr(FILENAME,length(FILENAME)-6,3) == "ghi") { a[$1 OFS $2 OFS $3 OFS $4]++}
}
}
END { for (i in a) { n=split(i,arry,OFS);
for (k=1;k<=n;k++) {
gsub(" ","",arry[k]);
if ( (! arry[k]) && ( arry[k] != "0" )) { null_blank_flag="YES" } }
if ( a >= 2 ) { duplicate_flag="YES" }
print "Filename " FILENAME " null/blank flag: " null_blank_flag " and duplicate flag: " duplicate_flag
} ' file_name_*.txt
I had to use the condition arry[k] != "0"
because as AWK is treating 0 as NULL and it may be valid value in the file. Please note that the key values change depending on the file name and the key column has SHA_encrypted account numbers like 02djfdf93ikdkjdfkdf3
and 0e123458939393
etc. Please ignore any syntax issues in script as I am not able to copy/paste the working code as it in another machine and the number of characters in the encrypted account_number
When ran this script in Linux, the encrypted account numbers which follows the format 0e[0-9]+
is being interpreted as scientific notation as 0*10 power of [0-9]+
where as in AIX this is interpreted as string itself and doesn't set the null_blank_flag to yes.
**AIX**
echo "abc|0e123456789|xyz|1234|kdkd|dfs" | awk -F "|" '{ if ( ! $2 ) { print $2 " is empty or NULL " } else { print $2 " is not empty "}}'
output: 0e123456789 is not empty
**RHEL LINUX**
echo "abc|0e123456789|xyz|1234|kdkd|dfs" | awk -F "|" '{ if ( ! $2 ) { print $2 " is empty or NULL " } else { print $2 " is not empty "}}'
output: 0e123456789 is empty or NULL
grep
may be a choice and I think I need to read the file twice to check for the null/blank and duplicates on the keys and hence resorted to awk.
As a[$1 OFS $2 OFS $3 OFS $4]++
array index is string for most of the data in the file, I had expected the array index that was built will also be string $1 OFS $2 OFS $3 OFS $4
. Will it change from string to integers even though the first record file entry after header is string. Will the strings convert to integer during the split n=split(i,arry,OFS); ...; arry[k]
The interesting part is the output from the Linux version is also 0e123456789 which confirms that $2 is 0e123456789
. Now where is the issue occurring which transforms $2 value to 0
Can you please explain how to make awk/shell to interpret 0e[0-9]+
as string instead of number. Thank you for your time.