AIX to RHEL migration - awk treating 0e[0-9]+ as 0 instead of string issue

chill3chee · May 14, 2017, 1:55am

Greetings Experts,
We are migrating from AIX to RHEL Linux. I have created a script to verify and report the NULLs and SPACEs in the key columns and duplicates on key combination of "|" delimited set of big files. Following is the code that was successfully running in AIX.

awk -F "|" 'BEGIN { OFS="|" ; null_blank_flag="NO"; duplicate_flag="NO"} {
if (NR == 1) # Header record, get the key column names
  {if (substr(FILENAME,length(FILENAME)-6,3) == "abc") { a_key_cols[$1 OFS $2]=$1 "," $2 }
  {if (substr(FILENAME,length(FILENAME)-6,3) == "def") { a_key_cols[$1 OFS $2 OFS $3]=$1 "," $2 "," $3}
  {if (substr(FILENAME,length(FILENAME)-6,3) == "ghi") { a_key_cols[$1 OFS $2 OFS $3 OFS $4]=$1 "," $2 "," $3 "," $4}
}
if (NR >= 2)
{  if (substr(FILENAME,length(FILENAME)-6,3) == "abc") { a[$1 OFS $2]++ }
  {if (substr(FILENAME,length(FILENAME)-6,3) == "def") { a[$1 OFS $2 OFS $3]++ }
  {if (substr(FILENAME,length(FILENAME)-6,3) == "ghi") { a[$1 OFS $2 OFS $3 OFS $4]++}
}
}
END { for (i in a) { n=split(i,arry,OFS);
for (k=1;k<=n;k++) { 
gsub(" ","",arry[k]);
if ( (! arry[k]) && ( arry[k] != "0" )) { null_blank_flag="YES" } }
if ( a >= 2 ) { duplicate_flag="YES" }
print "Filename " FILENAME " null/blank flag: " null_blank_flag " and duplicate flag: " duplicate_flag
} ' file_name_*.txt

I had to use the condition arry[k] != "0" because as AWK is treating 0 as NULL and it may be valid value in the file. Please note that the key values change depending on the file name and the key column has SHA_encrypted account numbers like 02djfdf93ikdkjdfkdf3 and 0e123458939393 etc. Please ignore any syntax issues in script as I am not able to copy/paste the working code as it in another machine and the number of characters in the encrypted account_number

When ran this script in Linux, the encrypted account numbers which follows the format 0e[0-9]+ is being interpreted as scientific notation as 0*10 power of [0-9]+ where as in AIX this is interpreted as string itself and doesn't set the null_blank_flag to yes.

**AIX**
echo "abc|0e123456789|xyz|1234|kdkd|dfs" | awk -F "|" '{ if ( ! $2 ) { print $2 " is empty or NULL " } else { print $2 " is not empty "}}'
output: 0e123456789 is not empty

**RHEL LINUX**
echo "abc|0e123456789|xyz|1234|kdkd|dfs" | awk -F "|" '{ if ( ! $2 ) { print $2 " is empty or NULL " } else { print $2 " is not empty "}}'
output: 0e123456789 is empty or NULL

grep may be a choice and I think I need to read the file twice to check for the null/blank and duplicates on the keys and hence resorted to awk.

As a[$1 OFS $2 OFS $3 OFS $4]++ array index is string for most of the data in the file, I had expected the array index that was built will also be string $1 OFS $2 OFS $3 OFS $4 . Will it change from string to integers even though the first record file entry after header is string. Will the strings convert to integer during the split n=split(i,arry,OFS); ...; arry[k] The interesting part is the output from the Linux version is also 0e123456789 which confirms that $2 is 0e123456789 . Now where is the issue occurring which transforms $2 value to 0

Can you please explain how to make awk/shell to interpret 0e[0-9]+ as string instead of number. Thank you for your time.

RudiC · May 14, 2017, 2:20am

Try a string conversion:

echo "abc|0e123456789|xyz|1234|kdkd|dfs" | awk -F "|" '{ if ( ! ($2 "")) { print $2 " is empty or NULL " } else { print $2 " is not empty "}}'
0e123456789 is not empty

But - no integer conversion is done for sheer assignments or usage in e.g. constucting an array index .

MadeInGermany · May 14, 2017, 3:15am

Comparing with a string should work as well.

echo "abc|0e123456789|xyz|1234|kdkd|dfs" | awk -F "|" '{ if ($2=="") { print $2 " is empty or NULL " } else { print $2 " is not empty "}}'

chill3chee · May 14, 2017, 3:18am

Thank you RudiC and MadeInGermany. That works. Just wanted to know that during reading $2 is 0e123456789 but why does awk interpret $2 to 0 i.e., process $2 as scientific notation during ! $2 in earlier code. Can you please explain this behavior. Is this again specific to Linux as this was executing without issues in AIX. Thank you for your time.

RudiC · May 14, 2017, 3:26am

As I said - usage alone, be it for assignment or reference in an index, takes the value as is. Only evaluation, e.g. for a boolean expression or a numerical computation, converts the string to a number, using the starting characters up to a non-convertible one.

MadeInGermany · May 14, 2017, 5:17am

awk has automatic type conversion.
The ! operator is a boolean that is rather a number (being 1 or 0) than a string, so there is a decent hint to treat the variable as a number.
In border cases there can be differences between awk versions. E.g. the awk in AIX is a derivate from nawk, where awk in Linux is mostly gawk, but some Linux have mawk that is again little different.
The bottom line is, make the data type clear, e.g. typical casts are
var "" append a null string, result is a string
var+0 add a zero, result is a number