Find records with specific characters in 2nd field

ashwin3086 · August 30, 2018, 1:16pm

Hi,
I have a requirement to read a file (5 fields, ~ delimited) and find the records whose 2nd field contains anything other than letters, digits, comma, space and dot, i.e. anything outside a-z, A-Z, 0-9, ".", " " and ",". Once I do that, I want the result to have the form field1|<flag>

flag can be Y or N.

N - if the 2nd field has nothing other than the above-mentioned characters.
Else Y.

I am able to achieve this using the code below, which reads the file line by line. Please note the second field is "address".

#!/bin/ksh
rm -f ca_sc_flag.txt
while read rec
do
    cust_id=`echo $rec | cut -d'~' -f1`
    addr=`echo $rec | cut -d'~' -f2`

    addr_rem=`echo ${addr}|tr -d 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789,. '`

    if [ -z "${addr_rem}" ]; then
        sc='N'
        echo "$cust_id|$sc" >> ca_sc_flag.txt
    else
        sc='Y'
        echo "$cust_id|$sc" >> ca_sc_flag.txt
    fi
done < ca.txt

The issue is that it is very inefficient and takes almost 30 minutes for 100K records. Can I improve it by using better logic, maybe by avoiding reading line by line?
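
Much of that time likely goes to process startup rather than to the character check itself: each record runs cut twice and tr once, so 100K records mean roughly 300K external command launches. A minimal sketch, assuming a POSIX-compatible shell (ksh or bash), that keeps the same Y/N logic but uses only shell built-ins. The function name flag_records is made up for illustration:

```shell
# Hypothetical helper: reads ~-delimited records on stdin, prints cust_id|flag.
flag_records() {
    # IFS='~' makes read split on the delimiter; fields 3-5 land in "rest".
    while IFS='~' read -r cust_id addr rest
    do
        case $addr in
            # Any character outside letters, digits, comma, dot, space.
            *[!A-Za-z0-9,.\ ]*) sc='Y' ;;
            *)                  sc='N' ;;
        esac
        printf '%s|%s\n' "$cust_id" "$sc"
    done
}

# Intended usage (one redirect for the whole loop, not one per record):
#   flag_records < ca.txt > ca_sc_flag.txt
```

The case pattern does the job of the tr call: it matches as soon as the address contains one character outside the allowed set, and an empty address falls through to N just as in the script above.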

RudiC · August 30, 2018, 5:47pm

Don't use shell loops to process text files that large. Try text tools like perl, awk, or others. awk example (untested, as I am on a (yecc) Windows laptop):

awk -F"~" '
{print $1 "|" ($2 ~ /[^A-Za-z0-9., ]/)?"Y":"N"
}
' ca.txt

ashwin3086 · August 31, 2018, 4:13pm

Thanks for the input Rudy.
It did point me in the right direction. Not sure why the ?"Y":"N" part didn't work. I changed the code to use an if/else statement and it worked.

awk -F"|" '{ if ($2 ~ /[^0-9a-zA-Z,. ]/) print $0,"|Has other chars";else print $0,"|Only Valid Chars"}' test.txt
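
A likely reason the ?: version misbehaved is operator precedence: in awk, string concatenation binds more tightly than the conditional operator, so the expression is read as ($1 "|" (match result)) ? "Y" : "N". The concatenated string is never empty, so the condition is always true and only Y is printed. A sketch with the conditional wrapped in its own parentheses, assuming the ~ delimiter and the ca.txt file from the original question (the function name flag_records_awk is hypothetical):

```shell
# Hypothetical wrapper: prints cust_id|flag for each ~-delimited record.
# Reads the files given as arguments, or stdin if none are given.
flag_records_awk() {
    awk -F'~' '
    {
        # Parenthesize the whole conditional so it is evaluated
        # before being concatenated onto $1 "|".
        print $1 "|" (($2 ~ /[^A-Za-z0-9., ]/) ? "Y" : "N")
    }' "$@"
}

# Intended usage:
#   flag_records_awk ca.txt > ca_sc_flag.txt
```

Note also that print $0,"|..." with a comma inserts the output field separator (a space by default) before the |, whereas plain juxtaposition, as used here, does not.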