KSH Expression

tmarikle · September 28, 2005, 3:02pm

Quick question related to KSH expressions (not unix regular expressions).

I am trying to craft a pattern that will correctly identify lines that match the following CSV text in a case statement:

filename.txt, filename.txt, alpha, nnnn, nnnn, nnnn, Free form text

Originally I simply used an expression like *,*,*,*,*,*,* in the following case statement:

case ${LINE} in
    # Expression 1..n are informational and specific enough that the
    # expressions work well
    expression 1..n)
             ... match expressions 1..n logic ... ;;

    # CSV lines contain 7 fields and 6 commas
    *,*,*,*,*,*,*)
         ... match valid CSV line logic ... ;;

    # Malformed CSV lines or any other not matching my list of expressions
    *)
         ... malformed CSV line or other mismatch ... ;;
esac

Problem:
I found that the *,*,*,*,*,*,* CSV expression matches cases such as these:

field1, field2, field3, field4, field5, field6, field7, field8, field9
field1, field2, field3, field4, field5, field6
field1, field2, field3, field4, field5, field6, field7,,,,,,,
,field1, field2, field3, field4, field5, field6, field7
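
The over-matching follows from * matching any string, commas included, so *,*,*,*,*,*,* only asserts "at least six commas somewhere in the line". A minimal sketch to check this against the lines listed above; loose_match is a hypothetical helper name, not part of the original script. By the same logic, the six-field line has only five commas, so it should fall through to the default branch rather than match.

```shell
# loose_match is a hypothetical helper: it prints "match" when the line
# fits the loose seven-field pattern and "no match" otherwise.
loose_match() {
    case $1 in
    *,*,*,*,*,*,*) echo "match" ;;     # any line with six or more commas
    *)             echo "no match" ;;
    esac
}

# Nine fields, trailing commas, and a leading comma all have >= 6 commas
loose_match "field1, field2, field3, field4, field5, field6, field7, field8, field9"
loose_match "field1, field2, field3, field4, field5, field6, field7,,,,,,,"
loose_match ",field1, field2, field3, field4, field5, field6, field7"
# Six fields means only five commas
loose_match "field1, field2, field3, field4, field5, field6"
```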

I have tried numerous variations and have ended up with this expression:

case ...
...
    @(*)@(,)@(*) ) ...
...
esac

This matches more precisely and handles the smallest CSV list, "text, text", but I would still have to add comma-counting logic, which I want to avoid.
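
A small sketch of why wrapping each piece in @(...) may not tighten anything: @(pattern) matches exactly one occurrence of pattern, so @(*) is still "any string" and @(*)@(,)@(*) should behave like plain *,*. The helper name compare_patterns is hypothetical, and the first line is only there so the same snippet also runs under bash, which is an assumption about the reader's setup rather than something ksh needs.

```shell
# ksh understands @(...) natively; bash needs extglob switched on first.
if [ -n "${BASH_VERSION:-}" ]; then shopt -s extglob; fi

# compare_patterns is a hypothetical helper: it reports whether the
# extended pattern and the plain pattern accept the given line.
compare_patterns() {
    case $1 in
    @(*)@(,)@(*)) extended=yes ;;
    *)            extended=no ;;
    esac
    case $1 in
    *,*) plain=yes ;;
    *)   plain=no ;;
    esac
    echo "extended=$extended plain=$plain"
}

compare_patterns "text, text"
compare_patterns "a,b,c,d"
compare_patterns "no commas here"
```

Both patterns accept any line containing at least one comma, which is why the comma count still has to be enforced some other way.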

The commas and asterisks complicate every expression I have tried, essentially because * also matches commas. Where I work, production code is very hard to change once implemented, so I'd like to nail down a very precise expression now and let the final *) expression trap all malformed lines. What am I doing wrong?

By the way, I have no control over the data file provided to me, so changing the data source is not an option.

Perderabo · September 28, 2005, 3:14pm

Post some sample valid data.

tmarikle · September 28, 2005, 3:20pm

file1.txt, file1_original_name.txt, control1, 1001, 100001, 10000, Data Sample 1
file2.txt, file2_original_name.txt, control5, 2001, 100002, 10000, Data Sample 2
file3.txt, file3_original_name.txt, control7, 3001, 100003, 20000, Data Sample 3

Perderabo · September 28, 2005, 3:36pm

Sheesh....

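#! /usr/bin/ksh

# Read the sample lines from the file "data"
exec < data
while read -r line ; do
        case $line in
        # name.txt, name.txt, alphanumeric, number, number, number, anything
        +([_a-zA-Z0-9]).txt,+( )+([_a-zA-Z0-9]).txt,+( )+([a-zA-Z0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([0-9]),* ) echo "OK: $line"
                ;;
        *)   echo "XX: $line"
                ;;
        esac
done
exit 0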
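
The script leans on the ksh construct +(pattern), which matches one or more occurrences of pattern; unlike in a regular expression, the dot in .txt is a literal dot. A minimal sketch of the two building blocks in isolation, using hypothetical helper names is_digits and is_txt_name; the first line is an assumption that lets the snippet also run under bash.

```shell
# ksh understands +(...) natively; bash needs extglob switched on first.
if [ -n "${BASH_VERSION:-}" ]; then shopt -s extglob; fi

# is_digits: hypothetical helper, "yes" only for one or more digits
is_digits() {
    case $1 in
    +([0-9])) echo yes ;;
    *)        echo no ;;
    esac
}

# is_txt_name: hypothetical helper, "yes" for letters, digits, and
# underscores followed by a literal ".txt"
is_txt_name() {
    case $1 in
    +([_a-zA-Z0-9]).txt) echo yes ;;
    *)                   echo no ;;
    esac
}

is_digits 1001                        # digits only
is_digits 10a1                        # contains a letter
is_digits ""                          # empty: +(...) needs at least one
is_txt_name file1_original_name.txt   # allowed characters plus .txt
is_txt_name file-1.txt                # hyphen is not in the class
is_txt_name .txt                      # no name before the extension
```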

tmarikle · September 28, 2005, 4:27pm

Excellent. This gets me very close.

Lines such as this still get through due to the final "*":

file1.txt, file1_original_name.txt, control1, 1001, 100001, 10000, Data Sample 1,,,,,

Altering the expression helps narrow it down:

+([_a-zA-Z0-9]).txt,+( )+([_a-zA-Z0-9]).txt,+( )+([a-zA-Z0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([ a-zA-Z0-9])

and I think this allows for anything but a comma in the final field:

+([_a-zA-Z0-9]).txt,+( )+([_a-zA-Z0-9]).txt,+( )+([a-zA-Z0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([!,])
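
A quick way to try the last expression is to wrap it in a function and feed it good and bad lines. This is only a sketch: check_line is a hypothetical helper name, the OK:/XX: prefixes are borrowed from the script earlier in the thread, and the first line is an assumption that lets the snippet also run under bash.

```shell
# ksh understands +(...) natively; bash needs extglob switched on first.
if [ -n "${BASH_VERSION:-}" ]; then shopt -s extglob; fi

# check_line is a hypothetical helper: prints "OK: <line>" when the line
# fits the strict seven-field pattern and "XX: <line>" otherwise.
check_line() {
    case $1 in
    +([_a-zA-Z0-9]).txt,+( )+([_a-zA-Z0-9]).txt,+( )+([a-zA-Z0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([0-9]),+( )+([!,]))
        echo "OK: $1" ;;
    *)
        echo "XX: $1" ;;
    esac
}

# Valid sample line: seven fields, free-form text without commas
check_line "file1.txt, file1_original_name.txt, control1, 1001, 100001, 10000, Data Sample 1"
# Trailing commas: the final +([!,]) cannot consume them
check_line "file1.txt, file1_original_name.txt, control1, 1001, 100001, 10000, Data Sample 1,,,,,"
# Only six fields: the last ",+( )+([!,])" has nothing to match
check_line "file1.txt, file1_original_name.txt, control1, 1001, 100001, 10000"
```

Because [!,] accepts every character except a comma, the final field can hold arbitrary free-form text, while any extra comma forces the line into the *) branch.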

I need to tweak the filenames a bit but I believe I have the basis for nailing down a very precise expression.

Thanks for the help!

Thomas