Grep and substitute?

alan · July 18, 2014, 3:26pm

I have to parse ASCII files, output the relevant data to a comma-delimited file and load it into a database table.

The specs for the file format have been recently updated and one section is causing problems. This is the original layout for that section.

    CSVHeaderAttr:PUIS,IdleImmediate,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NO,9814,31,56

I parse it with `grep` like this:

    CSVAttributes=$(grep '^CSVValuesAttr:' "${filename}" | cut -d':' -f2)
    [ -z "$CSVAttributes" ] && CSVAttributes="NA"

It works great, but the section now has new fields and some of them are named differently:

    CSVHeaderAttr:PUIS,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,YES,YES,23861,31,51

Right now, I grep the files according to their layout (a field in the header tells me the layout version), write two different comma-delimited files, and load them into two different tables. I would like to output both layouts to the same file so the data scientist has only one table to use in his analysis.

Is there a way to use grep to produce an output like this and substitute empty fields with NA?

For one file type:

    CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NO,NA,NA,9814,31,56

For the other file type:

    CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NA,YES,YES,23861,31,51

Thanks for your input.

Corona688 · July 18, 2014, 4:05pm

When the question is "grep and", awk is usually the answer. It has a language statement especially built for 'if this line matches this regex, do x'.
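As a minimal sketch of that pattern-action form, the `grep | cut` line from the question could be written roughly as below. The function name `csv_values` and the temporary sample file are made up for the example; the `NA` fallback is kept in the shell, as in the original snippet.

```shell
# Hypothetical helper: print everything after "CSVValuesAttr:" in a file.
# The awk program is one pattern { action } pair: the action only runs
# on lines that match the regex.
csv_values() {
    awk -F: '/^CSVValuesAttr:/ { print $2 }' "$1"
}

# Sample input, made up for this example (old layout from the question).
sample=$(mktemp)
printf '%s\n' \
    'CSVHeaderAttr:PUIS,IdleImmediate,POH,Temp,WorstTemp' \
    'CSVValuesAttr:NO,NO,9814,31,56' > "$sample"

CSVAttributes=$(csv_values "$sample")
# Same fallback as the original: use NA when nothing was found.
[ -n "$CSVAttributes" ] || CSVAttributes="NA"
echo "$CSVAttributes"

rm -f "$sample"
```

This only replaces the extraction step; it does not yet reorder or pad the fields.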

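This program uses the header line of the input file to work out which value belongs to which column, then prints the values in the order of the header you pass in `OUT=`, writing NA for any column the file does not have. It assumes the header is the first line of the file. This lets you use the exact same program on either format.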

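    $ awk '
    # Line 1 (header): remember each column name by position.
    (NR==1) { for(N=1; N<=NF; N++){ F[$N]=N ; F[N]=$N } ; next }
    # Other lines (values): store each value under its column name.
    { for(N=1; N<=NF; N++) D[F[N]]=$N ; next }
    END {
            print OUT;
            split(OUT, OF);
            # Rebuild the last record in OUT order; absent columns become NA.
            for(N=1; N in OF; N++)
                    if(OF[N] in D) $N=D[OF[N]]; else $N="NA";
            print;
    }' FS="," OFS="," OUT="CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp" oldformat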

    CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NO,NA,NA,9814,31,56

    $ awk '(NR==1) { for(N=1; N<=NF; N++){ F[$N]=N ; F[N]=$N } ; next }
    { for(N=1; N<=NF; N++) D[F[N]]=$N ; next }
    END {
            print OUT;
            split(OUT, OF);
            for(N=1; N in OF; N++)
            {
                    $N="NA";
                    if(OF[N] in D) $N=D[OF[N]];
            }
            print;
    }' FS="," OFS="," OUT="CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp" newformat

    CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NA,YES,YES,23861,31,51

$
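If the real files contain other lines around this section, as the `grep ^CSVValuesAttr:` in the question suggests, the same header-driven idea can be keyed on the line prefixes instead of on line 1. The following is only a sketch under that assumption; the function name `normalize_attrs`, the variable names, and the sample files are invented for illustration.

```shell
# Hypothetical wrapper: print one file's values in a fixed column order.
# Usage: normalize_attrs "Col1,Col2,..." file
normalize_attrs() {
    awk -F'[:,]' -v want="$1" '
        # Header line: remember each column name by position ($1 is the tag).
        /^CSVHeaderAttr:/ { for (i = 2; i <= NF; i++) name[i] = $i; next }
        # Values line: store each value under its column name.
        /^CSVValuesAttr:/ { for (i = 2; i <= NF; i++) val[name[i]] = $i; seen = 1; next }
        END {
            if (!seen) { print "NA"; exit }      # section missing entirely
            n = split(want, col, ",")
            for (i = 1; i <= n; i++)
                out = out (i > 1 ? "," : "") ((col[i] in val) ? val[col[i]] : "NA")
            print out
        }
    ' "$2"
}

layout="PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp"

# Sample files, made up for this example, with unrelated lines around the section.
old_file=$(mktemp); new_file=$(mktemp)
printf '%s\n' 'Version:1' \
    'CSVHeaderAttr:PUIS,IdleImmediate,POH,Temp,WorstTemp' \
    'CSVValuesAttr:NO,NO,9814,31,56' 'Other:data' > "$old_file"
printf '%s\n' 'Version:2' \
    'CSVHeaderAttr:PUIS,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp' \
    'CSVValuesAttr:NO,YES,YES,23861,31,51' > "$new_file"

old_row=$(normalize_attrs "$layout" "$old_file")
new_row=$(normalize_attrs "$layout" "$new_file")
echo "$old_row"
echo "$new_row"

rm -f "$old_file" "$new_file"
```

Because the output order comes from `layout`, a later change to the file format should only need that one string updated.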

RudiC · July 19, 2014, 1:52pm

If your samples are representative, this might do the job:

    awk 'FNR==1 {print "CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp"; TYPE=7-NF; next}
                {printf "%s:%s,%s,%s,%s,%s,%s\n", $1, $2, TYPE?$3:"NA", TYPE?"NA,NA":$3","$4, $(NF-2), $(NF-1), $NF }
        ' FS="[:,]" file[12]

    CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NO,NA,NA,9814,31,56
    CSVHeaderAttr:PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp
    CSVValuesAttr:NO,NA,YES,YES,23861,31,51
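Since the goal is a single file for one table, a variant of the same field-count approach could print the header only once (`NR==1`) while still detecting the layout per file (`FNR==1`). This is a sketch, not a tested replacement: it assumes each input file holds exactly the two lines shown in the samples, it drops the `CSVValuesAttr:` tag as the original `cut` did, and the name `merge_attrs` and the sample files are made up.

```shell
# Hypothetical helper: merge any number of two-line attribute files into one CSV.
merge_attrs() {
    awk -F'[:,]' '
        # Print the unified header once, before the first file only.
        NR == 1  { print "PUIS,IdleImmediate,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp" }
        # First line of each file: 6 fields means the old layout, 7 the new one.
        FNR == 1 { old = (NF == 6); next }
        {
            if (old) mid = $3 ",NA,NA"        # IdleImmediate present, new flags absent
            else     mid = "NA," $3 "," $4    # IdleImmediate absent, new flags present
            print $2 "," mid "," $(NF-2) "," $(NF-1) "," $NF
        }
    ' "$@"
}

# Sample files, made up for this example.
f1=$(mktemp); f2=$(mktemp)
printf '%s\n' 'CSVHeaderAttr:PUIS,IdleImmediate,POH,Temp,WorstTemp' \
    'CSVValuesAttr:NO,NO,9814,31,56' > "$f1"
printf '%s\n' 'CSVHeaderAttr:PUIS,IdleImmediateSupported,IdleImmediateEnabled,POH,Temp,WorstTemp' \
    'CSVValuesAttr:NO,YES,YES,23861,31,51' > "$f2"

merged=$(merge_attrs "$f1" "$f2")
printf '%s\n' "$merged"

rm -f "$f1" "$f2"
```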