parsing data for certain conditions

PAW · June 5, 2009, 5:51am

Hi guys,

I have got this working OK but I am sure there is a more efficient/elegant way of doing it, which I hope you can help me with.

It can be done in whatever is most suitable, e.g. Perl or awk.

Any suggestions are welcome and many thanks in advance.

What I require is to extract the first field, using " as the FS, up to the last . in that field. Sometimes there are several . characters in that field.
The second field runs from the last . to the first ".
The third field runs from the first " to the |, with spaces removed.

This output is only required if the third field, using " as the FS, is blank and the second field up to the | has data present.

Below is an example of all variants of the data I have, in a file of 800,000+ rows.

This is the output using the above input.

#!/bin/bash
IFS='"'
while read line
do
test1=`echo "$line" | awk -F'"' '{print $1}'`
test2=`echo "$line" | awk -F '[|]' '{print $(NF-1)}' | awk -F'"' 'BEGIN {OFS=","} {print $2}'|awk '{$1=$1;print}'`
test3=`echo "$line" | awk -F'"' '{print $3}'|awk '{$1=$1;print}'`
        if [[ -n "${test2}" && -z "${test3}" ]]; then
        FID=`echo  "${test1}"|awk -F"." '{ gsub(/-/,"",$0); for ( i = NF; i > 0; i-- ) printf("%s ",$i); printf("\n");}'| awk -F" " '{print $1}'`
        RIC=`echo  "${test1}"|sed -e 's/'.${FID}'//g'`
        echo "$RIC , $FID , $test2" >> philout
        else
        echo "false"
        fi
done < head_out_orig_phil

Cheers Phil.

cfajohnson · June 5, 2009, 7:47am

You are calling awk 6,400,000+ times, and sed 800,000+ times.

With 800,000+ rows you do need awk, but a single awk call for the whole file is enough, rather than eight awk calls (including one that does nothing) plus one sed call for every line of the file.

Here's a start to an awk script:

awk -F'"' '
    {
     test1 = $1
     fields = split($0,a,"|")
     test2 = a[fields - 1]
     test3 = $3

     if ( length(test2) > 0 && length(test3) == 0 ) ...
}
' head_out_orig_phil
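
As a hedged sketch of where that start could lead, the single awk pass below applies the conditions from the first post. The function name `extract_fields` is made up for illustration, the output format copies the `RIC , FID , value` line from the original script, and it assumes the value only needs leading and trailing blanks trimmed (inner spaces are kept).

```shell
# Hypothetical helper: reads lines on stdin, prints "RIC , FID , value".
# Usage: extract_fields < head_out_orig_phil > philout
extract_fields() {
    awk -F'"' '
    {
        key  = $1                          # text before the first double quote
        val  = $2                          # text between the two double quotes
        rest = $3                          # text after the second double quote

        sub(/[|].*/, "", val)              # keep only the part before the pipe
        gsub(/^[ \t]+|[ \t]+$/, "", val)   # trim blanks around the value
        gsub(/[ \t\r]/, "", rest)          # a trailer of only blanks counts as empty

        # Wanted only when a value is present and the trailer is blank.
        if (val == "" || rest != "") next

        gsub(/[ \t]+$/, "", key)           # drop the blank before the quote
        dot = match(key, /\.[^.]*$/)       # position of the last dot in the key
        if (dot == 0) next                 # no dot at all: nothing to split

        print substr(key, 1, dot - 1) " , " substr(key, dot + 1) " , " val
    }'
}
```

Everything happens inside one awk process, so the cost no longer grows with the number of external commands started per line.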

PAW · June 5, 2009, 8:58am

OK thanks, I'll give it a go.

durden_tyler · June 5, 2009, 9:12am

Here's one way to do it in perl:

$
$ cat data.txt
CH0045775191=UBSL.RDN_EXCHD2 " | CH0045775191=UBSL.RDN_EXCHD2 "phil
CH0045775191=UBSL.TILE_DESC " | CH0045775191=UBSL.TILE_DESC "
CH0024226190=UBSL.ISSUE_DATE " | CH0024226190=UBSL.ISSUE_DATE "
CH0024226190=UBSL.CONV_TEXT "G VANKE | CH0024226190=UBSL.CONV_TEXT "
CH0024226190=UBSL.GEN_VAL1 "+16.56 | CH0024226190=UBSL.GEN_VAL1 "J0shua
CH0032678747.UBS.MKT_MKR_NM "govindva | CH0032678747.UBS.MKT_MKR_NM "
$
$
$
$ perl -ne '@f = split /["|]/;
>  if ($f[3] =~ /^\s*$/ && $f[1] !~ /^\s*$/ && $f[0] =~ /^(.*)\.([^.]*?) /) {
>   print "$1 , $2 , $f[1]\n" }' data.txt
CH0024226190=UBSL , CONV_TEXT , G VANKE
CH0032678747.UBS , MKT_MKR_NM , govindva
$
$

tyler_durden

PAW · June 5, 2009, 9:25am

Thanks for the contribution, Tyler.

PAW · June 18, 2009, 9:10am

Hi guys,

OK, the awk script has a problem: it produces output from the test even when there are no characters, so I presume it is picking up spaces/tabs.
Can you help with this?

#!/bin/bash
awk -F'"' '
    {
     test1 = $1
     test2 = $2
     fields = split(test2,a,"|")
     test4 = a[fields - 1]
     test3 = $3
     if ( length(test4) > 0 && length(test3) == 0 )   print test4 ; else print "fail"
}
' head_out_orig_phil

Output

Using the original input file.
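
A possible explanation, offered as a sketch rather than a confirmed diagnosis: when the quotes hold no value, the piece in front of the | is still a single space, so length() reports 1 and the test passes. The variant below strips blanks before testing. The function name `check_value` is hypothetical, and it takes the piece before the first | (a[1]), which is the same element as a[fields - 1] when a line has exactly one |.

```shell
# Hypothetical helper: prints the value found before the pipe, or "fail".
# Usage: check_value < head_out_orig_phil
check_value() {
    awk -F'"' '
    {
        split($2, a, "|")                    # a[1] is the text before the pipe
        value = a[1]
        gsub(/^[ \t]+|[ \t]+$/, "", value)   # drop leading and trailing blanks

        trailer = $3
        gsub(/[ \t\r]/, "", trailer)         # ignore blanks after the last quote

        if (length(value) > 0 && length(trailer) == 0)
            print value
        else
            print "fail"
    }'
}
```

Trimming first means a field made up only of spaces or tabs has length 0, which is what the original test was meant to detect.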

The Perl script works OK, except that it fails on a couple of lines containing the highlighted character shown below. Do you have any ideas? Also, if you could add some comments to the script I would appreciate it, as my Perl is not too good.

CH0042237526=UBSL.GNTXT14_5 "CH0042237526�                  | CH0042237526=UBSL.GNTXT14_5 "
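
The thread does not say which byte the highlighted character really is, so a first step could be to locate the affected lines and inspect them. The sketch below is only a diagnostic aid under that assumption: the function name `find_odd_lines` is made up, and it reports the line numbers of any line containing a byte outside printable ASCII (tabs are allowed, carriage returns are reported).

```shell
# Hypothetical helper: prints the numbers of lines that contain bytes
# outside the printable ASCII range (space to tilde) other than tab.
# Usage: find_odd_lines < head_out_orig_phil
find_odd_lines() {
    # LC_ALL=C makes awk compare single bytes instead of locale characters.
    LC_ALL=C awk '/[^ -~\t]/ { print NR }'
}
```

A reported line can then be dumped byte by byte, for example with `sed -n '2p' head_out_orig_phil | od -c` for line 2, to see what the character is before deciding how the script should treat it.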

Many thanks for your help

Phil.