Parsing 286 length Character string

ppat7046 · April 21, 2009, 5:02pm

Hi Friends,

I have a .txt file with 13000 records.
Each record is 278 characters long.

I am using the code below to extract the fields, and it takes almost 10 minutes.
Any suggestions, please?

cat filename.txt|while read line
do

f1=`echo $line|awk '{print substr($1,1,9)}'`
f2=`echo $line|awk '{print substr($1,10,20)}'`
f3=`echo $line|awk '{print substr($1,30,50)}'`
f4=`echo $line|awk '{print substr($1,80,10)}'`
f5=`echo $line|awk '{print substr($1,90,50)}'`
f6=`echo $line|awk '{print substr($1,140,10)}'`
f7=`echo $line|awk '{print substr($1,150,50)}'`
f8=`echo $line|awk '{print substr($1,200,10)}'`
f9=`echo $line|awk '{print substr($1,210,50)}'`
f10=`echo $line|awk '{print substr($1,260,10)}'`
f11=`echo $line|awk '{print substr($1,270,8)}'`
f12=`echo $line|awk '{print substr($1,278,8)}'`

s1=`echo $f1"|"$f2"|"$f3"|"$f4"|"$f5"|"`
s2=`echo $f6"|"$f7"|"$f8"|"`
s3=`echo $f9"|"$f10"|"`
s4=`echo $f11"|"$f12`

echo $s1$s2$s3$s4 >> FinalResult.txt
done

vgersh99 · April 21, 2009, 6:02pm

nawk -f fieldwidth.awk filename.txt > FinalResult.txt

fieldwidth.awk:

function setFieldsByWidth(   i,n,FWS,start,copyd0) {
  # Licensed under GPL Peter S Tillier, 2003
  # NB corrupts $0
  copyd0 = $0                             # make copy of $0 to work on
  if (length(FIELDWIDTHS) == 0) {
    print "You need to set the width of the fields that you require" > "/dev/stderr"
    print "in the variable FIELDWIDTHS (NB: Upper case!)" > "/dev/stderr"
    exit(1)
  }

  if (!match(FIELDWIDTHS,/^[0-9 ]+$/)) {
    print "The variable FIELDWIDTHS must contain digits, separated" > "/dev/stderr"
    print "by spaces." > "/dev/stderr"
    exit(1)
  }

  n = split(FIELDWIDTHS,FWS)

  if (n == 1) {
    print "Warning: FIELDWIDTHS contains only one field width." > "/dev/stderr"
    print "Attempting to continue." > "/dev/stderr"
  }

  start = 1
  for (i=1; i <= n; i++) {
    $i = substr(copyd0,start,FWS[i])   # width of field i
    start = start + FWS[i]             # advance to the start of the next field
  }
}

#Note that the "/dev/stderr" entries in some lines have wrapped.

#I then call setFieldsByWidth() in my main awk code as follows:
BEGIN {
  #FIELDWIDTHS="7 6 5 4 3 2 1" # for example
  # adjust the FIELDWIDTHS values as you see fit.
  FIELDWIDTHS="9 20 50 10 50 10 50 10 50 10 8 8" # widths matching the substr() lengths in the original post
  OFS="|"
}
!/^[ \t]*$/ {
  saveDollarZero = $0 # if you want it later
  setFieldsByWidth()
  # now we can manipulate $0, NF and $1 .. $NF as we wish
  # print $0 OFS
  print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12
  next
}

JerryHone · April 21, 2009, 6:12pm

A simpler method is to create a script, parse.awk:

BEGIN { OFS="|" }   # set the output separator once, before any record is read
{
  f1=substr($1,1,9)
  f2=substr($1,10,20)
  f3=substr($1,30,50)
  f4=substr($1,80,10)
  f5=substr($1,90,50)
  f6=substr($1,140,10)
  f7=substr($1,150,50)
  f8=substr($1,200,10)
  f9=substr($1,210,50)
  f10=substr($1,260,10)
  f11=substr($1,270,8)
  f12=substr($1,278,8)

  # print already ends each record with a newline, so no extra "\n" is needed
  print f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f11,f12
}

Then run

awk -f parse.awk filename.txt > FinalResult.txt

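I believe your original code takes so long because every backtick substitution forks a subshell, and 12 of them per record also start a separate awk process. With the script above, a single awk process handles the whole file.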
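
To make the single-process idea concrete, here is a minimal sketch that does the whole split in one awk invocation, driven by a list of widths. The helper name split_fixed and the sample data are made up for illustration and are not from the posts above. For the file in this thread you would presumably pass the substr() lengths from the original post (9 20 50 10 50 10 50 10 50 10 8 8).

```shell
# Hypothetical helper: split every stdin line into fixed-width fields
# using ONE awk process for the whole input.
# Usage: split_fixed "9 20 50 10 50 10 50 10 50 10 8 8" < filename.txt
split_fixed() {
  awk -v widths="$1" '
    BEGIN { n = split(widths, w, " ") }   # w[1..n] holds the field widths
    {
      out = ""; sep = ""; start = 1
      for (i = 1; i <= n; i++) {
        # substr() on $0 works on the whole record, blanks included
        out = out sep substr($0, start, w[i])
        sep = "|"
        start += w[i]
      }
      print out
    }'
}

# Small demonstration with made-up widths and data (not the poster's file)
printf 'AAABBCCCC\nxxxyyzzzz\n' | split_fixed "3 2 4"
```

The loop costs nothing in extra processes: however many records the file has, the shell starts awk exactly once.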

cfajohnson · April 21, 2009, 11:27pm

awk '{
 print  substr($1,1,9) "|" \
        substr($1,10,20) "|" \
        substr($1,30,50) "|" \
        substr($1,80,10) "|" \
        substr($1,90,50) "|" \
        substr($1,140,10) "|" \
        substr($1,150,50) "|" \
        substr($1,200,10) "|" \
        substr($1,210,50) "|" \
        substr($1,260,10) "|" \
        substr($1,270,8) "|" \
        substr($1,278,8)
}' filename.txt > FinalResult.txt

amitmathapati · April 22, 2009, 12:57am

Hi ppl..

What if I have a line like this:
A      BCD

Here the first field is f1=A, the second field f2 is 6 blank characters, and the third is f3=BCD.
If I use the above script, I don't get the fields in that case.
Can you please suggest how to go about it?

Cheers Amit

cfajohnson · April 22, 2009, 9:07am

Use $0 instead of $1 (which is what I should have used):

awk '{
 print  substr($0,1,9) "|" \
        substr($0,10,20) "|" \
        substr($0,30,50) "|" \
        substr($0,80,10) "|" \
        substr($0,90,50) "|" \
        substr($0,140,10) "|" \
        substr($0,150,50) "|" \
        substr($0,200,10) "|" \
        substr($0,210,50) "|" \
        substr($0,260,10) "|" \
        substr($0,270,8) "|" \
        substr($0,278,8)
}' filename.txt > FinalResult.txt
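
A quick way to see the difference, as a sketch using the sample record from the question (an A, six blanks, then BCD). The widths 1, 6 and 3 are my assumption about that sample's layout, not the real file's:

```shell
# Sample record from the question: "A", then 6 blanks, then "BCD"
line='A      BCD'

# $1 is only the first whitespace-delimited word ("A"),
# so the later substr() calls run past its end and return nothing
with_field1=$(printf '%s\n' "$line" |
  awk '{ print substr($1,1,1) "|" substr($1,2,6) "|" substr($1,8,3) }')

# $0 is the whole record, blanks included
with_record=$(printf '%s\n' "$line" |
  awk '{ print substr($0,1,1) "|" substr($0,2,6) "|" substr($0,8,3) }')

printf '[%s]\n' "$with_field1"   # prints [A||]
printf '[%s]\n' "$with_record"   # prints [A|      |BCD]
```

awk splits $0 into $1, $2, ... on runs of whitespace by default, so any record with embedded blanks loses everything after the first word when you index into $1.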

ppat7046 · April 22, 2009, 9:35am

Thank you all for your replies.

I used the suggestion provided by cfajohnson, and now it takes only 20 seconds to parse the 800,000 records.

Thank you very much,
Prashant