How to parse fixed-width columns which may include empty fields?

ahsh79 · August 2, 2012, 2:00pm

I am trying to selectively display several columns from a db2 query, which gives me a fixed-width output (partial output listed here):

--------- -------------------------- ------------ ------
000       0000000000198012           702          29
000       0000000000198013                        29
000                                  702          29
000       0000000000198015           702          03
055       0000000000000001           702

I am trying to accomplish two things:

parsing each field from each column, even the ones that are blank
put a delimiter of ";" between each field

So expected output would be:

000;0000000000198012;702;29
000;0000000000198013;;29
000;;702;29
000;0000000000198015;702;03
055;0000000000000001;702;

This is my draft code, which works:
myquery | cut -c 1-9,11-48,60-68,112-142,143-159 --output-delimiter=";"

However my target system does not support any of the GNU utilities, so I cannot use the "output-delimiter" option with cut, or the "FIELDWIDTHS" option with gawk.

So how can I get the desired output, when there are blank fields in some columns?

Thanks in advance for your help!

bartus11 · August 2, 2012, 2:28pm

To keep your data's formatting, use code tags , like this:

--------- -------------------------- ------------ ------
000       0000000000198012           702          29
000       0000000000198013                        29
000                                  702          29
000       0000000000198015           702          03
055       0000000000000001           702

What should be desired output for that sample data (with code tags)?

bobbygsk · August 2, 2012, 2:39pm

sed "s/ \{1\,\}/;/g" <filename>

ahsh79 · August 2, 2012, 3:06pm

I tried it in my code:

myquery | cut -c 1-9,11-48,60-68,112-142,143-159 | sed "s/ \{1\,\}/;/g"

and although it's putting in the delimiter between the fields correctly, I am still having a problem where when one field is blank, it is pulling the data from the next non-blank field.

Here is the actual output:

000;0000000000198012;702;DS;WNT000
000;0000000000198013;702;DS;WNT100
000;0000000000198014;702;DS;WNT100
000;0000000000198015;702;ABC123456;DS;WNT100
100;0000000000000001;702;ABC123;DS;WNT100

Here is the desired output (notice the empty fields in the 4th column);

000;0000000000198012;702; ;DS;WNT000
000;0000000000198013;702; ;DS;WNT100
000;0000000000198014;702; ;DS;WNT100
000;0000000000198015;702;ABC123456;DS;WNT100
100;0000000000000001;702;ABC123;DS;WNT100

bobbygsk · August 2, 2012, 3:28pm

thought I'm not master to get the result in one stream, I'll use another sed command .

myquery | cut -c 1-9,11-48,60-68,112-142,143-159 | sed "s/ \{1\,\}/;/g" | sed "s/;;/;/g"

---------- Post updated at 03:28 PM ---------- Previous update was at 03:26 PM ----------

ignore my above post

binlib · August 2, 2012, 3:29pm

awk '
BEGIN {
  n = split("1-9,11-36,38-49,51-56", a, "[^0-9]") - 1
  for (i = 1; i <= n; i += 2)
    l = a[i+1] - (p = a) + 1
}
{
  for (i = 1; i <= n; i += 2) {
    x = substr($0, p, l)
    sub(/ +$/, "", x)
    printf("%s%s", x, (i < n)? "," : "\n")
  }
}'

Don_Cragun · August 2, 2012, 3:32pm

ahsh79:

I tried it in my code:
myquery | cut -c 1-9,11-48,60-68,112-142,143-159 | sed "s/ \{1\,\}/;/g"
and although it's putting in the delimiter between the fields correctly, I am still having a problem where when one field is blank, it is pulling the data from the next non-blank field.

Here is the actual output:
000;0000000000198012;702;DS;WNT000
000;0000000000198013;702;DS;WNT100
000;0000000000198014;702;DS;WNT100
000;0000000000198015;702;ABC123456;DS;WNT100
100;0000000000000001;702;ABC123;DS;WNT100
Here is the desired output (notice the empty fields in the 4th column);
000;0000000000198012;702; ;DS;WNT000
000;0000000000198013;702; ;DS;WNT100
000;0000000000198014;702; ;DS;WNT100
000;0000000000198015;702;ABC123456;DS;WNT100
100;0000000000000001;702;ABC123;DS;WNT100

Do you want <space> representing an empty field? Or do you want:

000;0000000000198012;702;;DS;WNT000
000;0000000000198013;702;;DS;WNT100
000;0000000000198014;702;;DS;WNT100
000;0000000000198015;702;ABC123456;DS;WNT100
100;0000000000000001;702;ABC123;DS;WNT100

ahsh79 · August 2, 2012, 3:39pm

I prefer no space between the two ;; I had just used that to highlight the empty field.

Don_Cragun · August 2, 2012, 5:12pm

I think we've had a problem communicating because the example input doesn't match the sample results you said that you want to see. When you said the input was truncated, I think most of us thought you were giving us complete lines (but just giving us a few lines; not that you were also shortening fields in the few lines you gave us). And then showing a <space> between semicolons in sample output when you didn't want the space there caused more confusion.

I think you'll get what you want if you take the code supplied by binlib modified to use the field start and end points specified in your first message, and modified to use ";" instead of "," as the output field separator:

awk '
BEGIN {
  n = split("1-9,11-48,60-68,112-142,143-159", a, "[^0-9]") - 1
  for (i = 1; i <= n; i += 2)
    l = a[i+1] - (p = a) + 1
}
{
  for (i = 1; i <= n; i += 2) {
    x = substr($0, p, l)
    sub(/ +$/, "", x)
    printf("%s%s", x, (i < n)? ";" : "\n")
  }
}'

If you're having trouble following the initial set up in the BEGIN clause or the for loop in the main body of the script, another way to do it would be:

awk '{
	out=sprintf("%s;%s;%s;%s;%s", substr($0,1,8), substr($0,11,36),
		substr($0,60,7), substr($0,112,29), substr($0,143))
	gsub(" ","",out)
	print out
}'

ahsh79 · August 3, 2012, 8:13am

My apologies for posting different sample output. I agree, that was confusing.

And that works!!! Thanks!