awk script to remove spaces - examples don't show up correctly

I have the following data from a manual database dump. I need to format the columns so that I can import them into an excel spread sheet. So far I have been able to get past the hurdles with vi and grep. Now I have one last issue that I can't get past. Here is an example of the data.

Here is what it looks like. The third column is a part description with spaces and other characters, but none one is a consistent delimiter.
-----------------------------

2 910-0158                               ENCLOSURE,TOP,SMK,10-YR                  B   4052      1.00000 EA  U J M  M  No
2 910-0159                               ENCLOSURE,BOTTOM,SMK,10-YR               E   5404      1.00000 EA  U J M  M  No
2 910-0160                               TRIMPLATE,SMK,DC                         D   4735      1.00000 EA  U J M  M  No
2 910-0155                               ROTATE ACTIVATE,SMK,10-YR                C   4732      1.00000 EA  U J M  M  No
2 910-0156                               ROTATE DEACTIVATE,SMK,10-YR              B   4733      1.00000 EA  U J M  M  No
2 910-0157                               SLIDER,ACTIVATE-DEACTIVATE,SMK,10-YR     D   4734      1.00000 EA  U J M  M  No
2 910-0171                               SUPPORT PLATE,SMK,10-YR                  B   4737      1.00000 EA  U J M  M  No
2 910-0149                               BUTTON,PUSH TO TEST,SMK                  D   3828      1.00000 EA  U J M  M  No
2 910-0150                               BUTTON,HUSH,SMK                          E   4714      1.00000 EA  U J M  M  No
2 910-0151                               LIGHT PIPE,SMK,NGS                       A             1.00000 EA  U J M  M  No
2 940-0084                               CLICHE,KIDDE LOGO ONLY,5mm,"K"           A             0.00000 EA  U J M  M  No
2 0008-2305                              SCREW,PAN HEAD,M2.3X6                    A             2.00000 EA  U J M  M  No
1 800-0592                               ASY,SUB,PALLET,0910,4PK,C/S,288 PER      A             0.00438 EA  U J M  M  No
2 810-1339                               PKG,PALLET,CORR.PAPER,43.5x49x4.5        A             1.00000 EA  U P M  P  No
2 820-0157                               LABEL,DECAL,BARCODE,4x2,FINAL PKG,FHK    S   4988     57.00000 EA  U J M  M  No
3 810-1157                               LABEL,DECAL,4X2,WHT                      A             1.00000 EA  U P M  P  No
2 0000-7513                              PKG,EXT,43X2X7/32,HORIZONTAL             0             2.00000 EA  U J M  M  No
2 0000-7516                              PKG,EXT,39X2X7/34,HORIZONTAL             0             2.00000 EA  U J M  M  No

The first number of each line basically lets us know how many white spaces are needed to be removed from in front of the 3rd column in order for the following columns to line up correctly. My posting has removed the white spaces but if you could see it, it would have about 20-30 spaces between the 2nd and 3rd column depending on the first number. I can't seem to get the awk script to use the sub() function properly to replace a number of white spaces with just one or just to remove it altogether. I think its my regex. But I'm sure i'm skinning the cat the wrong way this time. Any help would be appreciated. I've read many posts on white space manipulation but none really help with this situation. The dump of the database is about 60,000 lines long. Too much to manually do line by line.

chris

Hi,

please post some examples of what your desired output should look like.
I don't quite understand how you mean the columns. So what exactly
is the 3rd column?

Regards

Chris

The white spaces are removed from my posting. Is there a way I can post without that happening? Otherwise, I'd have to email you directly the examples of what it currently looks like and what is should look like.

chris

Use code tags. Highlight your code-segment and press
the fourth button from the right of the above tool menu.
It is the litte "#" to the left of "<>".
This should preserve the whitespaces and give an
example of the desired output.

what is should look like
--------------------------

1 21007527                          LID,DET,C/S,BULK,PEZ,7 1/2x18 1/2x X     A             1.00000 EA  U J M  M  No
1 21007528                          TRAY,DET,C/S,BULK,PEZ,7 1/2x18 1/2x X    A             1.00000 EA  U J M  M  No
1 820-1625                          LABEL,INFO,BULK C/S,900-0076             A             1.00000 EA  U J M  M  No
2 810-2399                          LABEL,DIESTRIKE,6X4                      A             1.00000 EA  U J M  M  No
1 900-0076-003                      CO,900-0076,C/S,UL,SMT,EN                A            12.00000 EA  U J M  M  No
2 800-0056                          ASSY,CO,SUB,MOUNTING_KIT,(SCREW/ANCHOR)  C   1552      2.00000 EA  U P M  P  No
3 810-1292                          BAG,POLY,CLR,2 X 3                       A   2748      1.00000 EA  L I O  M  No

What it does look like
------------------------

3 810-2024                              RES,FXD,10_OHM,1%,1/8W,MET               A             1.00000 EA  L I O  M  No
3 810-2373                              RES,FXD,10_OHM,5%,1/8W,CAR               O             1.00000 EA  U J M  M  No
3 810-2453                              CAP,CM,15pf,20%,AXIAL                    B   4136      2.00000 EA  L I O  M  No
3 820-0163                              LABEL,DECAL,.2X.6,WHT,PRINTED,CO/SMK,PCB 0             1.00000 EA  U J M  M  No
4 810-1119                             LABEL,.20X.60,WHT,PAPER,(FACTORY ID)     0             1.00000 EA  U J M  M  No
3 2505-9501                             ASY,SUB,SMOKE CHAMBER                    2   1685      1.00000 EA  U J M  M  No
4 0910-4101                            REFERENCE PLATE                          PR            1.00000 EA  U J M  M  No
4 0910-4102                            SOURCE PLATE                             3             1.00000 EA  L I O  M  No
4 0910-4103                            SOURCE HOLDER                            5             1.00000 EA  U J M  M  No
4 0915-4101                            SOURCE                                   A   2689      1.00000 EA  L I O  M  No
4 0915-4107                            CHAMBER,SMOKE                            PR            1.00000 EA  U J M  M  No
4 910-0118                             CHAMBER,PLASTIC HOUSING                  A   1769      1.00000 EA  L I O  M  No
3 950-0012                              SCHEMATIC,ATWD,BAT,CO/SMOKE,4.5V,UL      19  4304      0.00000 EA  U J M  M  No
2 800-0387                               ASY,SUB,PCBA,CO/SMK,VOICE,UL             B   3293      1.00000 EA  U J M  M  No
3 810-0990                              SPEAKER,MYLAR,8_OHM,36mm                 D   5502      1.00000 EA  U J M  M  No
3 810-1223                              RES,SMD,10K,5%,1/8W,0805                 O             1.00000 EA  U J M  M  No
3 810-1226                              RES,SMD,1K,5%,1/8W,0805                  O             1.00000 EA  U J M  M  No
3 810-1227                              CAP,SMD,47pF,10%                         O             1.00000 EA  U J M  M  No
3 810-1233                              INDUCTOR,200uh,0.5OHM,Radial             0             1.00000 EA  U J M  M  No

Thanks for the instruction on how to keep the form entact.

So, try:

awk -F'[ ][ ]+' '\
NF==7{printf "%-13s %45s %15s %10s %10s %10s %s\n", $1,$2,$3,$4,$5,$6,$7};\
NF==6{printf "%-13s %45s %15s %10s %10s %10s\n", $1,$2,$3,$4,$5,$6}' file4

The code uses multiple spaces as field delimiter. It works pretty well
except for cases where there is only one space as delimiter. I would replace
them with another little script and then run this one. Sorry, but at
the moment i have no time.

HTH Chris

I played around with your snippet of code. It is definitely useful but it has solved my problem just yet. Although I get alignment on column 3, the following columns are out of place. Any other suggestions? Is there any way to set a variable based on the first column and subtract spaces from between the 2nd and 3rd columns?

Another issue is that I need to maintain the white spaces in the last several columns in or do account for place holders for no data.

Your first post seems to show the data already lining up in columns.
Also, you might review the substr($0,10,2) awk command; which would take the input line and output 2 characters beginning at position #10.

Can you provide a sample of the input file (with code tags) if you current input file does not line up.

I have a file of about 60,000 lines of output. Each line has multiple fields, some blank. I have an example of what it looks like the way it needs to look by manually lining things up in vi and awk. Here is what is needs to looks like

0 0000                              SEE TEXT                                               1.00000 EA         M  No
1 SEE-TEXT                          INSURE CORRECT PCBA AND FIRMWARE.                      1.00000 EA  U J M  M  No
0 0032RS01                          LABEL,DECAL,BOX,900-0032-00                            1.00000 EA         P  No
0 0910-BC                           SMK,0910,BULK,C/S,UL                     A             1.00000 EA         M  No
0 1475-2104                         BATTERY DOOR                             02  2715      1.00000 EA         M  No
0 21002381                          CO,900-0076,C/S,12PK,PEZ,UL              A             1.00000 EA         P  No
1 21007527                          LID,DET,C/S,BULK,PEZ,7 1/2x18 1/2x X     A             1.00000 EA  U J M  M  No
1 21007528                          TRAY,DET,C/S,BULK,PEZ,7 1/2x18 1/2x X    A             1.00000 EA  U J M  M  No
1 820-1625                          LABEL,INFO,BULK C/S,900-0076             A             1.00000 EA  U J M  M  No
2 810-2399                          LABEL,DIESTRIKE,6X4                      A             1.00000 EA  U J M  M  No
1 900-0076-003                      CO,900-0076,C/S,UL,SMT,EN                A            12.00000 EA  U J M  M  No
2 800-0056                          ASSY,CO,SUB,MOUNTING_KIT,(SCREW/ANCHOR)  C   1552      2.00000 EA  U P M  P  No
3 810-1292                          BAG,POLY,CLR,2 X 3                       A   2748      1.00000 EA  L I O  M  No
3 810-0009                          SCREW,PHILLIPS,#8x3/4",TYPE_A,PANHEAD    C    679      1.00000 EA  U P M  P  No
3 810-0010                          ANCHOR,PLASTIC,#8-1"_LG                  C    680      1.00000 EA  U P M  P  No
2 810-0629                          INK,COOL_GRA,TYPE_P-NT,GLOSSY(PMS_10)    0             0.00001 OZ  U P M  P  No
2 810-0689                          LABEL,DECAL,NO_BATTERY,ORANGE            C   2669      1.00000 EA  U P M  P  No
2 810-0939                          LABEL,ATWD,DLX,UL,4.1x.57,"DO NOT PAINT" O             1.00000 EA  U J M  M  No
2 810-1025                          LABEL,WARNING,WALL MOUNT,w\ADH,ENGL      F   5443      2.00000 EA  U J M  M  No
2 810-1174                          BAG,ACTIVATED CARBON,TYVEK,5 GRAM,1.5x3  C1  4436      1.00000 EA  U J M  M  No
2 810-1270                          LABEL,SECURITY LABEL,1.78"x.42",WHT      B   3886      1.00000 EA  U J M  M  No

Here is an example of the data without alteration:

0 21006658                                 ASY,SMK,10-YR,UL,CS,4 PK                 A             1.00000 EA         M  No
1 900-0136                                ASY,SMK,10-YR,UL                         C   4730      4.00000 EA  U J M  M  No
2 800-0448                               ASY,SUB,PCBA,SMK,DC,10-YR                A   4998      1.00000 EA  U J M  M  No
3 810-1734                              BATTERY,CR2/3,8L-FIST,35                 B   3754      3.00000 EA  U J M  M  No
3 6001-1009                             CAP.,0.001uf                                           3.00000 EA  U J M  M  No
3 6701-1007                             CAP,MC,0.1UF,50V,10%,2.5MM               A             1.00000 EA  U J M  M  No
3 6103-1006                             CAP,MC,1.0UF,50V,20%,DXL:4X7             A             1.00000 EA  U J M  M  No
3 5925-2001                             DIODE,1N4001                                           3.00000 EA  U J M  M  No
3 5910-1001                             DIODE 1N4148 AXIAL                       A             1.00000 EA  U J M  M  No
3 5980-0011                             LED,RED,5MM L-15MM & 10MM                A             1.00000 EA  U J M  M  No
3 5051-5100                             RES,5.1_OHM                                            3.00000 EA  L I O  M  No
3 5052-1801                             RES.,18_OHM                                            2.00000 EA  U J M  M  No
3 5111-0018                             RES.,100K-22M,1/4W                                     2.00000 EA  U J M  M  No
3 5051-2205                             RES.,220K,CF,1/4W,5%,AXIAL               A             1.00000 EA  U J M  M  No
3 5051-1506                             RES.,1.5M                                              2.00000 EA  U J M  M  No
3 5111-1475                             RES,147K,1%                                            1.00000 EA  L I O  M  No
3 5051-1306                             RES.,1.3M                                              1.00000 EA  U J M  M  No
3 5051-0006                             RES.,1.5-10M,1/4W                                      1.00000 EA  U J M  M  No
3 5051-4702                             RES.,470_ohm                                           1.00000 EA  U J M  M  No
3 5051-2207                             RES.,22M,1/4W                                          1.00000 EA  U J M  M  No

The first column is the "level" indicator. 0 = 7 space indentation 1 = 6 space indentation.......

My goal is to align the columns so that I can then import into a spreadsheet and define the columns based on the set widths. If one cell is empty, it needs to have a white space in the column to ensure the remaining cells fill correctly.

What I have been trying to do is to set a variable based on the first column in an awk script and then try and delete the white spaces between the 2nd and 3rd column. Let me know if this is clear. I am working on the sub() and now the substr() functions.

Thanks for your help,

chris

Your are right, it can be so simple...

sed '/^3/s/^\(.\{15\}\)\s/\1/;/^2/s/^\(.\{15\}\)\s\s/\1/' file

For me this works perfectly.

And with a little sed file now covering the new special cases of
your latest example:

command:

sed -f sedfile file

sedfile:

/^3/s/^\(.\{15\}\)\s/\1/
/^2/s/^\(.\{15\}\)\s\s/\1/ 
/^1/s/^\(.\{15\}\)\s\s\s/\1/ 
/^0/s/^\(.\{15\}\)\s\s\s\s/\1/ 

This should be easily adaptable to your needs.

HTH Chris

> cat file125
0 21006658                                 ASY,SMK,10-YR,UL,CS,4 PK                 A             1.00000 EA         M  No
1 900-0136                                ASY,SMK,10-YR,UL                         C   4730      4.00000 EA  U J M  M  No
2 800-0448                               ASY,SUB,PCBA,SMK,DC,10-YR                A   4998      1.00000 EA  U J M  M  No
3 810-1734                              BATTERY,CR2/3,8L-FIST,35                 B   3754      3.00000 EA  U J M  M  No
3 6001-1009                             CAP.,0.001uf                                           3.00000 EA  U J M  M  No
3 6701-1007                             CAP,MC,0.1UF,50V,10%,2.5MM               A             1.00000 EA  U J M  M  No
3 6103-1006                             CAP,MC,1.0UF,50V,20%,DXL:4X7             A             1.00000 EA  U J M  M  No
3 5925-2001                             DIODE,1N4001                                           3.00000 EA  U J M  M  No
3 5910-1001                             DIODE 1N4148 AXIAL                       A             1.00000 EA  U J M  M  No
3 5980-0011                             LED,RED,5MM L-15MM & 10MM                A             1.00000 EA  U J M  M  No
3 5051-5100                             RES,5.1_OHM                                            3.00000 EA  L I O  M  No
3 5052-1801                             RES.,18_OHM                                            2.00000 EA  U J M  M  No
3 5111-0018                             RES.,100K-22M,1/4W                                     2.00000 EA  U J M  M  No
3 5051-2205                             RES.,220K,CF,1/4W,5%,AXIAL               A             1.00000 EA  U J M  M  No
3 5051-1506                             RES.,1.5M                                              2.00000 EA  U J M  M  No
3 5111-1475                             RES,147K,1%                                            1.00000 EA  L I O  M  No
3 5051-1306                             RES.,1.3M                                              1.00000 EA  U J M  M  No
3 5051-0006                             RES.,1.5-10M,1/4W                                      1.00000 EA  U J M  M  No
3 5051-4702                             RES.,470_ohm                                           1.00000 EA  U J M  M  No
3 5051-2207                             RES.,22M,1/4W                                          1.00000 EA  U J M  M  No
> space="~~~~~~~"
> awk -v space=$space '{print substr($0,1,37)substr(space,1,substr($0,1,1))substr($0,41)}' file125 | tr "~" " "
0 21006658                              ASY,SMK,10-YR,UL,CS,4 PK                 A             1.00000 EA         M  No
1 900-0136                              ASY,SMK,10-YR,UL                         C   4730      4.00000 EA  U J M  M  No
2 800-0448                              ASY,SUB,PCBA,SMK,DC,10-YR                A   4998      1.00000 EA  U J M  M  No
3 810-1734                              BATTERY,CR2/3,8L-FIST,35                 B   3754      3.00000 EA  U J M  M  No
3 6001-1009                             CAP.,0.001uf                                           3.00000 EA  U J M  M  No
3 6701-1007                             CAP,MC,0.1UF,50V,10%,2.5MM               A             1.00000 EA  U J M  M  No
3 6103-1006                             CAP,MC,1.0UF,50V,20%,DXL:4X7             A             1.00000 EA  U J M  M  No
3 5925-2001                             DIODE,1N4001                                           3.00000 EA  U J M  M  No
3 5910-1001                             DIODE 1N4148 AXIAL                       A             1.00000 EA  U J M  M  No
3 5980-0011                             LED,RED,5MM L-15MM & 10MM                A             1.00000 EA  U J M  M  No
3 5051-5100                             RES,5.1_OHM                                            3.00000 EA  L I O  M  No
3 5052-1801                             RES.,18_OHM                                            2.00000 EA  U J M  M  No
3 5111-0018                             RES.,100K-22M,1/4W                                     2.00000 EA  U J M  M  No
3 5051-2205                             RES.,220K,CF,1/4W,5%,AXIAL               A             1.00000 EA  U J M  M  No
3 5051-1506                             RES.,1.5M                                              2.00000 EA  U J M  M  No
3 5111-1475                             RES,147K,1%                                            1.00000 EA  L I O  M  No
3 5051-1306                             RES.,1.3M                                              1.00000 EA  U J M  M  No
3 5051-0006                             RES.,1.5-10M,1/4W                                      1.00000 EA  U J M  M  No
3 5051-4702                             RES.,470_ohm                                           1.00000 EA  U J M  M  No
3 5051-2207                             RES.,22M,1/4W > 

Or a solution using perl:

perl -wlape '$c=4-$F[0];s/^(.{13})\s{$c}/\1/' file

The above perl Snippet worked. Joeyg, your code worked but then bombed out on line 96 of 58000. I'm trying to see why. Thanks everyone for your help. I was surely stumped and skinning the cat the wrong way.

Perhaps do a
head -100 myfile | tail -10 >tempfile

then post those 10 lines of the tempfile

I think the <B1> caused the failure.

3 910-0145                              BAT. CARRIER,TOP,SMK,FRONT LOAD BAT.     A             1.00000 EA  U J M  M  No
3 910-0146                              BAT.CARRIER,BTM,SMK,FRONT LOAD BAT.      C   3872      1.00000 EA  U J M  M  No
3 910-0147                              BAT. LOCKOUT,SMK,FRONT LOAD BAT.         A             1.00000 EA  U J M  M  No
3 1276-7102                             LABEL,BATTERY WARNING,1276               A             1.00000 EA  L I O  M  No
2 800-0472                               ASY,SUB,PCBA,SMK,DC,NGS,UL               C   5000      1.00000 EA  U J M  M  No
3 810-1962                              PCB,SMK,AC/DC,W/HUSH                     6   3561      1.00000 EA  U J M  M  No
3 810-2053                              CAP,CD,.01uf,<B1>20%50V,RADIAL              A             2.00000 EA  U J M  M  No
3 810-2788                              SOUNDER,PIEZO,EXTERNAL DRIVE,3.5KHz      A   5039      1.00000 EA  U J M  M  No
3 950-0087                              SCHEMATIC,0976(CA) DC,SMOKE ALARM        1   4230      1.00000 EA  L I O  M  No
3 1245-6503                             SWITCH,SN PLATED                         A             2.00000 EA  U J M  M  No

the first example does show an issue with the data, while my second command corrects for that oddity.

> awk -v space=$space '{print substr($0,1,37)substr(space,1,substr($0,1,1))substr($0,41)}' file126 | tr "~" " "
3 910-0145                              BAT. CARRIER,TOP,SMK,FRONT LOAD BAT.     A             1.00000 EA  U J M  M  No
3 910-0146                              BAT.CARRIER,BTM,SMK,FRONT LOAD BAT.      C   3872      1.00000 EA  U J M  M  No
3 910-0147                              BAT. LOCKOUT,SMK,FRONT LOAD BAT.         A             1.00000 EA  U J M  M  No
3 1276-7102                             LABEL,BATTERY WARNING,1276               A             1.00000 EA  L I O  M  No
2 800-0472                              ASY,SUB,PCBA,SMK,DC,NGS,UL               C   5000      1.00000 EA  U J M  M  No
3 810-1962                              PCB,SMK,AC/DC,W/HUSH                     6   3561      1.00000 EA  U J M  M  No
3 810-2053                              CAP,CD,.01uf,<B1>20%50V,RADIAL              A             2.00000 EA  U J M  M  No
3 810-2788                              SOUNDER,PIEZO,EXTERNAL DRIVE,3.5KHz      A   5039      1.00000 EA  U J M  M  No
3 950-0087                              SCHEMATIC,0976(CA) DC,SMOKE ALARM        1   4230      1.00000 EA  L I O  M  No
3 1245-6503                             SWITCH,SN PLATED                         A             2.00000 EA  U J M  M  No
> awk -v space=$space '{print substr($0,1,37)substr(space,1,substr($0,1,1))substr($0,41)}' file126 | tr "~" " " | sed 's/<B1>/ /'
3 910-0145                              BAT. CARRIER,TOP,SMK,FRONT LOAD BAT.     A             1.00000 EA  U J M  M  No
3 910-0146                              BAT.CARRIER,BTM,SMK,FRONT LOAD BAT.      C   3872      1.00000 EA  U J M  M  No
3 910-0147                              BAT. LOCKOUT,SMK,FRONT LOAD BAT.         A             1.00000 EA  U J M  M  No
3 1276-7102                             LABEL,BATTERY WARNING,1276               A             1.00000 EA  L I O  M  No
2 800-0472                              ASY,SUB,PCBA,SMK,DC,NGS,UL               C   5000      1.00000 EA  U J M  M  No
3 810-1962                              PCB,SMK,AC/DC,W/HUSH                     6   3561      1.00000 EA  U J M  M  No
3 810-2053                              CAP,CD,.01uf, 20%50V,RADIAL              A             2.00000 EA  U J M  M  No
3 810-2788                              SOUNDER,PIEZO,EXTERNAL DRIVE,3.5KHz      A   5039      1.00000 EA  U J M  M  No
3 950-0087                              SCHEMATIC,0976(CA) DC,SMOKE ALARM        1   4230      1.00000 EA  L I O  M  No
3 1245-6503                             SWITCH,SN PLATED                         A             2.00000 EA  U J M  M  No
> 

Just wanted to say thanks to joeyg and chris for the help with this problem. I made adjustments to joeyg's code and it formatted everything perfectly. I did come across with some special characters (40 instances) but removed them and added them back after I was completed with the 60,0000 lines. Thanks for everything guys. I couldn't have done it without any of your suggestions.

chris