Converting a fixed-width file to a pipe-delimited file on Linux (Red Hat)

brij_abhi · February 5, 2015, 3:15am

Hi,
I am facing a problem with an awk command.
On HP-UX it behaves as expected, but on Red Hat Linux the same awk code does not give the same result.

The code below converts a fixed-width file to a pipe-delimited file on the HP-UX server.
awk code:

#!/bin/awk -f

NR!=1 {while(substr($0,1,2)!="GA")
      {gsub(/\|/,"-",$0); if(substr($0,1,1)=="H"&&substr($0,343,39)!="RUA DR. EDUARDO SANTOS SILVA 261 - FRAC"){print substr($0,1,11)"|" substr($0,12,12)"|" substr($0,24,4)"|" substr($0,28,3)"|" substr($0,31,40)"|" substr($0,71,14)"|" substr($0,85,2)"|" substr($0,87,4)"|" substr($0,91,10)"|" substr($0,101,46)"|" substr($0,147,35)"|" substr($0,182,8)"|" substr($0,190,2)"|" substr($0,192,2)"|" substr($0,194,1)"|" substr($0,195,2)"|" substr($0,197,2)"|" substr($0,199,2)"|" substr($0,201,2)"|" substr($0,203,2)"|" substr($0,205,8)"|" substr($0,213,16)"|" substr($0,229,3)"|" substr($0,232,3)"|" substr($0,235,2)"|" substr($0,237,2)"|" substr($0,239,2)"|" substr($0,241,8)"|" substr($0,249,2)"|" substr($0,251,12)"|" substr($0,263,40)"|" substr($0,303,40)"|" substr($0,343,40)"|" substr($0,383,40)"|" substr($0,423,40)"|" substr($0,463,40)"|" substr($0,503,40)"|" substr($0,543,40)"|" substr($0,583,40)"|" substr($0,623,40)"|" substr($0,663,40)"|" substr($0,703,40)"|" substr($0,743,40)"|" substr($0,783,40)"|" substr($0,823,40)"|" substr($0,863,5)"|" substr($0,868,1)"|" substr($0,869,1)"|" substr($0,870,2)"|" substr($0,872,32)"|" substr($0,904,16)"|" substr($0,920,16)"|" substr($0,936,2)"|" substr($0,938,2)"|" substr($0,940,2)"|" substr($0,942,12)"|" substr($0,954,4)"|" substr($0,958,8)"|" substr($0,966,2)"|" substr($0,968,6)"|" substr($0,974,3)"|" substr($0,977,10)"|" substr($0,987,4)"|" substr($0,991,10)"|" substr($0,1001,2)"|" substr($0,1003,4)"|" substr($0,1007,40)"|" substr($0,1047,24)"|" substr($0,1071,24)"|" substr($0,1095,24)"|" substr($0,1119,1)"|" substr($0,1120,14)"|" substr($0,1134,2)"|" substr($0,1136,4)"|" substr($0,1140,16)"|" substr($0,1156,14)"|" substr($0,1170,1)"|" substr($0,1171,8)"|" substr($0,1179,8)"|" substr($0,1187,9)"|"  substr($0,1196,20);next}
       else if(substr($0,1,1)=="H"&&substr($0,343,39)=="RUA DR. EDUARDO SANTOS SILVA 261 - FRAC"){print substr($0,1,11)"|" substr($0,12,12)"|" substr($0,24,4)"|"substr($0,28,3)"|" substr($0,31,40)"|" substr($0,71,14)"|" substr($0,85,2)"|" substr($0,87,4)"|" substr($0,91,10)"|" substr($0,101,46)"|" substr($0,147,35)"|" substr($0,182,8)"|" substr($0,190,2)"|" substr($0,192,2)"|" substr($0,194,1)"|" substr($0,195,2)"|" substr($0,197,2)"|" substr($0,199,2)"|" substr($0,201,2)"|" substr($0,203,2)"|" substr($0,205,8)"|" substr($0,213,16)"|" substr($0,229,3)"|" substr($0,232,3)"|" substr($0,235,2)"|" substr($0,237,2)"|" substr($0,239,2)"|" substr($0,241,8)"|" substr($0,249,2)"|" substr($0,251,12)"|" substr($0,263,40)"|" substr($0,303,40)"|" substr($0,343,39)"?|" substr($0,383,40)"|" substr($0,423,40)"|" substr($0,463,40)"|" substr($0,503,40)"|" substr($0,543,40)"|" substr($0,583,40)"|" substr($0,623,40)"|" substr($0,663,40)"|" substr($0,703,40)"|" substr($0,743,40)"|" substr($0,783,40)"|" substr($0,823,40)"|" substr($0,863,5)"|" substr($0,868,1)"|" substr($0,869,1)"|" substr($0,870,2)"|" substr($0,872,32)"|" substr($0,904,16)"|" substr($0,920,16)"|" substr($0,936,2)"|" substr($0,938,2)"|" substr($0,940,2)"|" substr($0,942,12)"|" substr($0,954,4)"|" substr($0,958,8)"|" substr($0,966,2)"|" substr($0,968,6)"|" substr($0,974,3)"|" substr($0,977,10)"|" substr($0,987,4)"|" substr($0,991,10)"|" substr($0,1001,2)"|" substr($0,1003,4)"|" substr($0,1007,40)"|" substr($0,1047,24)"|" substr($0,1071,24)"|" substr($0,1095,24)"|" substr($0,1119,1)"|" substr($0,1120,14)"|" substr($0,1134,2)"|" substr($0,1136,4)"|" substr($0,1140,16)"|" substr($0,1156,14)"|" substr($0,1170,1)"|" substr($0,1171,8)"|" substr($0,1179,8)"|" substr($0,1187,9)"|" substr($0,1196,20);next}
       else if(substr($0,1,1)=="C"){print substr($0,1,13);next}
       else if(substr($0,1,1)=="P"){print substr($0,1,13)"|" substr($0,14,4)"|" substr($0,18,2)"|" substr($0,20,8)"|" substr($0,28,6)"|" substr($0,34,14)"|" substr($0,48,10)"|"substr($0,58,10);next}
       else if(substr($0,1,1)=="D"){print substr($0,1,13)"|" substr($0,14,4)"|" substr($0,18,4)"|" substr($0,22,2)"|" substr($0,24,20)"|" substr($0,44,6)"|" substr($0,50,60)"|" substr($0,110,8)"|" substr($0,118,8)"|" substr($0,126,8)"|" substr($0,134,8)"|" substr($0,142,4)"|" substr($0,146,2)"|" substr($0,148,4)"|" substr($0,152,4)"|" substr($0,156,4)"|" substr($0,160,3)"|" substr($0,163,2)"|" substr($0,165,8)"|" substr($0,173,1)"|" substr($0,174,1)"|" substr($0,175,1)"|" substr($0,176,4)"|" substr($0,180,2)"|" substr($0,182,15)"|" substr($0,197,1)"|" substr($0,198,8)"|" substr($0,206,8)"|" substr($0,214,1)"|" substr($0,215,14)"|" substr($0,229,1)"|" substr($0,230,14)"|" substr($0,244,3)"|" substr($0,247,35)"|" substr($0,282,2)"|" substr($0,284,6)"|" substr($0,290,4)"|" substr($0,294,6)"|" substr($0,300,10);next}
       else if(substr($0,1,1)=="A"){print substr($0,1,13)"|" substr($0,14,4)"|" substr($0,18,4)"|" substr($0,22,8)"|" substr($0,30,2)"|" substr($0,32,8)"|" substr($0,40,8)"|" substr($0,48,4)"|" substr($0,52,8);next}
       else if(substr($0,1,1)=="S"){print substr($0,1,13)"|" substr($0,14,2)"|" substr($0,16,8)"|" substr($0,24,8)"|" substr($0,32,10)"|" substr($0,42,10)"|" substr($0,52,14)"|" substr($0,66,8)"|" substr($0,74,10)"|" substr($0,84,8)"|" substr($0,92,4)"|" substr($0,96,4)"|" substr($0,100,1)"|" substr($0,101,18)"|" substr($0,119,8);next}
       else if(substr($0,1,1)=="B"||substr($0,1,1)=="I"){print substr($0,1,11)"|" substr($0,12,6)"|" substr($0,18,6)"|" substr($0,24,12)"|" substr($0,36,4)"|" substr($0,40,13)"|" substr($0,53,10)"|" substr($0,63,8)"|" substr($0,71,15)"|" substr($0,86,13)"|" substr($0,99,10);next}}}
      END{if(substr($0,1,2)=="GA"){close($testfile)}}

In the file, one record contains special characters such as á and í as part of the data. On HP-UX, after converting the file to pipe-delimited format, the new file contains all the data, including those characters. Example:
|ES 28805 Alcal de Henares |TEL GLOBAL TE S.A. |C/ Gran Va 28 | |

But when I use the same code on Linux, it drops the data after the special character and all the following columns come out empty:
ES 28805 Alcal||||||||||||||||||||||||||||||||||||||||||||||

Please advise.

rbatte1 · February 5, 2015, 9:33am

Can you show us some sample input and the required output, please?

disedorgue · February 5, 2015, 10:19am

Hi,
Can you show the locale settings of your HP-UX and Linux machines?

RudiC · February 5, 2015, 10:23am

Do you have identical locales on both hosts?

On a Linux machine, an empty FS is possible, and something like

awk  '{MX=split (FLDS, P, " "); for (i=1; i<=MX; i++) $(P[i])=$(P[i]) "|" } 1' FS="" OFS="" FLDS="7 19 32" file
RUA DR.| EDUARDO SAN|TOS SILVA 261| - FRA

could work?
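
A one-character-per-field split with an empty FS relies on behaviour that POSIX leaves unspecified, so a substr()-based variant of the same idea may be more portable. The following is only a sketch: the function name cut_at is made up for illustration, and LC_ALL=C is an assumption that the cut positions are byte offsets in a single-byte encoded file.

```shell
# cut_at "POS1 POS2 ...": read lines from stdin and insert "|" after each
# listed position (positions must be ascending). Hypothetical helper name.
cut_at() {
    # LC_ALL=C makes awk count bytes, which matches a byte-oriented fixed-width file.
    LC_ALL=C awk -v cuts="$1" '
    BEGIN { n = split(cuts, pos, " ") }
    {
        out = ""; start = 1
        for (i = 1; i <= n; i++) {
            # copy the piece ending at pos[i], then append the delimiter
            out = out substr($0, start, pos[i] - start + 1) "|"
            start = pos[i] + 1
        }
        # whatever remains after the last cut position
        print out substr($0, start)
    }'
}

printf '%s\n' 'RUA DR. EDUARDO SANTOS SILVA 261 - FRA' | cut_at "7 19 32"
```

With the positions "7 19 32" this gives the same split as the sample output line shown above.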

brij_abhi · February 6, 2015, 1:35am

Hi rbatte1,
Sample data:

H721615296070R86593102170  999 OPN ID69                               20141117171817T1ZOR 700016901 4791032964                                    BVOM.ES@kk.COM                     20141117072  00  RE  DP20141031SYI1          074704700000532014111411B600*SAP01  Telefonica                              Avenida Punto Com, 23                                                                                                   ES 28805 Alcal� de Henares              TELEFONICA GLOBAL TECHNOLOGY S.A.       C/ Gran V�a 28                                                                                                          ES 28013 Madrid                         TELEFONICA GLOBAL TECHNOLOGY S.A.       C/ Gran V�a 28                                                                                                          ES 28013 Madrid                         HPT&CE3Z1                                       90592.78       113485.58 ECESDP28805       700020141117ORADIN13   0500701206    91717059  ZZ                                                                                                                     20141117173030BBWAT BBTO            20141117173030 0000000000000000 1.25270

Expected Output:

H7216152960|70R865931021|70  |999| OPN ID69                               |20141117171817|T1|ZOR |700016901 |4791032964                                    |BVOM.ES@kk.COM                     |20141117|07|2 | |00|  |RE|  |DP|20141031|SYI1          07|470|470|00|00|53|20141114|11|B600*SAP01  |Telefonica                              |Avenida Punto Com, 23                   |                                        |                                        |ES 28805 Alcal� de Henares              |TELEFONICA GLOBAL TECHNOLOGY S.A.       |C/ Gran V�a 28                          |                                        |                                        |ES 28013 Madrid                         |TELEFONICA GLOBAL TECHNOLOGY S.A.       |C/ Gran V�a 28                          |                                        |                                        |ES 28013 Madrid                         |HPT&C|E|3|Z1|                                |       90592.78 |      113485.58 |EC|ES|DP|28805       |7000|20141117|OR|ADIN13|   |0500701206|    |91717059  |ZZ|    |                                        |                        |                        |                        | |20141117173030|BB|WAT |BBTO            |20141117173030| |00000000|00000000| 1.25270 |

Hi RudiC,
I tried your code, but the second line gave me an error, so I ran it only up to "file". That resolved the issue in that the output file now contains the special characters, but it does not give the expected result: extra pipe delimiters appear. Please check the expected output. I also have to apply the conditions I gave in my previous post.

Please find attached the locale settings for both machines.

Thanks.

RudiC · February 6, 2015, 5:28am

With a subset of field lengths

echo $FLDS
11 23 26 29 39 53 55 59 69 80 95 103 105 107 108 110 111 113 114 116 124 131 134

extracted from your Expected output, and having corrected your sample data (adding extra spaces that disappeared because you did not use code tags), the result of

awk  '{MX=split (FLDS, P, " "); for (i=1; i<=MX; i++) $(P)=$(P) "|" } 1' FS="" OFS="" FLDS="$FLDS" file > file2

cat file[24] | less
H7216152960|70R865931021|70 |999| OPN ID69 |20141117171817|T1|ZOR |700016901 |4791032964 |BVOM.ES@kk.COM |20141117|07|2 | |00| |RE| |DP|20141031|SYI1 07|470|470|00|00|53|20141114|
H7216152960|70R865931021|70 |999| OPN ID69 |20141117171817|T1|ZOR |700016901 |4791032964 |BVOM.ES@kk.COM |20141117|07|2 | |00| |RE| |DP|20141031|SYI1 07|470|470|00|00|53|20141114|

is pretty close to what you expect (second line)...

I admit problems might arise with non-ASCII characters, as they occupy two or more bytes in a multi-byte encoding such as UTF-8, but I presume it's difficult to create a fixed-width file with non-ASCII text.

disedorgue · February 6, 2015, 7:58am

Hi,

I'm not so sure. For example:

$ echo "Gran Vía" | od -c
0000000   G   r   a   n       V 303 255   a  \n
0000012
$ echo "Gran Vía" | LANG=C awk '{print length($0)}'
9
$ echo "Gran Vía" | LANG=fr_FR.UTF-8 awk '{print length($0)}'
8

Regards.

Don_Cragun · February 6, 2015, 12:34pm

In a correctly working awk utility, length($0) returns the number of (single-byte or multi-byte) characters in that line (not including the terminating <newline> character).

In some (buggy) awk utilities, length($0) returns the number of bytes in that line (not including the terminating <newline> character). On these systems, substr(string, start, count) may return a substring of string that is unusable because it does not start and/or end on a character boundary.

On a correctly working awk when using a locale based on a UTF-8 codeset, the command:

awk 'BEGIN{print "Vía", length("Vía");print "Via", length("Via")}'

will print:

Vía 3
Via 3

while a buggy version might print:

Vía 4
Via 3

Note also that "fixed-length lines" can mean several things; a fixed number of characters per line and a fixed number of bytes per line are the two most common meanings. In ASCII and other single-byte codesets, the two happen to be the same thing. With a codeset like UTF-8, a character is represented by one to four bytes (the original UTF-8 design allowed up to six). You can also have a fixed number of display columns per line, which can differ from both of the above, with some characters even having a variable width (a <tab> typically takes 1 to 8 display columns, but can take more depending on how tab stops are set). And on displays with proportional-width fonts, every character can have a different (and sometimes variable) column width, both for characters like <tab> and due to kerning effects.
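
To make the bytes-versus-characters distinction concrete, here is a small sketch built on one 40-byte address field from the sample record. It assumes the input is ISO-8859-1, so á is the single byte 0xE1 (octal 341), and it forces byte semantics with LC_ALL=C; the function name field_demo is made up for illustration.

```shell
# Build "ES 28805 Alcal<0xE1> de Henares" padded to 40 bytes, followed by the
# start of the next field, then split it at byte 40 using byte semantics.
field_demo() {
    # %14s with an empty argument emits the 14 padding spaces (26 + 14 = 40 bytes)
    printf 'ES 28805 Alcal\341 de Henares%14sTELEFONICA\n' '' |
        LC_ALL=C awk '{
            print length($0)                              # byte count of the line
            print substr($0, 1, 40) "|" substr($0, 41)    # cut after byte 40
        }'
}

field_demo
```

Under LC_ALL=C this reports a length of 50 and places the delimiter right after byte 40, so the 0xE1 byte passes through untouched. In a UTF-8 locale the same byte is not a valid character, and how an awk implementation treats it there may vary, which could explain output that stops at the special character.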

brij_abhi · February 10, 2015, 6:34am

Hi RudiC,

Your solution somehow resolved the issue, but I am not able to fit it into my code. Could you help me with this, using the code from my first post?

Thanks..
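
One possible way to combine a width list with the per-record-type logic from the first post is a small lookup table keyed on the first character. This is only a sketch under stated assumptions: the function name split_records is made up, only the C, P and A layouts are filled in (widths copied from the substr() calls in the original script), unknown record types are simply skipped, and LC_ALL=C assumes the widths are byte counts.

```shell
# split_records: read fixed-width records on stdin, print them pipe-delimited.
split_records() {
    LC_ALL=C awk '
    BEGIN {
        # field widths per record type, taken from the original substr() calls
        w["C"] = "13"
        w["P"] = "13 4 2 8 6 14 10 10"
        w["A"] = "13 4 4 8 2 8 8 4 8"
    }
    {
        type = substr($0, 1, 1)
        if (!(type in w)) next        # no layout known for this record type
        gsub(/\|/, "-")               # keep literal pipes out of the data
        n = split(w[type], wid, " ")
        out = ""; sep = ""; start = 1
        for (i = 1; i <= n; i++) {
            out = out sep substr($0, start, wid[i])
            sep = "|"
            start += wid[i]
        }
        print out
    }'
}

printf '%s\n' 'A70R8659310210100  0120141125  20141125201412040002      48' | split_records
```

The remaining record types (H, D, S, B and I) could be added as further entries in the table; the special case for one particular H record in the original script is not reproduced here.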

RudiC · February 10, 2015, 7:49am

Not sure about what your request is. If it has to do with the error on Linux, please attach a reasonable part of your input file to your next post. If not, state more precisely what you're after.

brij_abhi · February 10, 2015, 7:59am

Hi RudiC,
Please find attached the script and the data file that I am converting from fixed width to pipe-delimited.
I run the script with the following command:
awk -f split_files.awk Z1GALILEO-ZOR-RECV

I changed the file extensions because I had trouble uploading the attachments.

RudiC · February 10, 2015, 8:50am

Your script works out of the box and yields

H7216152960|70R865931021|70  |999| OPN ID69                               |20141117171817|T1|ZOR |700016901 |4791032964                                    |BVOM.ES@HP.COM         
D70R865931021|0100|  01|  |AJ716B              |000010| PGI TP25                                                   |      48|      48|       0|        |4856|1Y|8OCZ|ZB04|8O00|952|
A70R865931021|0100|  01|TB-ADV  |  |        |        |0001|        
A70R865931021|0100|  01|20141125|  |20141125|20141204|0002|      48
D70R865931021|0200|  01|  |HA115A1             |000020| OPN                                                        |       1|       1|       0|        |5060|UW|7000|ZTDI|    |000|
D
etc.

, no errors. Your input file has some locale-dependent chars in it, though:

file /tmp/Z1GALILEO-ZOR-RECV.txt 
/tmp/Z1GALILEO-ZOR-RECV.txt: ISO-8859 text, with very long lines

You might want to eliminate or convert those with e.g. iconv or recode.
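
A minimal sketch of the iconv route follows. The source codeset ISO-8859-1 is an assumption (file only reported "ISO-8859 text"), and the function name to_utf8 is made up for illustration.

```shell
# to_utf8: re-encode ISO-8859-1 input (stdin) as UTF-8 (stdout).
to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8
}

# i-acute is one byte (octal 355) before conversion and two bytes (303 255) after.
printf 'Gran V\355a 28\n' | to_utf8 | od -c
```

Because the conversion changes byte counts, it is probably safest to split the fields first, while the file is still one byte per character, and convert afterwards, for example with something like LC_ALL=C awk -f split_files.awk Z1GALILEO-ZOR-RECV | iconv -f ISO-8859-1 -t UTF-8.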