Shell script to add delimiters to an undelimited, variable-length text file

Hi,

I have an undelimited, variable-length text file with the following schema:

Column Name    Data length
Firstname      5
Lastname       5
age            3
phoneno1       10
phoneno2       10
phoneno3       10

Sample data:

sunilbassi031012345678901234567890123456789
sunilbassi03101234567890123456789
sunilbassi0310123456789

Desired output:

sunil,bassi,031,0123456789,0123456789,0123456789
sunil,bassi,031,0123456789,0123456789,
sunil,bassi,031,0123456789,,

using a shell script. Can anyone please help me write a script for this?

awk '{ print substr($0,1,5),substr($0,6,5),substr($0,11,3),substr($0,14,10),substr($0,24,10),substr($0,34,10); } ' OFS=, filename

try also:

awk '
NR==FNR {if (NF==2) field[++fc]=$2 ; next}
{ el=$0; cc=1; for (i=1; i<=fc; i++) {$i=substr(el,cc,field[i]); cc+=field[i]}}
1
' OFS=, schema_file data_file

Thanks Bipin :).. though my actual requirement is quite a bit bigger.
Suppose I have a record length of 1500 characters for a file with, say, 100 columns, and I have the length defined for every column (as in the earlier example, the record length is not fixed):

sunilbassi031012345678901234567890123456789............1500(Characters)

Say 1200 characters are fixed length, covering the first 70 columns, and the remaining 30 columns are variable length. As mentioned in the previous example, the maximum length across all columns will be 1500 and the least will be 1200.

sunilbassi031012345678901234567890123456789......1500 (100 columns)
sunilbassi031012345678901234567890123456789......1500 (90 columns)
sunilbassi031012345678901234567890123456789......1500 (80 columns)
sunilbassi031012345678901234567890123456789......1200 (70 columns) Least 

Now I have to separate all 100 columns with a delimiter, and I need some automation (I actually have a case where I need to separate 900 columns), since hard-coding a substr() call for every column doesn't seem feasible.

Thanks in Advance!!

I don't understand. If the remaining columns are variable length, what criteria will you use to separate the data?

BTW, if the columns are fixed length, you can define the field lengths in a file or a variable and perform the separation.
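
For illustration, a minimal sketch of the variable-based variant, assuming the six widths from the original post are passed in as a space-separated string (the widths value and the data_file name are just placeholders):

awk -v widths="5 5 3 10 10 10" '
BEGIN { fc = split(widths, w, " ") }           # turn the width list into an array
{
  out = ""; pos = 1
  for (i = 1; i <= fc; i++) {                  # walk the fixed-width fields
    out = out substr($0, pos, w[i])            # cut the next field (empty if the line is short)
    if (i < fc) out = out ","                  # delimiter between fields
    pos += w[i]                                # advance to the next field start
  }
  print out
}' data_file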

Bipin,

By variable length I mean: as in the previous example I took three phone numbers, so someone may have three different phone numbers and someone may have none...

sunilbassi031012345678901234567890123456789 
sunilbassi03101234567890123456789 
sunilbassi0310123456789

I have the maximum length defined for all columns, as well as the individual column lengths.
Now I want to separate them with a delimiter.

Furthermore, it would help if spaces could be added to this variable-length file so it is converted into a fixed-length, comma-separated file, e.g.:

sunil,bassi,031,0123456789,0123456789,0123456789 
sunil,bassi,031,0123456789,0123456789,           
sunil,bassi,031,0123456789,          ,
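
One way to get that padded, fixed-width output is to build a left-justified printf format from each width, so short or missing fields are blank-filled; a minimal sketch, assuming the same schema_file and data_file names used above:

awk '
NR==FNR { if (NF==2) field[++fc]=$2; next }    # read widths from the schema file
{
  cc = 1; line = ""
  for (i = 1; i <= fc; i++) {
    # pad (or blank-fill) each field out to its full width
    line = line sprintf("%-" field[i] "s", substr($0, cc, field[i]))
    if (i < fc) line = line ","
    cc += field[i]
  }
  print line
}' schema_file data_file

Note that this pads the last field as well, so every record comes out at the same total length.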


Could you please explain how the awk command works?

 
awk '
NR==FNR {if (NF==2) field[++fc]=$2 ; next}                                      # read column widths from schema file (second field)
{ el=$0; cc=1; for (i=1; i<=fc; i++) {$i=substr(el,cc,field[i]); cc+=field[i]}}  # split line based on column widths
1                                                                               # print line (1 = true = print line)
' OFS=, schema_file data_file                                                   # use , as field separator, specify input files

Thanks rdrtx1 for your efforts :slight_smile: I am a newbie on Unix, basically a data warehouse guy, so thanks again.
BTW, do we need to create a specific schema file in this scenario?

In this example solution, yes. I think a schema file is easy to maintain if it needs to be updated.


Okay, it would be really great if you could please share a sample schema file for the example taken earlier. Pardon me if I'm asking too much :slight_smile:


One more thing where I would require your input: will the script that you mentioned be able to put a comma (delimiter) for a field that has no data? Example:

Sunilbassi031012345678901234567890123456789
Sunilbassi03101234567890123456789
Sunilbassi0310123456789

That is, for the 2nd and 3rd records: will all of them have 5 commas (delimiters)?

For the example, the output will be like:

sunil,bassi,031,0123456789,0123456789,0123456789
yyyil,bassi,031,0123456789,0123456789,
xxxil,bassi,031,0123456789,,

printf can take maximum-length arguments, but this feels like a Perl or C/C++/Java level task. However, bash has substring expansion, ${parameter:offset:length}, so you can just read the lengths into two arrays, offset and length, and go through them taking apart every line.
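
A minimal bash sketch of that idea, assuming the same two-column schema_file (name and width per line) and data_file names used earlier:

#!/bin/bash
# build offset/length arrays from the schema file (second column = width)
offsets=() ; lengths=() ; pos=0
while read -r name width; do
    [ -z "$width" ] && continue
    offsets+=("$pos")
    lengths+=("$width")
    pos=$((pos + width))
done < schema_file

# cut every line of the data file into fields and join them with commas
while IFS= read -r line; do
    out=""
    for i in "${!offsets[@]}"; do
        field=${line:${offsets[$i]}:${lengths[$i]}}    # empty if the line is too short
        out+="$field"
        (( i < ${#offsets[@]} - 1 )) && out+=","
    done
    printf '%s\n' "$out"
done < data_file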

Thanks rdrtx1, could you please help me by providing the schema file (or any sample for reference) for the mentioned example?

The inner loop of any solution is: for every line, for n = 0..max, print the substring of the line at offset[n] for length[n]; if n = max, print the line separator and break, else print the field separator and continue. Make sure your solution is stable for short lines (it truncates any partial field, and missing fields all go blank). The setup is to read the field-length file and set offset[0..n] and length[0..n] from the lengths. It can be argued that recalculating the offset on the fly might be faster than fetching it from an array; try both and see, if speed is still an issue. I would write it in C as a state machine copying one byte at a time, c = getchar(); . . . putchar( c ), inserting field separators as lengths are exhausted. That is about as fast as it gets.

Hi rdrtx1,

I tried the method:

$ cat sample.txt
sunilbassi031012345678901234567890123456789
sunilbassi03101234567890123456789
sunilbassi0310123456789
$ cat scema.txt
LASTNAME 5
FIRST-NAME 5
AGE 3
PHONE1 10
PHONE2 10
PHONE3 10

and the method we discussed:

$ awk '
> NR==FNR {if (NF==2) field[++fc]=$2 ; next}
> { el=$0; cc=1; for (i=1; i<=fc; i++) {$i=substr(el,cc,field[i]); cc+=field[i]}}
> 1
> ' OFS=, scema.txt sample.txt
sunil,bassi,031,0123456789,0123456789,0123456789
sunil,bassi,031,0123456789,0123456789,
sunil,bassi,031,0123456789,

The outcome doesn't have two commas for the last record. Can you please look into this and confirm?

Seems like for the short last line there should be two trailing commas (because it cares) or none (because it does not care and they are a waste).

You are correct, it is working beautifully and the outcome is the same as you mentioned. I was using Cygwin because my Unix server was down; I tried it over PuTTY after the server came up today and, as mentioned, it works beautifully.

Many Thanks!!! :slight_smile:
Gaurav