Shell script to add delimiters to an undelimited, variable-length text file

Hi,

I have an undelimited, variable-length text file with the following schema:

Column Name    Data length
Firstname      5
Lastname       5
age            3
phoneno1       10
phoneno2       10
phoneno3       10

Sample data:

sunilbassi031012345678901234567890123456789
sunilbassi03101234567890123456789
sunilbassi0310123456789

Desired output:

sunil,bassi,031,0123456789,0123456789,0123456789
sunil,bassi,031,0123456789,0123456789,
sunil,bassi,031,0123456789,,

using a shell script. Can anyone please help me write a script for this?

awk '{ print substr($0,1,5),substr($0,6,5),substr($0,11,3),substr($0,14,10),substr($0,24,10),substr($0,34,10); } ' OFS=, filename

try also:

awk '
NR==FNR {if (NF==2) field[++fc]=$2 ; next}
{ el=$0; cc=1; for (i=1; i<=fc; i++) {$i=substr(el,cc,field[i]); cc+=field[i]}}
1
' OFS=, schema_file data_file

Thanks Bipin :).. though my actual requirement is quite a bit bigger.
Suppose I have a record length of 1500 characters for a file with, say, 100 columns, and I have the length defined for every column (as in the earlier example, the record length is not fixed):

sunilbassi031012345678901234567890123456789............1500(Characters)

Say 1200 characters are fixed length, covering the first 70 columns, and the remaining 30 columns are variable length. As mentioned in the previous example, the maximum length across all columns will be 1500 and the least will be 1200.

sunilbassi031012345678901234567890123456789......1500 (100 columns)
sunilbassi031012345678901234567890123456789......1500 (90 columns)
sunilbassi031012345678901234567890123456789......1500 (80 columns)
sunilbassi031012345678901234567890123456789......1200 (70 columns) Least 

Now I have to separate all 100 columns with a delimiter, and I need some automation (I actually have a case where I need to separate 900 columns), since hard-coding a substr() call for every column doesn't seem feasible.

Thanks in Advance!!

I don't understand. If the remaining columns are variable length, what criteria will you use to separate the data?

BTW, if the columns are fixed length, you can define the field lengths in a file or a variable and perform the separation.
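
For illustration, a minimal sketch of the variable-based variant, assuming the six widths from the original post are passed in as a space-separated string (the widths value and the data_file name are just placeholders):

awk -v widths="5 5 3 10 10 10" '
BEGIN { fc = split(widths, w, " ") }           # turn the width list into an array
{
  out = ""; pos = 1
  for (i = 1; i <= fc; i++) {                  # walk the fixed-width fields
    out = out substr($0, pos, w[i])            # cut the next field (empty if the line is short)
    if (i < fc) out = out ","                  # delimiter between fields
    pos += w[i]                                # advance to the next field start
  }
  print out
}' data_file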

Bipin,

By variable length I mean: as in the previous example I took three phone numbers, so someone may have three different phone numbers and someone may have none...

sunilbassi031012345678901234567890123456789 
sunilbassi03101234567890123456789 
sunilbassi0310123456789

I have the maximum length defined for all columns, as well as the individual column lengths.
Now I want to separate them with a delimiter.

Furthermore, it would help if spaces could be added to this variable-length file so it is converted into a fixed-length, comma-separated file, e.g.:

sunil,bassi,031,0123456789,0123456789,0123456789 
sunil,bassi,031,0123456789,0123456789,           
sunil,bassi,031,0123456789,          ,
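
One way to get that padded, fixed-width output is to build a left-justified printf format from each width, so short or missing fields are blank-filled; a minimal sketch, assuming the same schema_file and data_file names used above:

awk '
NR==FNR { if (NF==2) field[++fc]=$2; next }    # read widths from the schema file
{
  cc = 1; line = ""
  for (i = 1; i <= fc; i++) {
    # pad (or blank-fill) each field out to its full width
    line = line sprintf("%-" field[i] "s", substr($0, cc, field[i]))
    if (i < fc) line = line ","
    cc += field[i]
  }
  print line
}' schema_file data_file

Note that this pads the last field as well, so every record comes out at the same total length.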


Could you please explain how the awk command works?

 
awk '
NR==FNR {if (NF==2) field[++fc]=$2 ; next}                                      # read column widths from schema file (second field)
{ el=$0; cc=1; for (i=1; i<=fc; i++) {$i=substr(el,cc,field[i]); cc+=field[i]}}  # split line based on column widths
1                                                                               # print line (1 = true = print line)
' OFS=, schema_file data_file                                                   # use , as field separator, specify input files

Thanks rdrtx1 for your efforts :slight_smile: I am a newbie on Unix, basically a data warehouse guy, so thanks again.
BTW, do we need to create a specific schema file in this scenario?

In this example solution, yes. I think a schema file is easy to maintain if it needs to be updated.


Okay, it would be really great if you could please share a sample schema file for the example taken earlier. Pardon me if I'm asking too much :slight_smile:


One more thing where I would require your input: will the script that you mentioned be able to put a comma (delimiter) for a field that has no data? Example:

Sunilbassi031012345678901234567890123456789
Sunilbassi03101234567890123456789
Sunilbassi0310123456789

That is, for the 2nd and 3rd records: will all of them have 5 commas (delimiters)?

For the example, the output will be like:

sunil,bassi,031,0123456789,0123456789,0123456789
yyyil,bassi,031,0123456789,0123456789,
xxxil,bassi,031,0123456789,,

printf can take maximum-length arguments, but this feels like a Perl or C/C++/Java level task. However, bash has substring expansion, ${parameter:offset:length}, so you can just read the lengths into two arrays, offset and length, and go through them taking apart every line.
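
A minimal bash sketch of that idea, assuming the same two-column schema_file (name and width per line) and data_file names used earlier:

#!/bin/bash
# build offset/length arrays from the schema file (second column = width)
offsets=() ; lengths=() ; pos=0
while read -r name width; do
    [ -z "$width" ] && continue
    offsets+=("$pos")
    lengths+=("$width")
    pos=$((pos + width))
done < schema_file

# cut every line of the data file into fields and join them with commas
while IFS= read -r line; do
    out=""
    for i in "${!offsets[@]}"; do
        field=${line:${offsets[$i]}:${lengths[$i]}}    # empty if the line is too short
        out+="$field"
        (( i < ${#offsets[@]} - 1 )) && out+=","
    done
    printf '%s\n' "$out"
done < data_file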

Thanks rdrtx1, could you please help me by providing the schema file (or any sample for reference) for the mentioned example?

The inner loop of any solution is: for every line, for n = 0..max, print the substring of the line at offset[n] for length[n]; if n = max, print the line separator and break, else print the field separator and continue. Make sure your solution is stable for short lines (it truncates any partial field, and missing fields all go blank). The setup is to read the field-length file and set offset[0..n] and length[0..n] from the lengths. It can be argued that recalculating the offset on the fly might be faster than fetching it from an array; try both and see, if speed is still an issue. I would write it in C as a state machine copying one byte at a time, c = getchar(); . . . putchar( c ), inserting field separators as lengths are exhausted. That is about as fast as it gets.

Hi rdrtx1,

I tried the method:

$ cat sample.txt
sunilbassi031012345678901234567890123456789
sunilbassi03101234567890123456789
sunilbassi0310123456789
$ cat scema.txt
LASTNAME 5
FIRST-NAME 5
AGE 3
PHONE1 10
PHONE2 10
PHONE3 10

and the method we discussed:

$ awk '
> NR==FNR {if (NF==2) field[++fc]=$2 ; next}
> { el=$0; cc=1; for (i=1; i<=fc; i++) {$i=substr(el,cc,field[i]); cc+=field[i]}}
> 1
> ' OFS=, scema.txt sample.txt
sunil,bassi,031,0123456789,0123456789,0123456789
sunil,bassi,031,0123456789,0123456789,
sunil,bassi,031,0123456789,

The outcome doesn't have two commas for the last record. Can you please look into this and confirm?

Seems like for the short last line there should be two trailing commas (because it cares) or none (because it does not care and they are a waste).

You are correct, it is working beautifully and the outcome is the same as you mentioned. I was using Cygwin because my Unix server was down; I tried it over PuTTY after the server came up today and, as mentioned, it works beautifully.

Many Thanks!!! :slight_smile:
Gaurav