File splitting according to the length of the fields

nua7 · June 9, 2015, 8:24am

Hi All,
I have two files:
1> Data file
2> Info file, which has the field lengths and start positions.

Is there a way to create a comma-delimited file according to the field lengths and start positions?

Data file :

R-0000017611N-00000350001095ANZU01
A00000017611N000000350001095ANZU02
R-0000019427N-00000265001202BGYI03
R-0000005977N-00000092001202BGYI03
R-0000017195N-00000995001353B1IZ03
A00000099500N000000995001353B1IZ04
R-0000258547N-00002266002019AXAJ01
A00000258547N000002266002019AXAJ02
R-0000012277N-00000216002026BLCF03
A00000012277N000000216002026BLCF04

Info file :


Field Name	Length	Start Position

ADJ-TYPE-CODE	1	1
ALLOW-AMT	11	2
CAP-SRVC-NO	1	13
BILL-AMT	11	14

RudiC · June 9, 2015, 9:12am

Any ideas / attempts from your side?

---------- Post updated at 15:12 ---------- Previous update was at 14:52 ----------

However, try

awk '
NR < 3          {next}
NR == FNR       {HD=HD DL $1; DL=","
                 Pos[++n]=$2+$3
                 next}
FNR == 1        {print HD; next}
                {for (i=n; i>0; i--) $(Pos[i])=DL $(Pos[i]) }
1
' info FS="" OFS="" data
ADJ-TYPE-CODE,ALLOW-AMT,CAP-SRVC-NO,BILL-AMT
R,-0000017611,N,-0000035000,1095ANZU01
A,00000017611,N,00000035000,1095ANZU02
R,-0000019427,N,-0000026500,1202BGYI03
R,-0000005977,N,-0000009200,1202BGYI03
R,-0000017195,N,-0000099500,1353B1IZ03
A,00000099500,N,00000099500,1353B1IZ04
R,-0000258547,N,-0000226600,2019AXAJ01
A,00000258547,N,00000226600,2019AXAJ02
R,-0000012277,N,-0000021600,2026BLCF03
A,00000012277,N,00000021600,2026BLCF04

nua7 · June 9, 2015, 9:26am

Hi,
This is what I tried, and it works, but the info file can change, so I wanted to generalize it.

awk '{print substr($1,1,1)","substr($1,11,2)" ,"substr($1,1,13)","substr($1,11,14)" ,"$3}' test
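
For reference, awk's substr takes (string, start, length), so a hard-coded cut for the layout posted in #1 might look like the sketch below. The function name split_hardcoded is made up for illustration, and the final substr without a length simply keeps the rest of the line.

```shell
# Hard-coded sketch for the posted layout only; substr(s, start, length).
# split_hardcoded is a hypothetical name, not something from the thread.
split_hardcoded() {
    awk '{
        printf "%s,%s,%s,%s,%s\n",
            substr($0, 1, 1),     # ADJ-TYPE-CODE: start 1,  length 1
            substr($0, 2, 11),    # ALLOW-AMT:     start 2,  length 11
            substr($0, 13, 1),    # CAP-SRVC-NO:   start 13, length 1
            substr($0, 14, 11),   # BILL-AMT:      start 14, length 11
            substr($0, 25)        # remainder of the line
    }' "$@"
}
```

Calling split_hardcoded with the data file as its argument should give one comma-separated line per record, but the positions have to be edited by hand whenever the info file changes.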

RudiC · June 9, 2015, 9:48am

Version using substr:

awk '
NR < 3          {next}
NR == FNR       {HD=HD DL $1; DL=","
                 L[++n]=$2
                 S[n]  =$3
                 next}
FNR == 1        {print HD
                 next
                }
                {for (i=1; i<=n; i++) printf "%s%s", (i>1?",":""), substr($0, S[i], L[i])
                 printf "\n"
                }
' info  data

nua7 · June 10, 2015, 5:18am

Hi Rudi,
This does not work, unless I am missing something.

Here is the output that I see:

,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,
,,

RudiC · June 10, 2015, 7:38am

Both versions yield a meaningful result for me. Which one did you use?

---------- Post updated at 13:38 ---------- Previous update was at 13:36 ----------

The substr version has the last field missing; modify it like this:

awk '
NR < 3          {next}
NR == FNR       {HD=HD DL $1; DL=","
                 L[++n]=$2
                 S[n]  =$3
                 next}
FNR == 1        {print HD
                 next
                }
                {for (i=1; i<=n; i++) printf "%s,", substr($0, S[i], L[i])
                 printf "%s\n", substr ($0,S[n]+L[n])
                }
' info  data

Akshay_Hegde · June 10, 2015, 7:45am

This may help you:

[akshay@localhost tmp]$ cat description
Field Name	Length	Start Position
ADJ-TYPE-CODE	1	1
ALLOW-AMT	11	2
CAP-SRVC-NO	1	13
BILL-AMT	11	14

[akshay@localhost tmp]$ cat datafile 
R-0000017611N-00000350001095ANZU01
A00000017611N000000350001095ANZU02
R-0000019427N-00000265001202BGYI03
R-0000005977N-00000092001202BGYI03
R-0000017195N-00000995001353B1IZ03
A00000099500N000000995001353B1IZ04
R-0000258547N-00002266002019AXAJ01
A00000258547N000002266002019AXAJ02
R-0000012277N-00000216002026BLCF03
A00000012277N000000216002026BLCF04

[akshay@localhost tmp]$ cat extract.awk
function extract(str,field,i)
{ 
	for(i=1; i<=c; i++)
	{
		field = substr($0,A[i,3],A[i,2])  
		str   = str ? str OFS field : field
	}
	return str
}

FNR==NR{
		if(NR==1)next
		hdr = hdr ? hdr OFS $1 : $1 	
		c++
		for(i=2; i<=NF; i++)A[c,i]=$i
		next
}

{
	print FNR==1 ? hdr RS extract() : extract()
}

[akshay@localhost tmp]$ awk -vOFS="," -f extract.awk description datafile 
ADJ-TYPE-CODE,ALLOW-AMT,CAP-SRVC-NO,BILL-AMT
R,-0000017611,N,-0000035000
A,00000017611,N,00000035000
R,-0000019427,N,-0000026500
R,-0000005977,N,-0000009200
R,-0000017195,N,-0000099500
A,00000099500,N,00000099500
R,-0000258547,N,-0000226600
A,00000258547,N,00000226600
R,-0000012277,N,-0000021600
A,00000012277,N,00000021600

RudiC · June 10, 2015, 7:52am

Ohhh - does your info file look EXACTLY as posted in #1?

---------- Post updated at 13:52 ---------- Previous update was at 13:50 ----------

@Akshay Hegde: it looks like the last field is missing?

Akshay_Hegde · June 10, 2015, 7:55am

I didn't follow you; please explain. It looks fine to me.

@RudiC, with your most recent code I get this:

[akshay@localhost tmp]$ cat description
Field Name	Length	Start Position
ADJ-TYPE-CODE	1	1
ALLOW-AMT	11	2
CAP-SRVC-NO	1	13
BILL-AMT	11	14

[akshay@localhost tmp]$ cat datafile 
R-0000017611N-00000350001095ANZU01
A00000017611N000000350001095ANZU02
R-0000019427N-00000265001202BGYI03
R-0000005977N-00000092001202BGYI03
R-0000017195N-00000995001353B1IZ03
A00000099500N000000995001353B1IZ04
R-0000258547N-00002266002019AXAJ01
A00000258547N000002266002019AXAJ02
R-0000012277N-00000216002026BLCF03
A00000012277N000000216002026BLCF04


[akshay@localhost tmp]$ awk '
NR < 3          {next}
NR == FNR       {HD=HD DL $1; DL=","
                 L[++n]=$2
                 S[n]  =$3
                 next}
FNR == 1        {print HD
                 next
                }
                {for (i=1; i<=n; i++) printf "%s,", substr($0, S[i], L[i])
                 printf "%s\n", substr ($0,S[n]+L[n])
                }
' description datafile
ALLOW-AMT,CAP-SRVC-NO,BILL-AMT
00000017611,N,00000035000,1095ANZU02  --- 4 fields
-0000019427,N,-0000026500,1202BGYI03
-0000005977,N,-0000009200,1202BGYI03
-0000017195,N,-0000099500,1353B1IZ03
00000099500,N,00000099500,1353B1IZ04
-0000258547,N,-0000226600,2019AXAJ01
00000258547,N,00000226600,2019AXAJ02
-0000012277,N,-0000021600,2026BLCF03
00000012277,N,00000021600,2026BLCF04

RudiC · June 10, 2015, 8:25am

I'm not sure what the requestor really needs: the entire line with commas inserted at the specified positions, like

A,00000012277,N,00000021600,2026BLCF04

or just the four fields exactly as specified in the info file, like Akshay Hegde's

A,00000012277,N,00000021600

In Akshay Hegde's example in post #9, the first field is missing because his info file doesn't have the empty second line that the requestor's sample had in post #1.
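
One way to avoid depending on that empty second line is to accept an info line only when its last two columns are numeric, so a header or blank line is skipped wherever it appears. The sketch below is a variation on the substr version, not the code posted above. The names split_by_info, len and start are made up for illustration. It assumes field names contain no whitespace and that the data file has no header line of its own, so the header is printed without consuming the first data record.

```shell
# Sketch: cut fixed-width records according to an info file.
# $1 = info file (name, length, start position), $2 = data file.
# split_by_info is a hypothetical name used for illustration only.
split_by_info() {
    awk '
    NR == FNR {                               # first file: the layout
        # Skip blank lines, the header, and anything else whose last
        # two columns are not plain integers.
        if (NF < 3 || $(NF-1) !~ /^[0-9]+$/ || $NF !~ /^[0-9]+$/) next
        hdr = hdr sep $1; sep = ","
        len[++n] = $(NF-1)
        start[n] = $NF
        next
    }
    FNR == 1 { print hdr }                    # header once, first record is kept
    {
        for (i = 1; i <= n; i++)
            printf "%s,", substr($0, start[i], len[i])
        print substr($0, start[n] + len[n])   # rest of the line after the last field
    }' "$1" "$2"
}
```

With the info and data files from post #1, split_by_info info data should give the five-column layout, whether or not the info file has the blank line after its header.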

nua7 · June 24, 2015, 6:00am

Hi,
I get the following error when I try to use Akshay's code. Is it related to the Unix flavor I use? I am working on HP-UX.

 syntax error The source line is 20.
 The error context is
                        print >>>  FNR== <<<
 awk: The statement cannot be correctly parsed.
 The source line is 20.

---------- Post updated at 03:30 PM ---------- Previous update was at 03:28 PM ----------

@RudiC, you are correct. I need the format below.

A,00000012277,N,00000021600,2026BLCF04

RudiC · June 24, 2015, 10:42am

Try parenthesizing the entire conditional expression in the print statement.
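
The error context above stops at the == right after print, which suggests that this awk does not accept an unparenthesized comparison there. A minimal sketch of the parenthesized form follows. The function name header_then_lines and the hdr value are placeholders for illustration, and this has not been tried on HP-UX.

```shell
# Sketch: wrap the whole ?: expression in parentheses after print.
# header_then_lines and the hdr value are hypothetical placeholders.
header_then_lines() {
    awk -v hdr="HEADER" '
    {
        # Concatenation binds tighter than ?:, so the first branch is
        # hdr, a newline (RS), then the current line.
        print (FNR == 1 ? hdr RS $0 : $0)
    }' "$@"
}
```

Applied to extract.awk, the same idea would mean writing the last rule as print (FNR==1 ? hdr RS extract() : extract()).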