awk issue splitting a fixed-width file containing line feed in data

Hi Forum.

I have the following script that splits a large fixed-width file into smaller multiple fixed-width files based on input segment type.

The main command in the script is:

awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v segment_type="$segment_type" 'substr($0, search_col_pos, search_str_len) ~ segment_type {print $0}'

To call the script:

file_splitter.sh filename CO 33 2

Where CO is the segment_type to search for ($segment_type)
           33 is the start search position ($search_col_pos)
             2 is the length of the search string ($search_str_len)

The issue that I'm having is that some of the data columns in the source file contains some binary values along with a LF (Line Feed) character in it.

As a result, the awk command interprets the LF in the data as the end of the line and only partially returns the record. Anything after the LF in the data column is not returned. Therefore, I have an incomplete record.

How do I handle this issue?

Thanks.

Wouldn't a decent, representative input sample be nice?

Sure can. Thanks.

LF is on the 2nd record (After bolded red text)

00003061895700000000208001X UK1001249751+0000000752+0000000753+0000000754+0000000755+0000000756+0000000757+0000000758+0000000759+0000000760+0000000761+0000000762+0000000763+0000000764+0000000765+0000000766+0000000767+0000000768+0000000769+0000000770+0000000771+0000000772+0000000773+0000000774+0000000775+0000000776+0000000777+0000000778+0000000779+0000000780+0000000781+0000000782+0000000783+0000000784+0000000785+0000000786+0000000787+0000000788+0000000789+0000000790+0000000791+0000000792+0000000793+0000000794+0000000795+0000000796+0000000797+0000000798+0000000799+0000000800+0000000801+0000000802+0000000803+0000000804+0000000805+0000000806+0000000807+0000000808+0000000809+0000000810+0000000811+0000000812+0000000813+0000000814+0000000815+0000000816+0000000817+0000000818+0000000819+0000000820+0000000821+0000000822+0000000823+0000000824+0000000825+0000000826+0000000827+0000000828+0000000829+0000000830+0000000831+0000000832+0000000833+0000000834+0000000835+0000000836+0000000837+0000000838+0000000839+0000000840+0000000841+0000000842+0000000843+0000000844+0000000845+0000000846+0000000847+0000000848+0000000849+0000000850+0000000851+0000000852+0000000853+0000000854+0000000855+0000000856+0000000857+0000000858+0000000859+0000000860+0000000861+0000000862+0000000863+0000000864+0000000865+0000000866+0000000867+0000000868+0000000869+0000000870+0000000871+0000000872+0000000873+0000000874+0000000875+0000000876+0000000877+0000000878+0000000879+0000000880+0000000881+0000000882+0000000883+0000000884+0000000885+0000000886+0000000887+0000000888+0000000889+0000000890+0000000891+0000000892+0000000893+0000000894+0000000895+0000000896+0000000897+0000000898+0000000899+0000000900+0000000901+0000000902+0000000903+0000000904+0000000905+0000000906+0000000907+0000000908+0000000909+0000000910+0000000911+0000000912+0000000913+0000000914+0000000915+0000000916+0000000917+0000000918+0000000919+0000000920+0000000921+0000000922+0000000923+0000000924+0000000925+0000000926+0000000927+0000000928+0000000929+0000000930+0000000931+0000000932+0000000933+0000000934+0000000935+0000000936+0000000937+0000000938+0000000939+0000000940+0000000941+0000000942+0000000943+0000000944+0000000945+0000000946+0000000947+0000000948+0000000949+0000000950+0000000951+0000000952+0000000953+0000000954+0000000955+0000000956+0000000957+0000000958+0000000959+0000000960+0000000961+0000000962+0000000963+0000000964+0000000965+0000000966+0000000967+0000000968+0000000969+0000000970+0000000971+0000000972+0000000973+0000000974+0000000975+0000000976+0000000977+0000000978+0000000979+0000000980+0000000981+0000000982+0000000983+0000000984+0000000985+0000000986+0000000987+0000000988+0000000989+0000000990+0000000991+0000000992+0000000993+0000000994+0000000995+0000000996+0000000997+0000000998+0000000999+0000000
00003061895700000000208001X UF1101600000000000000000001001111111111110000000000000000000000000000000000000000000000000000100002000000000000000000000ON 000E MC02+000000000+000000000+000000091+000000000+000000100002+000001100004+000001420005+000002020003+000001200+000000000+000000000000000+000000404+000000420+000000020+000000018001+00000010000006000270000000000000000000000078200000000000001000000000000000000000+199500+199500+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000002+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+0000000+00000000000000900000100000000000000100000000000000000000240000000000+0001003+0000001+0000001+0000010+0000103+0000000+0000000+0000000+0000000+0000000+0000086+00000000000000000000000000020170220170208201612000000000000000000000000000000000000+00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000020150900201509N00000000000000000000000000000000000000000000000000000000000000000000000400N    5    166  7                             17029029                                                    002004005006                                                                                                        
�     ��       D                     -�                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
00003061895700000000212001X CU0101306189570000000021200 X 00000000000000000000201509150000000+00000000100001982030600000000000000000001009915903222177622001000004999001001+00001200020150915+000012000+000000000+0000000000001000004999001001+00000360020150915+000003600+000000000+0000000000001000004999001001+00000000000000000+000000500+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000000000000000000000+00000000000000000+000000000+000000000+0000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+0000000000000000000+000000000+00000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+000000000+00000000000000000000000000000+000000000+000000000+000000000+000000000+000000000+000000000     00000000+000000000     00000000+00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000008040080200802008020082800828008280080200802000000080100801000000000+00023876+00000000CON120170208000P010

So - the records are fixed width PLUS <LF> terminated? How come a "split" record is WAY longer than a normal record (not counting the spaces at the end), and why do the two normal records (1 & 3) differ in length? None of the records seem to have structure elements in common?

Hi RudiC.

If you look at position 33, there are 3 different types of segments (UK, UX, CU).

That's why we cannot process the source file as is since each segment can have different data lengths.

We need to split the source file by segment types first and then process each sub file separately.

I hope I answered your question.

Thanks.

for a program to process this file correctly you will need to list every possible segment type and it's length. The awk script can then fetch characters without relying on <LF> characters which can randomly appear within the actual segment.

Try something like this to reconstruct the lines before processing:

awk -v search_col_pos=$search_col_pos -v search_str_len=$search_str_len -v segment_type="$segment_type" '
  function cond_print(record) { 
    if (substr(record, search_col_pos, search_str_len) ~ segment_type) print record
  }
  !/^[0-9]{25}/ {
    prev=prev RS $0
    next
  }
  {
    if(NR>1) cond_print(prev);
    prev=$0
  } 
  END {
    cond_print(prev)
  }
' file

Hi Scrutinizer.

Thank you for the code suggestion. I tried your code above with the input_file.txt attached but I didn't get any data in the output file.

This is what I executed:

awk -v search_col_pos=33 -v search_str_len=2 -v segment_type="UF" '
  function cond_print(record) { 
    if (substr(record, search_col_pos, search_str_len) ~ segment_type) print record
  }
  !/^[0-9]{25}/ {
    prev=prev RS $0
    next
  }
  {
    if(NR>1) cond_print(prev);
    prev=$0
  } 
  END {
    cond_print(prev)
  }
' input_file.txt > output_file.txt

Can you please help?

Hi, your previous sample contains "UF" at position 29, not at 33 . And this sample does not contain "UF" at all.