Split and add header and trailer from input file

techedipro · June 2, 2014, 10:25am

I need to split the file based on pattern from position 34-37 while retaining the header and trailer records in each individual split file

Also is it possible to output the TOM and PAT records in the same output file ?

I need the output file names same as xyz_pattern_Datetimestamp.txt

 
H00000012345678900000000 xxxxxxxxxxxxxx
D00000012300000000000000 xxxxxxxxTOMxxx
D00000045600000000000000 xxxxxxxxTOMxxx
D00000078900000000000000 xxxxxxxxPATxxx
D00000065000000000000000 xxxxxxxxPATxxx
D00000023100000000000000 xxxxxxxxPATxxx
D00000013200000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxSAMxxx
T00000025800000000000000 xxxxxxxxxxxxxxx

So far I was able to split the file with the below command but struggling with header and trailer on individual split file and also with the output file name

 
awk -F "" 'BEGIN {x=""} NR>1 { x=substr($0,34,3); print $0 >> x".txt" }' Test.txt

RudiC · June 2, 2014, 11:31am

Welcome to the forum, and congrats for using code tags correctly right from the start!

Still, some questions remain:
Where's the header and the trailer?
Does the order of lines matter?
If you want TOM and PAT to go to the same file, which pattern should be used for the file name?
Do they have to be combined in the awk script, or can they be concatenated afterwards?
And, surely this is a typo, but you extract three chars in your script, but specify four (34 - 37).

techedipro · June 2, 2014, 12:07pm

Thanks!

The first record in the input file is the Header record , the last record in the input file is the trailer record . Yes the order matters, the records on the output file should appear in the same sequence as they are on the input file

The first pattern TOM should be used for the output file name in the scenario where we are grouping the records i.e. TOM and PAT records going in to the same output file

Yes the pattern to look for is 3 characters from position 34

MadeInGermany · June 2, 2014, 3:25pm

This is a true split:

awk '
{
  if (/^H/) {header=$0}
  else if (/^T/) {trailer=$0}
  else {
    fname=substr($0,34,3)".txt"
    if (!(fname in A)) {A[fname]; print header > fname}
    print $0 >> fname
  }
}
END {
  for (fname in A) print trailer >> fname
}
' Test.txt

techedipro · June 2, 2014, 4:59pm

Thank you , the above awk command splitted the files with header and trailer in each individual file

left with grouping of TOM and PAT records and the output file name
as xyz_pattern_Datetimestamp.txt

Chubler_XL · June 2, 2014, 8:09pm

Try this:

awk -v stamp=$(date +%Y%m%d%H%M%S) '
{
  if (/^H/) {header=$0}
  else if (/^T/) {trailer=$0}
  else {
    typ=substr($0,34,3)
    if(typ == "PAT") typ="TOM"
    fname="xyz_" typ "_" stamp ".txt"
    if (!(fname in A)) {A[fname]; print header > fname}
    print $0 >> fname
    close(fname)
  }
}
END {
  for (fname in A) {
    print trailer >> fname
    close(fname)
  }
}
' Test.txt

Note: You will need the close(fname) statements if you ever get more than 20 or so record types as awk has a limited number of open files.

techedipro · June 2, 2014, 11:13pm

Thanks Chubler!

Can you please explain your code

Chubler_XL · June 2, 2014, 11:29pm

-v stamp=$(date +%Y%m%d%H%M%S) assign stamp variable with current date and time in YYYYMMDDhhmmss format - Note this is the time the process starts and all files produced in this session will have the same timestamp, even if the process takes a long time to run.

typ=substr($0,34,3) extract 3 chars from 34-36 into typ variable.

fname="xyz_" typ "_" stamp ".txt" build filename string "xyz_" plus typ plus "_" plus timestame plus ".txt" eg (xyz_TOM_201403061355.txt)

if(typ == "PAT") typ="TOM" put PAT lines into TOM file - this forces both TOM and PAT lines into the same file with xyz_TOM_... name.

if (!(fname in A)) {A[fname]; print header > fname} If this is the first line to go into this file (ie fname is not in the A[] array) then store fname in A[] array and output header to fname file.

close(fname) close file handle (this will be re-opened next time a line is appended, this is slower as lots of closing and opening of file buffers but awk is limited in the number of file buffers allowed to be open at once.

print $0 >> fname append current line to fname file

The END block writes the trailer line to each of the end of all files produced in this session once the end of the input file is reached.

techedipro · June 3, 2014, 1:07am

Thank you again!

A small change , I need to group three patterns in to the same file i.e. TOM ,PAT and SAM in the same file , with file name as xyz_TOM_Datetimestamp

Also can how can I capture records with spaces and any records that do not match pattern TOM/PAT/SAM and put them in a file as error_records_datetimestamp

Here is my updated input file

 
H00000012345678900000000 xxxxxxxxxxxxxx
D00000012300000000000000 xxxxxxxxTOMxxx
D00000045600000000000000 xxxxxxxxTOMxxx
D00000078900000000000000 xxxxxxxxPATxxx
D00000065000000000000000 xxxxxxxxPATxxx
D00000023100000000000000 xxxxxxxxPATxxx
D00000013200000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxBOBxxx
D00000036500000000000000 xxxxxxxxBOBxxx
D00000036500000000000000 xxxxxxxx   xxx
D00000036500000000000000 xxxxxxxx   xxx
T00000025800000000000000 xxxxxxxxxxxxxx

RudiC · June 3, 2014, 5:47am

Try this modification of Chubler_XL's solution:

awk -v stamp=$(date +%Y%m%d%H%M%S) '
         BEGIN          {fname1="xyz_TOM_" stamp ".txt"
                         fname2="error_records_" stamp ".txt"}
         /^ *$/         {next}
         /^(H|T)/       {print $0 >> fname1
                         print $0 >> fname2
                         next}
         substr($0,34,3) ~ /PAT|SAM|TOM/    \
                        {print $0 >> fname1; next}
                        {print $0 >> fname2}
        ' file

techedipro · June 3, 2014, 10:40am

Thanks!

This works good for patterns TOM/PAT/SAM/Spaces

The moment I have a new pattern , I need to adjust the code . If so where should my changes go in the code ?

From the below input file

TOM/PAT/SAM records should go in the file "xyz_TOM_Timestamp.txt"
BOB records should go in "xyz_BOB_Timestamp.txt"
KIM records should go in " xyz_KIM_Timestamp.txt"
Spaces and any records that are other than TOM/PAT/SAM/BOB/KIM should go in "xyz_error_Timestamp.txt

Please advise

 
H00000012345678900000000 xxxxxxxxxxxxxx
D00000012300000000000000 xxxxxxxxTOMxxx
D00000045600000000000000 xxxxxxxxTOMxxx
D00000078900000000000000 xxxxxxxxPATxxx
D00000065000000000000000 xxxxxxxxPATxxx
D00000023100000000000000 xxxxxxxxPATxxx
D00000013200000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxBOBxxx
D00000036500000000000000 xxxxxxxxBOBxxx
D00000036500000000000000 xxxxxxxx   xxx
D00000036500000000000000 xxxxxxxxKIMxxx
D00000036500000000000000 xxxxxxxxKIMxxx
D00000036500000000000000 xxxxxxxx   xxx
D00000036500000000000000 xxxxxxxxDANxxx
D00000036500000000000000 xxxxxxxxDANxxx
T00000025800000000000000 xxxxxxxxxxxxxx

Chubler_XL · June 3, 2014, 4:37pm

how about this, it should be fairly easy for you to change the f (from) and t (to) variables on the command line to whatever you like:

awk -v stamp=$(date +%Y%m%d%H%M%S) \
    -v f="TOM PAT SAM BOB KIM" \
    -v t="TOM TOM TOM BOB KIM" '
BEGIN {
   split(f,from)
   for(i=split(t,to);i;i--) CONV[from]=to
}
/^H/ {header=$0 ; next}
/^T/ {trailer=$0 ; next}
{
   typ=substr($0,34,3)
   if(typ in CONV) fname="xyz_" CONV[typ] "_" stamp ".txt"
   else fname="xyz_error_" stamp ".txt"
   if (!(fname in A)) {A[fname]; print header > fname}
   print $0 >> fname
   close(fname)
}
END {
  for (fname in A) print trailer >> fname
}' Test.txt

If there are only a few of these from->to pairs the above should work fine, but if you find yourself with a large list if may be worth a different approach, like putting them is another translate.txt file and changing the code to load this first.

techedipro · June 4, 2014, 12:46am

Superb, works like a charm!

One small change the error file has header and trailer records , can we reduce it to only carry the actual records that have spaces in position 34-36 as well as those records that are not a part of the split condition

Thanks

Chubler_XL · June 4, 2014, 3:25pm

Sure, this should avoid header/trailer records in error file:

awk -v stamp=$(date +%Y%m%d%H%M%S) \
    -v f="TOM PAT SAM BOB KIM" \
    -v t="TOM TOM TOM BOB KIM" '
BEGIN {
   split(f,from)
   for(i=split(t,to);i;i--) CONV[from]=to
}
/^H/ {header=$0 ; next}
/^T/ {trailer=$0 ; next}
{
   typ=substr($0,34,3)
   if (typ in CONV) {
       fname="xyz_" CONV[typ] "_" stamp ".txt"
       if (!(fname in A)) {A[fname]; print header > fname}
   } 
   else fname="xyz_error_" stamp ".txt"
   print $0 >> fname
   close(fname)
}
END {
  for (fname in A) print trailer >> fname
}' Test.txt

techedipro · June 9, 2014, 1:09pm

Thank you.

New line characters are appearing in the error file if they happen to present in the input file , how can I avoid them ?

Also can you briefly explain the new portion of the code

Appreciate your help!

Chubler_XL · June 9, 2014, 6:26pm

Can you supply a demo of the input file with this new line characters, are you talking about blank/empty lines?

To explain the new code:

    -v f="TOM PAT SAM BOB KIM" \
    -v t="TOM TOM TOM BOB KIM" '
BEGIN {
   split(f,from)
   for(i=split(t,to);i;i--) CONV[from]=to
}

Above we pass into awk two space separated lists f and t (for from and to). These are loaded into a CONV lookup array. Such that:

CONV[TOM] => TOM 
CONV[PAT] => TOM
CONV[KIM] => KIM

Below we check the line and if it's setup in the CONV array we assign the fname and write the header line to the file when this is the first time this file is written. If line is not setup in CONV array fname is "xyz_error_" plus time-stamp instead:

   if (typ in CONV) {
       fname="xyz_" CONV[typ] "_" stamp ".txt"
       if (!(fname in A)) {A[fname]; print header > fname}
   } 
   else fname="xyz_error_" stamp ".txt"

techedipro · June 10, 2014, 1:40pm

Yes, I am talking about line feed characters at the end of the file

 
H00000012345678900000000 xxxxxxxxxxxxxx
D00000012300000000000000 xxxxxxxxTOMxxx
D00000045600000000000000 xxxxxxxxTOMxxx
D00000078900000000000000 xxxxxxxxPATxxx
D00000065000000000000000 xxxxxxxxPATxxx
D00000023100000000000000 xxxxxxxxPATxxx
D00000013200000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxSAMxxx
D00000036500000000000000 xxxxxxxxBOBxxx
D00000036500000000000000 xxxxxxxxBOBxxx
D00000036500000000000000 xxxxxxxx   xxx
D00000036500000000000000 xxxxxxxx   xxx
T00000025800000000000000 xxxxxxxxxxxxxx

Chubler_XL · June 11, 2014, 5:45pm

How about this change to ignore lines with 3 or less characters:

awk -v stamp=$(date +%Y%m%d%H%M%S) \
    -v f="TOM PAT SAM BOB KIM" \
    -v t="TOM TOM TOM BOB KIM" '
BEGIN {
   split(f,from)
   for(i=split(t,to);i;i--) CONV[from]=to
}
/^H/ {header=$0 ; next}
/^T/ {trailer=$0 ; next}
length>3{
   typ=substr($0,34,3)
   if (typ in CONV) {
       fname="xyz_" CONV[typ] "_" stamp ".txt"
       if (!(fname in A)) {A[fname]; print header > fname}
   } 
   else fname="xyz_error_" stamp ".txt"
   print $0 >> fname
   close(fname)
}
END {
  for (fname in A) print trailer >> fname
}' Test.txt

techedipro · June 16, 2014, 10:16am

Thanks Chubler ..greatly appreciate it !

techedipro · July 6, 2014, 4:55pm

One more change ,I need to replace a bunch of special characters on the input file while I am splitting it

Below is the list that I need to find and then replace with the corresponding value

This list could grow..I need to be able to control this

Character Found    Character Replaced
ED                SPACE
E9                SPACE
D9                SPACE
C2                SPACE
80                SPACE
99                SPACE
E2                SPACE
0D0A              0A
C3                SPACE
C4                SPACE
C8                SPACE
C9                SPACE
A0                SPACE
FF                SPACE
B1                SPACE
83                SPACE
C1                SPACE
E1                SPACE
B7                SPACE
F6                SPACE
F1                SPACE
F3                SPACE
E3                SPACE
1A                SPACE

Appreciate your help!