So far I was able to split the file with the below command but struggling with header and trailer on individual split file and also with the output file name
Welcome to the forum, and congrats for using code tags correctly right from the start!
Still, some questions remain:
Where's the header and the trailer?
Does the order of lines matter?
If you want TOM and PAT to go to the same file, which pattern should be used for the file name?
Do they have to be combined in the awk script, or can they be concatenated afterwards?
And, surely this is a typo, but you extract three chars in your script, but specify four (34 - 37).
The first record in the input file is the Header record , the last record in the input file is the trailer record . Yes the order matters, the records on the output file should appear in the same sequence as they are on the input file
The first pattern TOM should be used for the output file name in the scenario where we are grouping the records i.e. TOM and PAT records going in to the same output file
Yes the pattern to look for is 3 characters from position 34
-v stamp=$(date +%Y%m%d%H%M%S) assign stamp variable with current date and time in YYYYMMDDhhmmss format - Note this is the time the process starts and all files produced in this session will have the same timestamp, even if the process takes a long time to run.
typ=substr($0,34,3) extract 3 chars from 34-36 into typ variable.
fname="xyz_" typ "_" stamp ".txt" build filename string "xyz_" plus typ plus "_" plus timestame plus ".txt" eg (xyz_TOM_201403061355.txt)
if(typ == "PAT") typ="TOM" put PAT lines into TOM file - this forces both TOM and PAT lines into the same file with xyz_TOM_... name.
if (!(fname in A)) {A[fname]; print header > fname} If this is the first line to go into this file (ie fname is not in the A[] array) then store fname in A[] array and output header to fname file.
close(fname) close file handle (this will be re-opened next time a line is appended, this is slower as lots of closing and opening of file buffers but awk is limited in the number of file buffers allowed to be open at once.
print $0 >> fname append current line to fname file
The END block writes the trailer line to each of the end of all files produced in this session once the end of the input file is reached.
A small change , I need to group three patterns in to the same file i.e. TOM ,PAT and SAM in the same file , with file name as xyz_TOM_Datetimestamp
Also can how can I capture records with spaces and any records that do not match pattern TOM/PAT/SAM and put them in a file as error_records_datetimestamp
The moment I have a new pattern , I need to adjust the code . If so where should my changes go in the code ?
From the below input file
TOM/PAT/SAM records should go in the file "xyz_TOM_Timestamp.txt"
BOB records should go in "xyz_BOB_Timestamp.txt"
KIM records should go in " xyz_KIM_Timestamp.txt"
Spaces and any records that are other than TOM/PAT/SAM/BOB/KIM should go in "xyz_error_Timestamp.txt
how about this, it should be fairly easy for you to change the f (from) and t (to) variables on the command line to whatever you like:
awk -v stamp=$(date +%Y%m%d%H%M%S) \
-v f="TOM PAT SAM BOB KIM" \
-v t="TOM TOM TOM BOB KIM" '
BEGIN {
split(f,from)
for(i=split(t,to);i;i--) CONV[from]=to
}
/^H/ {header=$0 ; next}
/^T/ {trailer=$0 ; next}
{
typ=substr($0,34,3)
if(typ in CONV) fname="xyz_" CONV[typ] "_" stamp ".txt"
else fname="xyz_error_" stamp ".txt"
if (!(fname in A)) {A[fname]; print header > fname}
print $0 >> fname
close(fname)
}
END {
for (fname in A) print trailer >> fname
}' Test.txt
If there are only a few of these from->to pairs the above should work fine, but if you find yourself with a large list if may be worth a different approach, like putting them is another translate.txt file and changing the code to load this first.
One small change the error file has header and trailer records , can we reduce it to only carry the actual records that have spaces in position 34-36 as well as those records that are not a part of the split condition
Can you supply a demo of the input file with this new line characters, are you talking about blank/empty lines?
To explain the new code:
-v f="TOM PAT SAM BOB KIM" \
-v t="TOM TOM TOM BOB KIM" '
BEGIN {
split(f,from)
for(i=split(t,to);i;i--) CONV[from]=to
}
Above we pass into awk two space separated lists f and t (for from and to). These are loaded into a CONV lookup array. Such that:
CONV[TOM] => TOM
CONV[PAT] => TOM
CONV[KIM] => KIM
Below we check the line and if it's setup in the CONV array we assign the fname and write the header line to the file when this is the first time this file is written. If line is not setup in CONV array fname is "xyz_error_" plus time-stamp instead:
if (typ in CONV) {
fname="xyz_" CONV[typ] "_" stamp ".txt"
if (!(fname in A)) {A[fname]; print header > fname}
}
else fname="xyz_error_" stamp ".txt"
One more change ,I need to replace a bunch of special characters on the input file while I am splitting it
Below is the list that I need to find and then replace with the corresponding value
This list could grow..I need to be able to control this
Character Found Character Replaced
ED SPACE
E9 SPACE
D9 SPACE
C2 SPACE
80 SPACE
99 SPACE
E2 SPACE
0D0A 0A
C3 SPACE
C4 SPACE
C8 SPACE
C9 SPACE
A0 SPACE
FF SPACE
B1 SPACE
83 SPACE
C1 SPACE
E1 SPACE
B7 SPACE
F6 SPACE
F1 SPACE
F3 SPACE
E3 SPACE
1A SPACE