Merge text files while combining the multiple header/trailer records into one each.

oordonez · November 17, 2008, 12:31pm

Situation:
Our system currently executes a job (COBOL program) that generates an interface file to be sent to one of our vendors. Because this system processes information for over 100,000 employees/retirees (and growing), we'd like to multi-thread the job into processing groups in order to reduce its run time. This works fine; however, we're now faced with multiple interface files that need to be merged prior to transferring them to the vendor.

Some Details on the File:
The generated file has a header and a trailer record, and the trailer record holds pertinent total values (e.g., employee count, records approved, etc.). There are no field separators -- these are fixed-length fields.

Predicament in Detail:
We'd like to concatenate the files -- that's the easy part. What makes this difficult is that we need to eliminate the multiple header records and retain only the first one. Also, we need to eliminate the multiple trailer records, but we need to add all the value totals from each trailer into the one trailer record we'll retain at the end.

As you might have surmised by now, I've written some UNIX scripts, but lack some key knowledge related to individual record and field manipulation within a text file. In particular, I'd like to know how I can define specific fields when I read each record -- these are the fields for the trailer records I need to keep a rolling total on. Also, I'd like to know how I can delete individual records.

Any assistance will be greatly appreciated.

jim_mcnamara · November 17, 2008, 12:56pm

You did not give enough information to build a correct script.
We need a sample header line, a sample data line, and a sample trailer line.

oordonez · November 17, 2008, 1:09pm

Sorry about that! Here's a sample file -- incomplete records, though, as they're rather large. But the pertinent information is contained.

BATCH HEADER PRO 0724200808042008
01E000036841 LEAD05151948F 51498 10012007 YYY
02E000036841 ME 04161988F 10012007
01E000060640 MDGV12251951F 51498 1001200709302008YYY
02E000060640 RD 05061941M 1001200709302008
01E000025850 LDUO06081956F 51498 1001200709302008YYY
02E000025850 ED 10071937M 1001200709302008
01E029009859 DUA05021960F 51498 10012007 YYY
02E029009859 LD 03101989F 10012007
02E029009859 LD 02041997M 10012007
01E034008379 AEUA09181965F 51498 10012007 YYY
02E034008379 NE 11131991F 10012007
02E034008379 RE 01131993F 10012007
02E034008379 EE 09191959M 10012007
01E045005523 EUA02131964M 51498 10012007 YNN
01E046004280 DUA12041947M 51498 10012007 YYY
02E046004280 D 12121953F 10012007
02E046004280 KE 09211986M 10012007
01E048005119 BDUA01301961F 51498 10012007 YNN
01E055002147 LDUA10011964F 51498 10012007 YYY
02E055002147 RD 11121966M 10012007
02E055002147 ND 02131997F 10012007
02E055002147 JD 03111992M 10012007
01E057008796 SEUA12061975F 51498 10012007 YYY
BATCH TRAILER 000001150000019908042008

Details on the Trailer Record: the 00000115 is a total value (number of employees), the 00000199 is the total of records processed (employees and dependents). Those two fields I'd need to keep a rolling total for all the files we merge.
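As a rough illustration of how fixed-length fields like these can be picked out without separators, here is a minimal awk sketch using substr(). The helper name trailer_totals is made up for this example, and the column positions (15 and 23) are an assumption based on the trimmed sample above, where "BATCH TRAILER " takes up 14 characters; the real record layout may well differ.

```shell
#!/bin/sh
# Print the two 8-digit totals of each trailer record as plain numbers.
# ASSUMPTION: the counts start at column 15, as in the trimmed sample;
# adjust the offsets to the real record layout.
trailer_totals() {
    awk '/^BATCH TRAILER/ {
        emp  = substr($0, 15, 8) + 0   # adding 0 turns "00000115" into 115
        recs = substr($0, 23, 8) + 0   # next 8 characters: records processed
        print emp, recs
    }' "$@"
}

# Example: feed it the sample trailer line
printf 'BATCH TRAILER 000001150000019908042008\n' | trailer_totals
# prints: 115 199
```

With no file arguments the function reads standard input; given file names, it prints one pair of totals per trailer found.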

The detail records are over 300 characters wide (irrelevant for what we need to do, but I thought I'd mention it).

Thank you!

jim_mcnamara · November 17, 2008, 2:03pm

Assuming that 01E000036841 holds an employee id and that the files are named <something>.dat:

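# take the header record from the first .dat file
set -- *.dat
head -1 "$1" > tmp

awk '
    # skip header and trailer records, remembering the most recent one
    index($0, "HEADER") > 0 || index($0, "TRAILER") > 0 { last = $0; next }
    {
        # employee id: assumed to be the 10 characters after the 2-character record type
        arr[substr($0, 3, 10)]++
        print
    }
    END {
        for (i in arr) {
            empcnt++         # distinct employee ids
            lc += arr[i]     # all detail records (employees and dependents)
        }
        # new trailer: both counts zero-padded to 8 digits, then the last 8 characters of the last trailer read
        printf("BATCH TRAILER %08d%08d%s\n", empcnt, lc, substr(last, length(last) - 7))
    }' "$@" >> tmp
mv tmp employee.dat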

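This also assumes the last eight characters of every BATCH TRAILER record are the same. Because the output file, employee.dat, also matches *.dat, move it out of the directory before running the script again.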
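The original request was to add up the totals already present in each trailer rather than recount the detail records. A minimal sketch of that variant follows. It rests on several assumptions: merge_batches is a hypothetical helper name, header and trailer records start with "BATCH HEADER" and "BATCH TRAILER", the two 8-digit counts sit at columns 15-22 and 23-30 as in the trimmed sample, and everything after column 30 is identical in all trailers.

```shell
#!/bin/sh
# merge_batches OUT FILE...   (hypothetical helper name)
# ASSUMPTIONS: column positions are taken from the trimmed sample and
# must be adjusted to the real fixed-length layout.
merge_batches() {
    out=$1; shift
    awk '
        /^BATCH HEADER/  { if (!seen++) print; next }   # keep the first header only
        /^BATCH TRAILER/ {                              # drop the trailer, keep its totals
            emp  += substr($0, 15, 8)                   # employee count
            recs += substr($0, 23, 8)                   # records processed
            lead = substr($0, 1, 14)                    # text before the counts
            rest = substr($0, 31)                       # text after the counts
            found = 1
            next
        }
        { print }                                       # detail records pass through
        END {
            # write a single trailer carrying the summed, zero-padded totals
            if (found) printf "%s%08d%08d%s\n", lead, emp, recs, rest
        }
    ' "$@" > "$out"
}
```

A call might look like `merge_batches vendor.txt group1.dat group2.dat` (file names invented for illustration). Writing to a name that does not match the input pattern keeps a later run from picking up the merged file as input.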

summer_cherry · November 17, 2008, 10:06pm

Hi, the Perl script below may help a little.

Usage: perl a.pl NUM FILE1 FILE2 (NUM is the number of header lines at the top of each file)

a:
*****
line 1
line 2
1 2 3 4 5

b:
*****
line 3
line 4
9 8 7 6 5

output:

*****
line 1
line 2
line 3
line 4
10 10 10 10 10

use strict;
use warnings;

my $header = shift;	# number of header lines at the top of each file
my (@head, @body, @foot);
for my $file (@ARGV) {
	open my $fh, '<', $file or die "Cannot open file $file: $!";
	chomp(my @temp = <$fh>);
	close $fh;
	# keep the header lines from the first file only
	@head = @temp[0 .. $header - 1] unless @head;
	# everything between the header and the last line is body
	push @body, @temp[$header .. $#temp - 1];
	# the last line is the footer: add its space-separated fields column by column
	my @footer = split ' ', $temp[-1];
	$foot[$_] += $footer[$_] for 0 .. $#footer;
}
print join("\n", @head, @body, join(' ', @foot)), "\n";
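For systems without Perl, the same idea (keep the first NUM header lines, pass the body through, add the space-separated footer fields column by column) can be sketched in awk. This is only an illustration: merge_generic is a made-up name, and it assumes the last line of every file is a footer of space-separated numbers, which is not how the fixed-length trailer in the question is laid out.

```shell
#!/bin/sh
# merge_generic NUM FILE...   (hypothetical name; mirrors the Perl usage)
# ASSUMPTION: the last line of each file is a footer of space-separated numbers.
merge_generic() {
    num=$1; shift
    awk -v num="$num" '
        function flush_footer(   i, n, f) {      # add the held line to the running totals
            n = split(held, f, " ")
            for (i = 1; i <= n; i++) total[i] += f[i]
            if (n > cols) cols = n
            have = 0
        }
        FNR == 1 && have { flush_footer() }      # a new file begins: the held line was a footer
        FNR <= num { if (NR == FNR) print; next }  # header lines: print for the first file only
        { if (have) print held; held = $0; have = 1 }  # hold each line back by one
        END {
            if (have) flush_footer()             # footer of the last file
            for (i = 1; i <= cols; i++)
                printf "%s%s", total[i], (i < cols ? " " : "\n")
        }
    ' "$@"
}
```

Holding each line back by one is what lets a streaming tool like awk recognise the final line of a file without reading the whole file into memory first.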