Splitting the files via shell script

Maya_Pillai · July 15, 2010, 3:00am

Hi all,
We have 102 flat files created by Informatica from 102 tables. These 102 files contain pharmacy details.
There are a total of 450 pharmacy ids. The naming convention for the flat file is ODS_<TABLE NAME>_yyyymmdd_timestamp.dat.
Each flat file may contain data for up to 450 pharmacies, which are identified by pharmacy id.
The requirement is to split each of the flat files into up to 450 pharmacy files (102*450 in total).
1) The flat file name is dynamic; it changes with the date and time every day.
2) The flat file may or may not contain data for some pharmacies. At most, it may contain 450 pharmacy ids.
3) The output file name is dynamic, with the format <pharmacyid>_TABLENAME_yyyymmdd_timestamp.
4) The location of the pharmacy id in each input flat file varies according to the source table structure.

e.g. The input flat file ODS_ADT_20100731_000001.dat contains 5 rows:
pharmacyid description objectid reasoncode
A123 pharmacy1 101 null
b123 pharmacy2 102 null
C123 pharmacy3 103 null
A123 pharmacy1 104 null
B123 pharmcy 105 null

These data need to be split into 3 output flat files (as there are 3 distinct pharmacy ids),
namely A123_ADT_20100731_000002, B123_ADT_20100731_000002, C123_ADT_20100731_000002.

How can we write a generic script to handle the 102 input files (each of which may generate up to 450 pharmacy files)?

Any help on this is appreciated.

Thanks
Maya

bsnithin · July 15, 2010, 5:20am

Hi Maya,

Are the pharmacy ids fixed? That is, do you know all 450 pharmacy ids? If so, here is a simple script you can use.

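#!/bin/ksh

# Space-separated list of all known pharmacy ids (placeholders)
pharmacy_ids="p_id1 p_id2 p_id3 ... p_id450"

# Use a shell glob rather than parsing ls; "ODS_.*" would only match names starting with "ODS_."
for file in ODS_*.dat
do
 for p_id in $pharmacy_ids
 do
  # Copy every line mentioning this id into <id>_<input file name>
  grep "$p_id" "$file" > "${p_id}_${file}"
 done
done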

-Nithin.
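
A possible variation on the loop above, as a rough sketch only: it builds the output name in the requested <pharmacyid>_TABLENAME_yyyymmdd_timestamp form by stripping the ODS_ prefix and the .dat suffix with parameter expansion, and it removes the output for any pharmacy that has no rows in a file. The four-id list (including D123), the sample data, and reusing the input file's date and timestamp in the output name are assumptions made for illustration. Note that grep -w still matches the id anywhere on the line, not only in the id column.

```shell
#!/bin/sh
# Work in a scratch directory with a small sample file (illustrative data only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > ODS_ADT_20100731_000001.dat <<'EOF'
pharmacyid description objectid reasoncode
A123 pharmacy1 101 null
b123 pharmacy2 102 null
C123 pharmacy3 103 null
A123 pharmacy1 104 null
B123 pharmcy 105 null
EOF

# Hypothetical short list; the real one would hold all 450 ids.
pharmacy_ids="A123 B123 C123 D123"

for file in ODS_*.dat
do
  base=${file#ODS_}     # ADT_20100731_000001.dat
  base=${base%.dat}     # ADT_20100731_000001
  for p_id in $pharmacy_ids
  do
    out="${p_id}_${base}"
    # -w: match the id as a whole word; -i: treat b123 and B123 alike
    grep -iw "$p_id" "$file" > "$out"
    # Drop the output if this pharmacy had no rows in the file
    [ -s "$out" ] || rm -f "$out"
  done
done
ls
```

With the sample data this leaves one output each for A123, B123 and C123, and nothing for D123. Running grep once per id means up to 450 passes over every input file, so this may be slow on large files.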

rdcwayx · July 15, 2010, 8:23am

Some tips for you.

If the location of the pharmacy id is not fixed, is there any other way to find it?

4)The location of the pharmacy id in each input flat file varies according to the source table structure.

If the pharmacy id is always in column 1, you can use the code below to split the data files by pharmacy id.

awk '{print > (toupper($1) "_" FILENAME)}' *.dat

For example, from your sample file, you will get three files after running the command.

A123_ODS_ADT_20100731_000001.dat
C123_ODS_ADT_20100731_000001.dat
B123_ODS_ADT_20100731_000001.dat
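
Since the position of the pharmacy id varies by table, one option is to look the column up by name in the header row. The sketch below assumes that every file starts with a header line containing a column literally named pharmacyid, that fields are whitespace-separated, that the header should not be copied to the outputs, and that the output name reuses the input file's date and timestamp; none of this is confirmed in the thread. The RX table and its rows are made up for the example. Each output file is closed after every write, because some awk implementations limit how many files can be open at once and 450 outputs could exceed that.

```shell
#!/bin/sh
# Work in a scratch directory with two sample files (illustrative data only).
workdir=$(mktemp -d) && cd "$workdir" || exit 1
cat > ODS_ADT_20100731_000001.dat <<'EOF'
pharmacyid description objectid reasoncode
A123 pharmacy1 101 null
b123 pharmacy2 102 null
C123 pharmacy3 103 null
A123 pharmacy1 104 null
B123 pharmcy 105 null
EOF
# Hypothetical second table where the id is in column 2
cat > ODS_RX_20100731_000001.dat <<'EOF'
rxid pharmacyid drug
9001 A123 aspirin
9002 c123 ibuprofen
EOF

awk '
FNR == 1 {                        # header row of each input file
    col = 0
    for (i = 1; i <= NF; i++)
        if (tolower($i) == "pharmacyid") col = i
    base = FILENAME
    sub(/^ODS_/, "", base)        # strip the ODS_ prefix
    sub(/\.dat$/, "", base)       # strip the .dat suffix
    next                          # do not copy the header to the outputs
}
col > 0 {                         # skip files with no pharmacyid column
    out = toupper($col) "_" base
    print >> out                  # append, since the file is reopened per row
    close(out)
}
' ODS_*.dat
ls
```

Because the outputs are opened in append mode, rerunning the command without clearing old outputs would duplicate rows. Opening and closing a file per row is also slow on large inputs; sorting by id first would allow closing a file only when the id changes.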

You can then use the cat command to combine all data for the same pharmacy id.

cat A123*20100731*.dat > A123_ODS_ADT.dat