Standardization of input source data files using shell script

Hi there,

I'm a newbie in Unix and am fishing for options on how raw input data files are handled. The scenario, as I'm sure y'all are very familiar with, is this: we receive upwards of 50 data files in ASCII format from various source systems. Each file has its own structure (columns, datatypes etc.) as well as certain "impurities", e.g. leading/trailing whitespace and junk characters (produced during conversion from mainframe data to ASCII). There is a need to 'sanitize' these files, i.e. strip them of whitespace, junk characters etc. How do we do this?

Ideally, we would like to have a common shell script that parses each input file and produces a clean version (is this possible, or will I need multiple shell scripts, one for each file?).
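To make it concrete, the kind of thing I was picturing is a single loop along these lines (just a rough sketch; the directory and file pattern are made up, not what we actually receive):

#!/bin/sh
# sketch: one pass over every incoming file, writing a cleaned copy
for f in /data/incoming/*.txt; do
    sed -e 's/^[[:space:]]*//' \
        -e 's/[[:space:]]*$//' "$f" > "$f.clean"   # trim leading/trailing whitespace
done

Would something like that scale to 50+ files with different layouts, or is that naive?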

Can you please provide feedback based on your experience...

Thanks

Post a sample of your input file and the desired output.

I have a similar scenario and use a Perl script that has a record-extraction sub which is called with the appropriate pre-compiled regex for the data format, based on the file name. It creates a filename.clean.csv which is then sqlldr'd into the system.
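In shell terms, the dispatch part of it looks roughly like this (a stripped-down analogue of the Perl, with made-up paths and file name patterns; your cleanup rules per format would obviously differ):

#!/bin/sh
# pick a cleanup rule based on the file name, write <name>.clean.csv for the loader
for f in /data/incoming/*; do
    case "$(basename "$f")" in
        CUST*)  sed 's/[[:space:]]*|[[:space:]]*/|/g' "$f" ;;   # pipe-delimited feed: trim around delimiters
        ORD*)   tr -d '\r' < "$f" ;;                            # this feed arrives with CR/LF line endings
        *)      cat "$f" ;;                                     # pass anything unknown through untouched
    esac > "${f}.clean.csv"
done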

While removing leading/trailing whitespace is not too difficult a task, the junk characters may range from easy to complex, depending e.g. on the locale and the character set that you are using. And when it comes to the different data structures, I guess at least one description/definition file per data file type will be necessary.
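For the junk characters, if your data really is plain ASCII, something as blunt as this often does it (only a sketch, and which bytes count as "junk" depends on your charset):

# force the C locale so [:print:] means plain ASCII, then delete anything
# that is not a printable character, tab or newline
LC_ALL=C tr -cd '[:print:]\t\n' < input.dat > input.clean

If the files can contain legitimate non-ASCII data (accented characters etc.), you cannot be that brutal and will need per-file rules, which again points to a definition file per data file type.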