Help with Shell Scrip in Masking particular columns in .csv file or .txt file using shell script

Mahesh_G · November 6, 2016, 8:12am

Hello Unix Shell Script Experts,

I have a script that would mask the columns in .csv file or .txt file.

First the script will untar the .zip files from Archive folder and processes into work folder and finally pushes the masked .csv files into Feed folder.

Two parameters are passed

1) Line of Business example : VA

and 2) Date YYYYMMDD : 20161101

The script will read the data from a table in a Database where it will have the position of the columns to be masked for all the .csv or .txt files.

objective here is to mask all the .csv or .txt files that are delimited with comma /pipe which has columns like tax_id and DOB columns.

It performs for one line of business properly across folders like untaring and generating the masked files with columns masked

however for other line of business it is not masking correctly at the positions

DOB format : 9999-12-12 (All dob columns will have this value)
Tax id : xxxx-xx-xx (All tax related id columns will have this format value)

Example : 4 and 30 position columns are TAX_ID and DOB then the script must mask these two columns.

There can be more columns to be masked.

say 10th 11th 12 th 40 th 120th positions so all these columns need to be masked

Script has 5 arrays each of which will store columns that are to be masked.

Problem
In the array am declaring all the columns of all the .csv files that are to be masked.
Array 1 contains all Date of Birth related columns
Array 2 contains all Tax id related columns.
Now one of the .csv file has columns like DATE-OF-BIRTH and TAX-ID
How to add the above 2 columun in the array along with other columns
I have tried by adding quotes single and double but did not work.

Finally the script is not behaving the way it is expected.
The objective of the script must be like : whenever it identifies the columns relsted to Dateof Birth or Tax id which will be hard coded then those must be masked.

Attaching the code for reference

Please let me know
Thanks
Mahesh G

RudiC · November 6, 2016, 8:57am

This request is very difficult to read/interpret. Examples of input and desired output data, demonstrating the logics to be applied, would definitely help. And, maybe a demo run with some defined values, e.g. for filesetcode .

As a general remark, that script seems a bit overcomplicated...

And, what do you mean with "mask" - eliminate column? Overwrite with a constant?

Mahesh_G · November 6, 2016, 10:31am

Data Sample

say one of the .csv file contains two columns DOB and TAX at 5 th and 10 th position. The script has these two column names hard coded.

Now the .csv file has data as shown below:

DOB format : 9999-12-12 (All dob columns will have this value)

Tax id : xxxx-xx-xx (All tax related id columns will have this format value)

Before masking

19990205  123468
20150314  524278
20061231  569874

Now after masking

9999-12-12 xxx-xx-xx
9999-12-12 xxx-xx-xx
9999-12-12 xxx-xx-xx

Please let me know if more information is required

Don_Cragun · November 6, 2016, 4:06pm

mahesh g:

Data Sample

say one of the .csv file contains two columns DOB and TAX at 5 th and 10 th position. The script has these two column names hard coded.

Now the .csv file has data as shown below:

DOB format : 9999-12-12 (All dob columns will have this value)

Tax id : xxxx-xx-xx (All tax related id columns will have this format value)

Before masking
19990205  123468
20150314  524278
20061231  569874
Now after masking
9999-12-12 xxx-xx-xx
9999-12-12 xxx-xx-xx
9999-12-12 xxx-xx-xx
Please let me know if more information is required

This just doesn't make sense. Why should masking a field change an eight digit field to a ten character field containing eight digits and two hyphens??? Why should masking a field change a six digit field to a nine character field containing seven <x>s and two hyphens???

Your text says that a column with the text DOB (apparently in the 1st line of the field) identifies a field number and every that field should be changed on every line in the file to 9999-12-12 . So, you are asking us to change the heading in that field from DOB to 9999-12-12 ??? Do you really want 9999-12-12 , or should it be 9999-12-31 ???

Please show us some representative sample input data (with sanitized, but not masked, data) in CODE tags AND show the the corresponding output that should be produced by your script in CODE tags.

Note that you refer to DOB , dob , Date of Birth , Dateof Birth , and DATE-OF-BIRTH in the above quoted text; I assume that you realize that all five of these column headings are different and none of them will compare equal to any of the others!

Are we supposed to assume that all of the files you will be processing will have the same fields in the same order? Or is each file different?

You say that you have <comma> and <vertical-bar> separated files. Does a single file ever contain both delimiters? If you have a <comma> delimited file, will the data in that file ever contain a <vertical-bar> character as data (not as a field delimiter)?

Mahesh_G · November 6, 2016, 10:39pm

It was for an example I referred above.
say if the column name in the .csv file is TaxID and it has value as shown below

Then aftermasking it is expected that the inside .csv file under the column TXID

xxx-xx-xx
xxx-xx-xx
xxx-xx-xx

Please note it is not that a six digit field to a nine character field containing seven <x>s and two hyphens.. In future the Tax id values can be more than 10 digits. so here the intent is if the column name is Tax_ID then mask. There can be more than one tax related column in the .csv file. Example : TaxID_1 , TaxID_2 , Tax_ID3 and so on. All we are looking here is the column names that will be hard coded in the script.

How to handle if a column name is like TAX-ID or DATE-OF-BIRTH ?

The script reads the input from a table as I said earlier which will have the file name ,the column names and the position of the columns where masking needs to happen.

Example : yes you are right 9999-12-31 should be the one. It was type error my end earlier. It is not the heading am asking to replace but for the values in that column. As I said the table will indicate the position number of the column that needs to be masked.

The .csv file can contain more than 1 column that are related to DOB , Date of Birth or DATE-OF-BIRTH .

Either at one go the file must be either comma delimited or pipe. but not both.

As I shared the script code earlier the flow goes as shown below
First the script will read the table that contains the column position for dob and tax related column . say the position is 5 and 10 for a column name DateOfBirth and Tax_Id respectively then as these two column names are hard coded in side the script it in an array. It will search for these two column names and if found then mask . Finally we will see in the out put masked file 5 and 10 positions will have 9999-12-31 and xxx-xx-xx

thanks
Mahesh G

Don_Cragun · November 6, 2016, 11:59pm

You have a script. (It is grossly inefficient, throws away all diagnostic output, and never checks for errors; so you and we have no idea where something is going wrong if something is going wrong.) You have not said what it is doing wrong. You have not shown us sample CSV files from your database extractions. You have not shown us the output you are getting from your script. And, you have not shown us the output you are hoping to get from your script.

I would guess that instead of invoking awk once for each field to be updated in each file to be updated, you could just invoke awk once to update all of the fields in all of the files. And, this would make your script run much faster. But, without samples and a clear statement of what you are trying to change in your existing script; we don't know what needs to be done.

Please help us help you!