Replacing 12 columns of one file by second file based on mapping in third file

megh12 · July 29, 2016, 3:28am

i have a real data prod file with 80+ fields containing 1k -2k records. i have to extract say 12 columns out of this which are sensitive fields along with one primary key say SEQ_ID (like DOB,account no, name, SEQ_ID, govtid etc) in a lookup file. i have to replace these sensitive fields in lookup file by mocked data contained in second file say mocked_Store.dat but the order of fields in second file may be different from first file. The mapping of which field number in first file to be replaced by which field no of second file is contained in a third file called mapping.dat. Once the replacement is done, those 12 sensitive fields need to be put back in original prod file. In the end i need, prod file having mocked data on few fields+ lookup file having original fields and mocked fields value. Please help.TIA

RudiC · July 29, 2016, 3:55am

How about some decent sample input data (sensitive info concealed), abbreviated if need be, the desired output, and a description of the logics/algorithms that connect the two?

rbatte1 · July 29, 2016, 4:13am

So this is to anonymise the production data I think. Is that right?

Do you need the overwritten details to be unique or could they be the same, or perhaps a simple sequential value, e.g. Name becomes Fnamaaaaaaaa Snamaaaaaaaa through to Fnamzzzzzzzz Snamzzzzzzzz

If you need them to be random then that could probably be done or if you need to generate the random details and be able to reuse them then if we can store them in a file, we might be able to use awk to read two files in and merge the output. It might even be possible with paste

What we will need is some good input data (sanitised of course) and with a fuller description of what/how you need to achieve.

If you could paste your input/output in CODE tags, it will make it easier to read and preserve multiple spaces for fixed width data.

Kind regards,
Robin

megh12 · July 29, 2016, 4:59am

reposting my question with more details as advised:
i have a real data prod file with 80+ fields. Need to replace say 12 columns out of this(based on field#) which are sensitive fields by mocked data contained in second file say mocked_Store.dat row by row. The mapping of which field number in first file to be replaced by which field no of second file is contained in a third file called mapping.dat. In the end i need, rea file having mocked data on given fields and a lookup file having original fields and corresponding mocked fields value. it also needs to have a primary key from prod file <say SEQ_ID, position given in mapping file>. Please help.TIA. Please see expected input/output through below example with sample data.

Input 1: sample prod file:

col a|SEQ_ID| first_name|last_name|full_name|DOB|col b| col c|Govt_id|col d
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1

Input 2 : mocked_store.dat:\(containing mocked data\)

DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand

Input 3: mapping file:

Prod file Field number| mocked_store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced

OUTPUT1: prod file <actual data replaced by mocked data>

col a|SEQ_ID| first_name|last_name|full_name|DOB|col b| col c|Govt_id|col d
value a|100000|Meena |kumari|Meena kumari|02/02/1981|value b|value c|ABCDEF1232F|value d
value a1|100003|Dhyan |chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1

OUTPUT 2: lookup file < prod data+ mocked up data ,fields in order of mapping file>

first_name|first_name_mocked|last_name|last_name_mocked|full_name|full_name_mocked|DOB|DOB_mocked|Govt_id|Govt_id_mocked|SEQ_ID

vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/10/1984|02/02/1981| AOYUGH9282P| ABCDEF1232F|100000

ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P| ABCD4567M|100003

Please NOTE: The real data file does not have field headers. The output lookup file data fields can be in any order. currently i am thinking of having it in same order as mapping file.

---------- Post updated at 03:59 AM ---------- Previous update was at 03:49 AM ----------

Thankyou for suggesting edit in the post. yes this is to mask data such that data remains valid but not the real. so we are provided with mock stores which are created by us and is completely fake data. Please share the relevant awk commands to accomplish this. i have added sample data in my example to make clear expected input output.

RudiC · July 29, 2016, 5:46am

You connect fields in the prod file with fields in the "mock" file. How are lines (resp. records) in those files connected? By line No.?

RudiC · July 29, 2016, 6:52am

Assuming the lines are connected by line No. Try

awk -F"|" '
FNR == 1        {NoF++
                 next
                }
NoF == 1        {sub (/ *#.*$/, _)
                 TR1[$2] = $1
                 COL[$1]
                 next
                }
NoF == 2        {for (i=1; i<=NF; i++) TR2[FNR,TR1] = $i
                 next
                }

NoF == 3        {for (c in COL)         {printf "%s%s%s%s", $c, OFS, TR2[FNR,c], OFS > "lookupfile"
                                         if (TR2[FNR,c]) $c = TR2[FNR,c]
                                        }
                 printf RS > "lookupfile"
                }

1

' mapfile mockfile OFS="|" prodfile
value a|100000|Meena|Kumari|Meena Kumari|02/02/1981|value b|value c |ABCDEF1232F|value d
value a1|100003|Dhyan|Chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1
cat lookupfile 
100000||vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/101984|02/02/1981|AOYUGH9282P|ABCDEF1232F|
100003||ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P|ABCD4567M|

If the lookupfile's structure doesn't suit you, additional measures must be taken.

megh12 · July 29, 2016, 2:56pm

Thanks a lot RudiC for the detailed reply. yes we will do the replacement row wise line by line. I put your code in a script create_testdata.ksh and ran it. Below are my observations please:

 $ cat prodfile
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1
value a2|100005|nisha|verma|nisha verma|12/12/1987|value b2|value c2|AOJYGFT345F|value d2
  
 $ cat mockfile
DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari|
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand|
02/02/1983|John Abraham|ABCDEF234M|John|Abrahm|
  
 $ cat mapfile
prodfile field number|store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced

output:

$ ./create_testdata.ksh
value a1|100003|Meena|Kumari|Meena Kumari|02/02/1981|value b1|value c1|ABCDEF1232F|value d1
value a2|100005|Dhyan|Chand|Dhyan Chand|02/02/1982|value b2|value c2|ABCD4567M|value d2
  
 $ cat lookupfile
kumar|Kumari|ravi kumar|Meena Kumari|01/01/1987|02/02/1981|AOJKUYT0908P|ABCDEF1232F|100003||ravi|Meena|
verma|Chand|nisha verma|Dhyan Chand|12/12/1987|02/02/1982|AOJYGFT345F|ABCD4567M|100005||nisha|Dhyan|

Mostly it ran perfect except few minors please:

we are missing first line of prod file always
the first line of mockfile is replacing second line of prodfile & so on not row wise< if we remove the header in mockfile the order becomes correct but the header is needed in this file>
Lookup file fields are in same order as map file except the first field<first_name> which is coming in end in place of begining.

Thanks again for your kind patience . please elaborate on the code a little so I could understand it and expand it if needed.

RudiC · July 29, 2016, 3:41pm

In your samples the prod file had a header line. ALL header lines are discarded.

NoF is the sequence No. of the input files. With the first, a field mapping array is filled. With the second, a translation array (indexed by line No. (FNR) and target field No.) is built, i.e. the strings to be filled into the resp. fields in resp. lines. When the third is worked upon, the target fields are being replaced from the beforementioned array, and the lookup file is being written.

megh12 · July 30, 2016, 6:56am

Hi RudiC
Please help me in correcting the lookup file so that order of records appearing in it is same as mapfile .currently the field 'first_name' is appearing as last field in lookup instead of first field as per map file.TIA

RudiC · July 30, 2016, 7:18am

So - what did this statement mean?

EDIT: Your additional request means a major rewriting of the scriptlet, as ( man awk ):

megh12 · July 30, 2016, 8:05am

okay thankyou

RudiC · July 30, 2016, 8:45am

After some cogitating, the rewrite is not that complex, although it needs to set up another, entirely new and sort of independent data structure. Try

awk -F"|" '
FNR == 1        {NoF++
                 next
                }
NoF == 1        {sub (/ *#.*$/, _)
                 TR1[$1]   = $2
                 SQ[++SMX] = $1
                }
NoF == 2        {for (t in TR1) if (TR1[t]) TR2[FNR,t] = $(TR1[t])
                }

NoF == 3        {for (i=1; i<=SMX; i++) printf "%s|%s|", $SQ, TR2[FNR,SQ] > LOF
                 printf RS > LOF
                 for (t in TR1)         if (TR1[t]) $t = TR2[FNR,t]
                 print
                }

' mapfile mockfile OFS="|" LOF="lookupfile" prodfile
value a|100000|Meena|Kumari|Meena Kumari|02/02/1981|value b|value c |ABCDEF1232F|value d
value a1|100003|Dhyan|Chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1
cat lookupfile 
vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/101984|02/02/1981|AOYUGH9282P|ABCDEF1232F|100000||
ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P|ABCD4567M|100003||