Shell script to apply functions to multiple columns dynamically

mkathi · November 10, 2018, 5:18pm

Hello,

I have a requirement to apply hashing algorithm on flat file on one or more columns dynamically based on header

sample input file

ID|NAME|AGE|GENDER
10|ABC|30|M
20|DEF|20|F

say if i want multiple columns based on the header example id,name or id,age or name,gender and hash and store the output value in a new column for each row

example

echo -n '10,abc'|md5sum =hashvalue

expected sample output:

ID|NAME|AGE|GENDER|HASHEDCOLUMNS|HASHVALUE
10|abc|30|M|ID,NAME|hashvalue
20|def|20|F|ID,NAME|hashvalue

or

ID|NAME|AGE|GENDER|HASHEDCOLUMNS|HASHVALUE
10|abc|30|M|ID,gender|hashvalue
20|def|20|F|ID,gender|hashvalue

I am having hard time visualizing this
I have written something but i dont know if its going to work or not really lost here.

#!/usr/bin/sh

target="id"   # here i m passing the header column to be md5 applied ideally this is parameterized and can send more than one column name seperated by comma "
target_idx=   # the column number of that field

{
  # reading header.
  IFS=, read -r -a header
  for idx in "${!header[@]}"; do  # and look for the target field in it
    [[ ${header[$idx]} = $target ]] && { target_idx=$idx; break; }
  done
  # then, iterate over lines
  while IFS=, read -r -a line; do
    a=echo-n ${line[$target_idx]}|md5sum # if my input in target is "id"
  done
} <testfile.txt

echo$line|"$a"

Thanks,
kathi

bakunin · November 10, 2018, 7:30pm

If you want multiple columns you need to make the variable target_id and target_idx both arrays, no?

Then there is this:

  # then, iterate over lines
  while IFS=, read -r -a line; do
    a=echo-n ${line[$target_idx]}|md5sum # if my input in target is "id"
  done
} <testfile.txt

echo$line|"$a"

You iterate over the input lines, yes, but you produce output (the echo-line) only once: after you have read all the input. This must be wrong, no? Apart from the fact that after "echo" there needs to be a space (like after any command), but i suppose that was simply a typing error.

Also this line is not correct:

a=echo-n ${line[$target_idx]}|md5sum # if my input in target is "id"

It should perhaps read:

a=$(echo -n ${line[$target_idx]}|md5sum) # if my input in target is "id"

For parametrisation via the commandline i suggest to read the man page for getopts . I have just written an example in another thread you might want to use as a starting point.

I hope this helps.

bakunin

mkathi · November 10, 2018, 8:20pm

Hello you are right

I have to pass the input as arrays yes and i did make a typo near the echo

yes the echo at the end is outside the loop so it will only print out the last row i think

issue is even if i pass the target as variable the code is only picking the first value passed generating the md5 and outputting the md5 for that line. I am trying to figure out how can i loop through all the columns passed inside the array and concatenate them using, and then calculate their md5 and then write it back to the file as a new column.

I know its a strange requirement Please point me to the write direction

I m trying something like this with no avail

#!/usr/bin/sh

target=("id,name")  
target_idx=()   # the column number of that field
{
  # reading header.
for i in $(echo $target | sed "s/,/ /g");do # trying to pass the comma seperated values inside target variable
  IFS=, read -r -a header
  for idx in "${!header[@]}"; do  # and look for the target field in it
    [[ ${header[$idx]} = $target ]] && { target_idx=$idx; break; }
  done
 done
  # then, iterate over lines
  while IFS=, read -r -a line; do
    a=$(echo -n ${line[$target_idx]}|md5sum)
    echo "$line"
  done
}

mkathi · November 10, 2018, 8:45pm

Hello you are right

I have to pass the input as arrays yes and i did make a typo near the echo

yes the echo at the end is outside the loop so it will only print out the last row i think

issue is even if i pass the target as variable the code is only picking the first value passed generating the md5 and outputting the md5 for that line. I am trying to figure out how can i loop through all the columns passed inside the array and concatenate them using, and then calculate their md5 and then write it back to the file as a new column.

I know its a strange requirement Please point me to the write direction

I m trying something like this with no avail

#!/usr/bin/sh

target=("id,name")  
target_idx=()   # the column number of that field
{
  # reading header.
for i in $(echo $target | sed "s/,/ /g");do # trying to pass the comma seperated values inside target variable
  IFS=, read -r -a header
  for idx in "${!header[@]}"; do  # and look for the target field in it
    [[ ${header[$idx]} = $target ]] && { target_idx=$idx; break; }
  done
 done
  # then, iterate over lines
  while IFS=, read -r -a line; do
    a=$(echo -n ${line[$target_idx]}|md5sum)
    echo "$line"
  done
}

------ Post updated at 01:45 AM ------

Just a thought may be i should change my thinking. Just throwing it out i still did not figure out how to write the code but i may have a starting point

step one is if i can concatenate all the column names i send as a parameter to the script and then apply md5 function on it and then store the result of the output in a column in the end does this make any sense.

RudiC · November 11, 2018, 11:02am

Not sure I fully understand what you're after, but try

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = ""
                 for (i=1; i<=CNT; i++) TMP = TMP $(COL)
                 ("echo -n " TMP " | md5sum") | getline MD5
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="ID,NAME" file
ID|NAME|AGE|GENDER|HASHED COLUMNS|HASHVALUE
10|ABC|30|M|ID,NAME|f2e88b4a581adf6c7833d95799b7a08c
20|DEF|20|F|ID,NAME|09b01f45ed9092b4c3c155911571bf31

and report back how far it gets you...

mkathi · November 11, 2018, 12:07pm

rudic:

Not sure I fully understand what you're after, but try

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
   print $0, "HASHED COLUMNS", "HASHVALUE"
   next
   }
   {TMP = ""
   for (i=1; i<=CNT; i++) TMP = TMP $(COL)
   ("echo -n " TMP " | md5sum") | getline MD5
   sub (/ *-/, "", MD5)
   print $0, MCOL, MD5
   }
' OFS="|" MCOL="ID,NAME" file
ID|NAME|AGE|GENDER|HASHED COLUMNS|HASHVALUE
10|ABC|30|M|ID,NAME|f2e88b4a581adf6c7833d95799b7a08c
20|DEF|20|F|ID,NAME|09b01f45ed9092b4c3c155911571bf31

and report back how far it gets you...

hello firstly thanks for the time.

Yes this is exactly what I am looking for but the generated hash values do not match in the above output for example the first row

10|ABC|30|M|ID,NAME|f2e88b4a581adf6c7833d95799b7a08c

the hash value is f2e88b4a581adf6c7833d95799b7a08c but here I am expecting output hash value as 73aca49763216fb96bbc2acef7b60afb i.e

10|ABC|30|M|ID,NAME|73aca49763216fb96bbc2acef7b60afb

the way i arrived to this value is

echo -n '10,abc'|md5sum and this returns me 73aca49763216fb96bbc2acef7b60afb

thanks

RudiC · November 11, 2018, 12:24pm

Please be aware that the md5sum of '10,abc' will NEVER be 73aca49763216fb96bbc2acef7b60afb as it is case sensitive.
Looks like you want a comma included. Try

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = ""
                 for (i=1; i<=CNT; i++) TMP = TMP "," $(COL)
                 ("echo -n " substr (TMP, 2) " | md5sum") | getline MD5
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="ID,NAME" file
ID|NAME|AGE|GENDER|HASHED COLUMNS|HASHVALUE
10|ABC|30|M|ID,NAME|73aca49763216fb96bbc2acef7b60afb
20|DEF|20|F|ID,NAME|9d6555fe65eb60b2f7d9174b56f667f5

mkathi · November 11, 2018, 12:47pm

rudic:

Please be aware that the md5sum of '10,abc' will NEVER be 73aca49763216fb96bbc2acef7b60afb as it is case sensitive.
Looks like you want a comma included. Try

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
   print $0, "HASHED COLUMNS", "HASHVALUE"
   next
   }
   {TMP = ""
   for (i=1; i<=CNT; i++) TMP = TMP "," $(COL)
   ("echo -n " substr (TMP, 2) " | md5sum") | getline MD5
   sub (/ *-/, "", MD5)
   print $0, MCOL, MD5
   }
' OFS="|" MCOL="ID,NAME" file
ID|NAME|AGE|GENDER|HASHED COLUMNS|HASHVALUE
10|ABC|30|M|ID,NAME|73aca49763216fb96bbc2acef7b60afb
20|DEF|20|F|ID,NAME|9d6555fe65eb60b2f7d9174b56f667f5

Yes thank-you, you are right I should be more careful with the cases but the output now is as expected
Also can the last line MCOL="ID,NAME" be parameterized example

mcols=$1
filename=$2
awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = ""
                 for (i=1; i<=CNT; i++) TMP = TMP "," $(COL)
                 ("echo -n " substr (TMP, 2) " | md5sum") | getline MD5
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="$(mcols)" file

and call the script like sh myscript.sh "ID,NAME"(I think it is a silly question but never the less I am asking)

Also I am trying to understand your code line by line could you point me how debug this code I am not asking you to explain line by line but can you pls point me towards the direction for me to better understand the code.

Thanks.

RudiC · November 11, 2018, 5:30pm

Yes.

How about just trying and using MCOL="$1" ? The syntax you used is called "command substitution".

When operating on the first data line, I collect the target column numbers as indices into an array. For all remaining lines, I assemble these column values separated by commas into a TMP variable, execute echo ... | md5sum on it, getline the result into a variable, and, after some massaging, print the desired outout line.
Aside, if your file has more lines than awk allows for open files, you'd need to close the system calls after each use...

mkathi · November 12, 2018, 12:21pm

Yes I will try command substitution when I reach to work tomorrow online unix terminals are giving me a hard time.

Thanks for explaining the code after a lot of googling I am at the stage where i can understand 80% of the code written except why "," MCOL "," ~ "," $1 "," the "," in this if statement but I am learning awk and will figure it out soon.

I don't quite understand what

actually means is it a syntax or is it like open and closing cursors in plsql( sorry bas example but sql is the only language i am comfortable for now)

Thanks.

RudiC · November 12, 2018, 2:45pm

Print out the two and compare / apply the matching operator ~ .

awk allows for a not too small but limited number of open files, of which each echo ... | md5sum consumes one. So, once you reach that limit, action needs to be taken.

mkathi · November 12, 2018, 3:00pm

my files reea hude ranging from 10000 rec to 30000000 but I am going to use one file at time i mean open one file at a time will that still be a issue.

RudiC · November 12, 2018, 3:18pm

That script consumes an "open file" for every line in your input file. 10000 may but 3000000 definitively will be too many. Try

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = "echo -n " $(COL[1]) ","
                 for (i=2; i<CNT; i++) TMP = TMP $(COL) ","
                 TMP = TMP $(COL[CNT]) " | md5sum"
                 TMP | getline MD5
                 close (TMP)
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
 ' OFS="|" MCOL="ID,NAME,AGE" file

or

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
                 print $0, "HASHED COLUMNS", "HASHVALUE"
                 next
                }
                {TMP = "echo -n "
                 for (i=1; i<=CNT; i++) TMP = TMP $(COL) ","
                 sub (/,$/, " | md5sum", TMP)
                 TMP | getline MD5
                 close (TMP)
                 sub (/ *-/, "", MD5)
                 print $0, MCOL, MD5
                }
' OFS="|" MCOL="ID,NAME,AGE" file

mkathi · November 12, 2018, 3:26pm

rudic:

That script consumes an "open file" for every line in your input file. 10000 may but 3000000 definitively will be too many. Try

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
   print $0, "HASHED COLUMNS", "HASHVALUE"
   next
   }
   {TMP = "echo -n " $(COL[1]) ","
   for (i=2; i<CNT; i++) TMP = TMP $(COL) ","
   TMP = TMP $(COL[CNT]) " | md5sum"
   TMP | getline MD5
   close (TMP)
   sub (/ *-/, "", MD5)
   print $0, MCOL, MD5
   }
 ' OFS="|" MCOL="ID,NAME,AGE" file

or

awk -F\| '
NR == 1         {for (i=1; i<=NF; i++) if ("," MCOL "," ~ "," $i ",") COL[++CNT] = i
   print $0, "HASHED COLUMNS", "HASHVALUE"
   next
   }
   {TMP = "echo -n "
   for (i=1; i<=CNT; i++) TMP = TMP $(COL) ","
   sub (/,$/, " | md5sum", TMP)
   TMP | getline MD5
   close (TMP)
   sub (/ *-/, "", MD5)
   print $0, MCOL, MD5
   }
' OFS="|" MCOL="ID,NAME,AGE" file

Thanks i will have a chance to run this tomorrow against a large dataset and i will get back with the results..
note: this script taught me a lot about indexes thanks for that.