awk to create subdirectory based on match between two files

cmccabe · March 18, 2019, 8:13pm

In the awk below I am trying to mkdir based on an exact match between the line in file2 starting with R_2019... and the line in file1 starting with R_2019... . When a match is found, there is a folder at /home/cmccabe/run with the same name as the match, and each $2 in file1 should become a new subdirectory in that folder. There will always be a match to an R_2019..., but there may be more than one. That is, there may be 2 or 3 R_2019... lines that have matches, but they will always be unique. The awk as is does execute but produces nothing, so I tried adding cmd_fmt='mkdir -p "%s/%s"' to store each new subdirectory in cmd_fmt. I then added -v cmd_fmt="$cmd_fmt" to the start of the awk to create the matched subdirectory, but that did not work as expected. I am using Ubuntu 14.04 and have added comments. Any line in file1 that contains Negative can be skipped and does not need a subdirectory created. Thank you :).

awk

awk '
    # create an associative array (key/value pairs) based on the file1
    NR==FNR { for(i=2; i<NF; i+=2) a[substr($i,1,7)] = $NF; next } 

    # retrieve the first 7-char of each line in file2 as the key to test against the above hash
    { k = substr($0, 1, 7) }

    # if find k, then print
    k in a { print a[k] "\t" $0 "\t" l }

    # save prev line to 'l' which is the ID
    { l = $0  } 

' RS= file1 RS='\n' file2

file1

IonCode_0267 Negative_water
IonCode_0255 19-0000-LastName-FirstName
IonCode_xxxx 19-0002-L-F
IonCode_xxxx 19-0003-LaNa-FiNa
IonCode_xxxx 19-0004-La-Fi
IonCode_xxxx Control-Positive-0318
R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

file2

R_2019_02_15_11_56_40_user_S5-0000-00-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
R_2019_03_12_11_10_20_user_S5-0000-01-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

Chubler_XL · March 18, 2019, 9:19pm

I added a debug line (the print statement inside the loop) to your array-building code:

awk '
    # create an associative array (key/value pairs) based on the file1
    NR==FNR { for(i=2; i<NF; i+=2) {
        a[substr($i,1,7)] = $NF
        print "a[" substr($i,1,7)"] = " $NF
    }
    next } 

    # retrieve the first 7-char of each line in file2 as the key to test against the above hash
    { k = substr($0, 1, 7) }

    # if find k, then print
    k in a { print a[k] "\t" $0 "\t" l }

    # save prev line to 'l' which is the ID
    { l = $0  } 

' RS= file1 RS='\n' file2

With the example files, this produces the following array:

a[Negativ] = R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
a[19-0000] = R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
a[19-0002] = R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
a[19-0003] = R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
a[19-0004] = R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
a[Control] = R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

Since no line in file2 starts with any of these keys (Negativ, Control, or 19-0000 through 19-0004), nothing matches and you get no output.
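
One possible way around the key mismatch, offered only as a sketch: key the array on the whole R_2019... line from file2 rather than on a 7-character prefix, then read file1 in paragraph mode and test its last field against that array. The scratch directory, the array name runs, and the variable pairs are illustrative assumptions, not part of the original script; the sample data is copied from the thread.

```shell
# Sample data from the thread, written to a scratch directory (assumption:
# mktemp -d is available, as it is on Ubuntu).
work=$(mktemp -d)
cat > "$work/file1" <<'EOF'
IonCode_0267 Negative_water
IonCode_0255 19-0000-LastName-FirstName
IonCode_xxxx 19-0002-L-F
IonCode_xxxx 19-0003-LaNa-FiNa
IonCode_xxxx 19-0004-La-Fi
IonCode_xxxx Control-Positive-0318
R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
EOF
cat > "$work/file2" <<'EOF'
R_2019_02_15_11_56_40_user_S5-0000-00-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
R_2019_03_12_11_10_20_user_S5-0000-01-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
EOF

# file2 is read first, line by line, so every run name becomes an array key.
# file1 is then read in paragraph mode (RS=), so $NF is its R_2019... line.
pairs=$(awk '
    NR==FNR { runs[$0]; next }          # file2: remember each full run name
    ($NF in runs) {                     # file1: its run name matches exactly
        for (i = 2; i < NF; i += 2)     # every second field is a sample name
            if ($i !~ /Negative/)       # skip the negative controls
                print $NF "\t" $i
    }
' "$work/file2" RS= "$work/file1")

# One line per (run name, sample name) pair that would need a subdirectory.
printf '%s\n' "$pairs"
```

Note that the file order on the command line is reversed compared with the original: file2 comes first so that its lines are loaded before file1 is scanned.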

cmccabe · March 18, 2019, 10:01pm

Switching file1 and file2 should match the R_2019... line, but I think the sub-directories are going to be created in the same directory where file1 and file2 exist, not in the desired one. Thank you :).
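
A minimal sketch of one way to control where the sub-directories land: pass the parent folder into awk with -v and have awk print full paths, leaving the actual mkdir -p to the shell. Here base is a hypothetical variable pointing at a scratch stand-in for /home/cmccabe/run, and run, work, and runs are likewise illustrative names; the file contents come from the thread.

```shell
# Scratch setup; "$base" stands in for /home/cmccabe/run (assumption).
work=$(mktemp -d)
base="$work/run"
run=R_2019_03_12_13_59_54_user_S5-0000-000-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions

cat > "$work/file1" <<EOF
IonCode_0267 Negative_water
IonCode_0255 19-0000-LastName-FirstName
IonCode_xxxx 19-0002-L-F
IonCode_xxxx 19-0003-LaNa-FiNa
IonCode_xxxx 19-0004-La-Fi
IonCode_xxxx Control-Positive-0318
$run
EOF
cat > "$work/file2" <<EOF
R_2019_02_15_11_56_40_user_S5-0000-00-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
R_2019_03_12_11_10_20_user_S5-0000-01-v5.6_Oncomine_Childhood_Cancer_Research_DNA_and_Fusions
$run
EOF

# Per the question, the matching run folder already exists under the base.
mkdir -p "$base/$run"

# awk only prints absolute paths; the shell loop creates them, so the result
# does not depend on the directory the command is run from.
awk -v base="$base" '
    NR==FNR { runs[$0]; next }          # file2: remember each full run name
    ($NF in runs) {                     # file1 (paragraph mode): run matches
        for (i = 2; i < NF; i += 2)     # every second field is a sample name
            if ($i !~ /Negative/)       # no sub-directory for negatives
                print base "/" $NF "/" $i
    }
' "$work/file2" RS= "$work/file1" |
while IFS= read -r dir; do
    mkdir -p -- "$dir"
done

ls "$base/$run"
```

Printing paths and creating them in the shell keeps quoting simple; building a mkdir command string inside awk and running it with system() would also work, but it needs more care with unusual characters in names. One caveat: awk interprets backslash escapes in -v values, so a base path containing backslashes would need different handling.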