Script to find string based on pattern and search for its corresponding rows in column

as7951 · July 25, 2019, 8:07am

Experts,
Need your support for this awk script.

we have only one input file, all these column 1 and column 2 are in same file and have to do lookup for values in one file(column1 and column2) but output we need in another file
Need to grep row whose string contains 9K from column 1. When found match, grep that row suppose(BGL_0BC_901_1AG_A_CASR9KTR176) and pick corresponding rows form column 2 for that column 1 matched string.
If column2 contains 5 rows, then pick each value one by one and search in column1 and grep its corresponding row data from column 2 and prepare new output file in below format.
In output file all lines should start with columns value/string containing 9K

BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879

For example :

if strings in rows of column1 contains 9K. Suppose we find (BGL_0BC_901_1AG_A_CASR9KTR176) in column 1. Now pick it's corresponding row value(BGL_KMR_919_1AC_B_CASR920R899) from column 2 and look in column1

column1,column2
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0UT_901_1AC_CASR903R551
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_YOT_919_1AC_CASR903R458
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0BC_901_1AC_T_CASR920R504
BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_CHT_903_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_BGM_908_1AC_CASR903R173
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_932_1AC_CASR903R963
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_918_1AC_CASR903R963

if value(BGL_KMR_919_1AC_B_CASR920R899) is found in column 1 then pick it's corresponding row value from column 2 and if not found output the result in new file like below

output file :

BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899

Continuing, Now again for(BGL_0BC_901_1AG_A_CASR9KTR176) start and take the second row value(BGL_CHT_903_1AC_B_CASR920R879) from column 2 and look for this in column 1 and if found pick for its corresponding row value(BGL_BAM_910_1AC_B_CASR920R879) from column 2

column1                            column 2
BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879

Now again look for value(BGL_BAM_910_1AC_B_CASR920R879) in column 1, if found pick for its corresponding row value(BGL_BAM_912_2AC_B_CASR920R879) from column 2 In this case we don't have to consider(BGL_CHT_903_1AC_B_CASR920R879) as we have already looked for it in above part

column 1                          column 2
BGL_BAM_910_1AC_B_CASR920R879 ,BGL_CHT_903_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879

Now again look for row value picked from column 2(BGL_BAM_912_2AC_B_CASR920R879) in column 1 and if found pick for its corresponding column 2 value(BGL_BAM_912_1AC_B_CASR920R879) In this case also we don't have to consider(BGL_BAM_910_1AC_B_CASR920R879) as we have already looked for it in above part

BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879

Now again look for row value picked from column 2(BGL_BAM_912_1AC_B_CASR920R879) in column 1 and if found pick for its corresponding column 2 value(BGL_BAM_907_1AC_B_CASR920R879) Tn this case also we don't have to consider(BGL_BAM_912_2AC_B_CASR920R879 as we have already looked for it in above part

BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879

Now again look for row value picked from column 2(BGL_BAM_907_1AC_B_CASR920R879) in column 1 and if don't find any corresponding row value in column 2. Stop the search and append the result to above sampple output file in below format

GL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879

Similary grep done for(BGL_0BC_901_1AG_A_CASR9KTR176), have to repeat and keep looking for the values in other rows values in column 2 in complete csv file and have to perform operation like above.

Similary have to perform operation like above for all others strings like BGL_2BC_901_1AG_A_CASR9KTR124 that has 9K in column 1 in this csv file.

Code i am trying but dnt know how to complete the code for above required

awk '
NR==FNR{
    assoc[$1]=$2
    next
}
FNR!=1{         
    printf "%s,%s", $1,$2
    seen[$1]; seen[$2]
    search=$2 
    while((search in assoc) && !(assoc in seen)){
        search=assoc
        printf ",%s", search
        seen
    }
    print ""
    for(var in seen){ 
         delete seen[var]
    }
}' inputfile.csv inputfile.csv > output.csv

Scrutinizer · July 25, 2019, 10:21am

Hi, the FS value does not seem to be set to ,

awk -F, '

Then I get this output with your script:

BGL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0UT_901_1AC_CASR903R551
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_YOT_919_1AC_CASR903R458
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_0BC_901_1AC_T_CASR920R504
BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_CHT_903_1AC_B_CASR920R879
BGL_BAM_910_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_910_1AC_B_CASR920R879
BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879
BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_BGM_908_1AC_CASR903R173
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_932_1AC_CASR903R963
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_2BC_901_1AC_T_CASR920R948
BGL_2BC_901_1AG_A_CASR9KTR124,BGL_ABT_918_1AC_CASR903R963

as7951 · July 25, 2019, 2:42pm

Hi @Scrutinizer.. issue is, in output file each line has to start with string containing 9K like below .

GL_0BC_901_1AG_A_CASR9KTR176,BGL_KMR_919_1AC_B_CASR920R899
BGL_0BC_901_1AG_A_CASR9KTR176,BGL_CHT_903_1AC_B_CASR920R879,BGL_BAM_912_2AC_B_CASR920R879,BGL_BAM_912_1AC_B_CASR920R879,BGL_BAM_907_1AC_B_CASR920R879

For example, consider below sample input file :

Need to grep row whose string contains 9K in column 1 and then grep its corresponding row in column 2. Suppose check for 9K1 then grep A1, check for A1 in column 1 if not, output result as shown in below expected output file
Then again check for 9K1, grep A2, look for A2 in column1, if found grep its corresponding row value B2.
Now check for B2 value in column 1, then grep for C2 instead of A2 as we have already considered previously in lookup. Now look for C2 in column and grep D2 instead of B2.
This needs to be checked for all rows in column 1 that contains 9K, as there can be rows with value 9K2, 9K3, 9K4 in column 1 with corresponding data in column2.

Sample Input File :

9K1,A1
9K1,A2
9K1,A3
9K1,A4 
9K1,A5 
A2,B2
B2,A2
B2,C2
C2,B2
C2,D2
A5,B5
B5,C5
B5,A5
9K1,A6
A6,B6
B6,A6
B6,C6

Output required :

9K1,A1
9K1,A2,B2,C2,D2
9K1,A3
9K1,A4
9K1,A5,B5,C5
9K1,A6,B6,C6

But below is output returned by code.(which is not as expected output required)

9K1,A3
9K1,A4
9K1,A5
9K1,A6,B6,C6

Chubler_XL · July 25, 2019, 4:14pm

You should store all the $2 values that match as a CSV in assoc. Then use a recursive function to walk thru assoc for each match:

awk -F, '
function prn_assoc(val,cnt,newvals)
{
   if(val in seen) return
   seen[val]
   printf ",%s",val
   split(assoc[val], newvals)
   for(cnt in newvals)
       prn_assoc(newvals[cnt])
}
NR==FNR{
    if($1 in assoc)
       assoc[$1]=assoc[$1] FS $2
    else assoc[$1]=$2
    next
}
$1~"9K" {
   printf "%s", $1
   split("", seen)
   seen[$1]
   prn_assoc($2)
   printf "\n"
}' inputfile.csv inputfile.csv > output.csv

as7951 · July 25, 2019, 9:58pm

@chubler_XL....with you code i am getting below 6 row data from which mentioned below one row data(9K1,A5,B5,C5)is missing instead of this in output i am getting row with value 9K1,A5
Could you please help to fix this.

Missing data :

9K1,A5,B5,C5

Output :

9K1,A1
9K1,A2,B2,C2,D2
9K1,A3
9K1,A4
9K1,A5
9K1,A6,B6,C6

Below is Required output :

9K1,A1
9K1,A2,B2,C2,D2
9K1,A3
9K1,A4
9K1,A5,B5,C5
9K1,A6,B6,C6

Chubler_XL · July 25, 2019, 10:34pm

Rows 4 and 5 of your input file have a space following the 2nd field:

9K1,A1
9K1,A2
9K1,A3
9K1,A4<SPACE>
9K1,A5<SPACE>

This is why they don't match the keys later on and is causing your missing data.

as7951 · July 30, 2019, 2:00am

Hi Chubler_XL,

Thank you very much..
Really appreciate your help.
Got the expected output.