Match and retrieve information from file

Hello

I just want to ask how to get the match of information column 2 file 1 and retrieve information from column 2 file 2. The column exon in file 1 and column color code in file 2.

File 1

//NODECOLORCODE         "Exon 1"        "ENST00000595813"
//NODECOLORCODE         "Exon 1"        "ENST00000596051"
//NODECOLORCODE         "Exon 2"        "ENST00000243643"
//NODECOLORCODE         "Exon 2"        "ENST00000421033"
//NODECOLORCODE         "Exon 3"        "ENST00000243643"
//NODECOLORCODE         "Exon 3"        "ENST00000421033"
//NODECOLORCODE         "Exon 4"        "ENST00000243643"
//NODECOLORCODE         "Exon 4"        "ENST00000421033"
//NODECOLORCODE         "Exon 5"        "ENST00000421033"
//NODECOLORCODE         "Exon 5"        "ENST00000440291"

File 2

"Exon 1"        "#FF0000"
"Exon 2"        "#FFD400"
"Exon 3"        "#FFff00"
"Exon 4"        "#2Bff00"
"Exon 5"        "#00ff00"

Desired output

//NODECOLORCODE         "Exon 1"        "ENST00000595813"        "#FF0000"
//NODECOLORCODE         "Exon 1"        "ENST00000596051"        "#FF0000"
//NODECOLORCODE         "Exon 2"        "ENST00000243643"        "#FFD400"
//NODECOLORCODE         "Exon 2"        "ENST00000421033"        "#FFD400"
//NODECOLORCODE         "Exon 3"        "ENST00000243643"        "#FFff00"
//NODECOLORCODE         "Exon 3"        "ENST00000421033"        "#FFff00"
//NODECOLORCODE         "Exon 4"        "ENST00000243643"        "#2Bff00"
//NODECOLORCODE         "Exon 4"        "ENST00000421033"        "#2Bff00"
//NODECOLORCODE         "Exon 5"        "ENST00000421033"        "#00ff00"
//NODECOLORCODE         "Exon 5"        "ENST00000440291"        "#00ff00"

I try using awk and it doesn't come out with the desired output I want. Thanks

awk 'FNR==NR {C[$2]=$1;next} FNR==1 {print "NODECLASSCOLOR EXON TRANSCRIPT_ID COLORCODE"; next} $1 in C {print C[$1], $1, $4, $2, $3}' 

Are those tab separated? If not, you'll have to pretend they're quote separated and use some tricks to properly grab these fields.

Assuming not tab separated:

$ awk -F\" -v OFS='\t' 'FNR==NR{C[$2]=$4;next}FNR==1{print "NODECLASSCOLOR EXON TRANSCRIPT_ID COLORCODE"; next} 1 {print $1,$2,$4,C[$2]}' file2 file1
NODECLASSCOLOR EXON TRANSCRIPT_ID COLORCODE
//NODECOLORCODE                 Exon 1  ENST00000596051 #FF0000
//NODECOLORCODE                 Exon 2  ENST00000243643 #FFD400
//NODECOLORCODE                 Exon 2  ENST00000421033 #FFD400
//NODECOLORCODE                 Exon 3  ENST00000243643 #FFff00
//NODECOLORCODE                 Exon 3  ENST00000421033 #FFff00
//NODECOLORCODE                 Exon 4  ENST00000243643 #2Bff00
//NODECOLORCODE                 Exon 4  ENST00000421033 #2Bff00
//NODECOLORCODE                 Exon 5  ENST00000421033 #00ff00
//NODECOLORCODE                 Exon 5  ENST00000440291 #00ff00

Hello Wan Fahmi,

Following may also help you in same.

awk 'FNR==NR{B=$NF;$NF="";gsub(/[[:space:]]+$/,X,$0);A[$0]=B;next} ($2" "$3 in A){print $0 OFS A[$2" "$3]}' File2 OFS="\t" File1

Thanks,
R. Singh

---------- Post updated at 11:59 AM ---------- Previous update was at 11:56 AM ----------

Hello neutronscott,

OP needs to get quotes too in output so I am just adding a few addition to it, hope it will help us all.

awk -F\" -v OFS='\t' 'FNR==NR{C[$2]=$4;next}FNR==1{print "NODECLASSCOLOR EXON TRANSCRIPT_ID COLORCODE"; next} 1 {print $1,FS $2 FS,FS $4 FS ,FS C[$2] FS}' File2 File1

Thanks,
R. Singh

Check this following code:

# cat MergeInfo.sh
#!/bin/ksh

[[ $# -ne 2 ]] && {
        echo "Usage:  $0 <File1> <File2>" 1>&2
        exit 1;
}

[ -e $1 ] || {
        echo "$1 does not exists!!!!!." 1>&2
        exit 1;
}

[ -e $2 ] || {
        echo "$2 does not exists!!!!!." 1>&2
        exit 1;
}

awk '{ if (( NF != "4" )) { print FILENAME " - Incorrect file format!!!!!."; exit 1; } }' $1

[[ `echo $?` -ne "0" ]] && {
        exit 1;
}

awk '{ if (( NF != "3" )) { print FILENAME " - Incorrect file format!!!!!."; exit 1; } }' $2

[[ `echo $?` -ne "0" ]] && {
        exit 1;
}

cat $1 | while read COLUMN1 COLUMN2 COLUMN3 COLUMN4;
do
        CODE=`grep -w ${COLUMN2} File2 | grep -w ${COLUMN3} | awk '{print $NF}'`;
        echo "${COLUMN1}  ${COLUMN2} ${COLUMN3}  ${COLUMN4}  ${CODE}";
done

If your awk allows for multichar field separators, try

awk 'FNR==NR {T[$1]=$2; next} {print $0, T[$2]}' 'FS=  +' file2 file1
//NODECOLORCODE         "Exon 1"        "ENST00000595813" "#FF0000"
//NODECOLORCODE         "Exon 1"        "ENST00000596051" "#FF0000"
//NODECOLORCODE         "Exon 2"        "ENST00000243643" "#FFD400"
//NODECOLORCODE         "Exon 2"        "ENST00000421033" "#FFD400"
//NODECOLORCODE         "Exon 3"        "ENST00000243643" "#FFff00"
//NODECOLORCODE         "Exon 3"        "ENST00000421033" "#FFff00"
//NODECOLORCODE         "Exon 4"        "ENST00000243643" "#2Bff00"
//NODECOLORCODE         "Exon 4"        "ENST00000421033" "#2Bff00"
//NODECOLORCODE         "Exon 5"        "ENST00000421033" "#00ff00"
//NODECOLORCODE         "Exon 5"        "ENST00000440291" "#00ff00"