Search ID from second file and append to first

Hello,

I want to search for a string/substring taken from the second column of one file in another file, and append the first matching record from the second file to the end of the corresponding record in the first file. Both files are tab delimited.

Lines with KOG in column 13 do not need to be searched, as the string will not be found there.

Here is the logic in my head, which needs to be translated into code.

For all lines that do not contain the keyword 'KOG' in column 13 of file1:

Extract the substring to search for: when the second column has a value starting with sp| or tr|, take the middle part. For example, when the value is sp|P32770|NRP1_YEAST, the string to be searched is P32770; when the value is tr|N1PNC6|N1PNC6_MYCP1, the string to be searched is N1PNC6.

If the second column does not start with sp| or tr| and instead has values like NP_001059837 or AEW46684, then the entire string needs to be searched.
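Spelled out in awk, that extraction rule might look roughly like this (just a sketch, assuming tab-separated input and that the wanted accession is always the second |-delimited part):

# Sketch of the extraction rule: take the middle part of sp|...|... and
# tr|...|... values, otherwise use the whole value of column 2.
awk -F'\t' '{
    key = $2
    if (key ~ /^(sp|tr)[|]/) {     # e.g. sp|P32770|NRP1_YEAST -> P32770
        split(key, a, "|")
        key = a[2]
    }                              # e.g. NP_001059837 stays as it is
    print key
}' file1_samp.txt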

If the searched string is found, append the entire first matching line of file2 to the end of the corresponding record in file1, after a tab.

A lot of the second-column values repeat, so it would be good if the search for the same value is done only once. For example, AEW46684 occurs in the second column of file1 119 times, so searching for it just once should save a lot of computation with these huge files.
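One quick way to see how much a one-time search per value would save is to count the distinct keys first (again only a sketch, assuming the KOG marker is in column 13 as described above):

# Count how many lookups are actually needed if every distinct key is
# searched only once (KOG lines in column 13 excluded).
awk -F'\t' '$13 !~ /KOG/ {
    k = $2
    if (k ~ /^(sp|tr)[|]/) { split(k, a, "|"); k = a[2] }
    if (!(k in seen)) { seen[k] = 1; distinct++ }
    total++
}
END { printf "%d distinct keys in %d searchable lines\n", distinct, total }' file1_samp.txt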

Since there are many columns, I have attached sample inputs and output.

What have you tried?

awk '$13!~KOG && FNR == NR { OFS="\t"; f[++j] = $0; next } { for(i = 1; i <= j; i++) if(index(f[i], $2)) { $15 = f[i] ; break } print}'  file2_samp.txt file1_samp.txt

This is what I have tried, but it's not working, and it lacks the substring extraction capability.

It's not very clean, but try this.
You can optimize it to some extent depending on whichever file you think might be bigger.
The split() on "|" is the substring extraction logic you were asking for.

awk 'NR==FNR{
    # file1: work out the search key for each record
    if($13 ~ /KOG/){
      k=""                              # KOG lines are never searched
    }else if($2 ~ /^(sp|tr)[|]/){
      split($2, arr, "|")
      k=arr[2]                          # sp|P32770|NRP1_YEAST -> P32770
    }
    else k=$2
    data[++i]=$0; key[i]=k
    next
  }
  {
    # file2: append this line to every file1 record whose key it contains
    for(j=1;j<=i;j++)
      if(key[j] && match($0,key[j]))
        data[j]=data[j]"\t"$0
  }
  END{
    for(j=1;j<=i;j++) print data[j]"\n\n"
  }
' file1_samp.txt file2_samp.txt > result1.txt

--ahamed


ahamed, the code has been running for 6 hrs without any output so far; file1 is 29 MB and file2 is 8 GB. Is there a way to speed things up? Also, I think searching for the same string just once will save a lot of time.
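For files this size, any approach that re-scans the stored file2 lines (or greps the 8 GB file) for every line of file1 will be slow. One layout that reads each file exactly once is to load file1 and its unique keys into memory first, then stream file2 and keep only the first line that contains each key. A rough sketch along those lines, assuming the key can occur anywhere in a file2 line and the KOG marker sits in column 13 of file1:

# Sketch only: file1 (29 MB) is held in memory together with its unique
# search keys; file2 (8 GB) is streamed once and the first line containing
# each key is remembered, then appended after a tab when file1 is printed.
awk -F'\t' '
NR == FNR {                                  # first pass: file1
    line[FNR] = $0; n = FNR
    if ($13 ~ /KOG/) next                    # KOG lines never need a lookup
    k = $2
    if (k ~ /^(sp|tr)[|]/) { split(k, a, "|"); k = a[2] }
    keyof[FNR] = k
    if (!(k in hit)) hit[k] = ""             # register each distinct key once
    next
}
{                                            # second pass: file2, read once
    for (k in hit)
        if (hit[k] == "" && index($0, k))    # keep only the first match per key
            hit[k] = $0
}
END {
    for (i = 1; i <= n; i++) {
        k = keyof[i]
        if (k != "" && hit[k] != "") print line[i] "\t" hit[k]
        else print line[i]
    }
}' file1_samp.txt file2_samp.txt > result1.txt

Each still-unmatched key is tested against every file2 line until it is found, so the run time depends on how many distinct keys there are, but both files are read only once and repeated values cost nothing extra.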

Still no output. Help, please?

Did you check result1.txt?
Anyway, try this:

#!/bin/bash

# Cache of already-searched keys and the first matching file2 line for each
data[0]=""
key[0]=""
count=0

search_add()
{
  inkey=$1; indata=$2; action=$3
  if [ "$action" == ADD ]; then
    # remember the search result for this key
    key[$count]=$inkey
    data[$count]=$indata
    ((count+=1))
    return 0
  elif [ "$action" == SEARCH ]; then
    # return 0 and set $found if this key was searched before
    found=""
    for((i=0;i<$count;i++))
    do
      if [ "${key[$i]}" == "$inkey" ]
      then
        found=${data[$i]}
        return 0
      fi
    done
  fi
  return 1
}

while read first sec remaining
do
  # strip the sp|/tr| wrapper: sp|P32770|NRP1_YEAST -> P32770
  pat=${sec#*|}; pat=${pat%|*}
  search_add "$pat" "" SEARCH
  if [ $? -ne 0 ]; then
    # not seen before: search file2 once and cache the result
    found=$( grep -m1 "$pat" file2_samp.txt )
    search_add "$pat" "$found" ADD
  fi
  echo -e "$first $sec $remaining\t$found\n" >> result1.txt

done < file1_samp.txt

--ahamed


Try testing with way smaller but consistent sample files.

RudiC, I did test with smaller files and it works fine.

Ahamed, the second code has started producing appropriate output; the first code still hasn't.
Thank you again, I will wait it out. It should be done in 5-6 days, going by the amount of output.