Search ID from second file and append to first

Hello,

I want to search for a string/substring taken from the second column of one file in another file, and append the first matching record from the second file to the end of the corresponding record in the first file. Both files are tab delimited.

Lines with KOG in column 13 do not need to be searched, as the string will not be found there.

Here is the logic in my head, which needs to be translated into code.

For all lines that do not contain the keyword 'KOG' in column 13 of file1:

Extract the substring to search for: when the second column has a value starting with sp| or tr|, take the middle part. For example, when the value is sp|P32770|NRP1_YEAST, the string to be searched is P32770; when the value is tr|N1PNC6|N1PNC6_MYCP1, the string to be searched is N1PNC6.

If the second column does not start with sp| or tr| and instead has values like NP_001059837 or AEW46684, then the entire string needs to be searched.
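Spelled out in awk, that extraction rule might look roughly like this (just a sketch, assuming tab-separated input and that the wanted accession is always the second |-delimited part):

# Sketch of the extraction rule: take the middle part of sp|...|... and
# tr|...|... values, otherwise use the whole value of column 2.
awk -F'\t' '{
    key = $2
    if (key ~ /^(sp|tr)[|]/) {     # e.g. sp|P32770|NRP1_YEAST -> P32770
        split(key, a, "|")
        key = a[2]
    }                              # e.g. NP_001059837 stays as it is
    print key
}' file1_samp.txt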

If the searched string is found, append the entire first matching line of file2 to the end of the corresponding record in file1, after a tab.

A lot of the second-column values repeat, so it would be good if the search for the same value is done only once. For example, AEW46684 occurs in the second column of file1 119 times, so searching for it just once should save a lot of computation with these huge files.
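One quick way to see how much a one-time search per value would save is to count the distinct keys first (again only a sketch, assuming the KOG marker is in column 13 as described above):

# Count how many lookups are actually needed if every distinct key is
# searched only once (KOG lines in column 13 excluded).
awk -F'\t' '$13 !~ /KOG/ {
    k = $2
    if (k ~ /^(sp|tr)[|]/) { split(k, a, "|"); k = a[2] }
    if (!(k in seen)) { seen[k] = 1; distinct++ }
    total++
}
END { printf "%d distinct keys in %d searchable lines\n", distinct, total }' file1_samp.txt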

Since there are many columns, I have attached sample inputs and output.

What have you tried?

awk '$13!~KOG && FNR == NR { OFS="\t"; f[++j] = $0; next } { for(i = 1; i <= j; i++) if(index(f[i], $2)) { $15 = f[i] ; break } print}'  file2_samp.txt file1_samp.txt

This is what I have tried, but it's not working, and it lacks the substring extraction capability.

It's not very clean, but try this.
You can optimize it to some extent depending on whichever file you think might be bigger.
The split() on "|" is the substring extraction logic you were asking for.

awk 'NR==FNR{
    # file1: work out the search key for each record
    if($13 ~ /KOG/){
      k=""                              # KOG lines are never searched
    }else if($2 ~ /^(sp|tr)[|]/){
      split($2, arr, "|")
      k=arr[2]                          # sp|P32770|NRP1_YEAST -> P32770
    }
    else k=$2
    data[++i]=$0; key[i]=k
    next
  }
  {
    # file2: append this line to every file1 record whose key it contains
    for(j=1;j<=i;j++)
      if(key[j] && match($0,key[j]))
        data[j]=data[j]"\t"$0
  }
  END{
    for(j=1;j<=i;j++) print data[j]"\n\n"
  }
' file1_samp.txt file2_samp.txt > result1.txt

--ahamed


ahamed, the code has been running for 6 hrs without any output so far; file1 is 29 MB and file2 is 8 GB. Is there a way to speed things up? Also, I think searching for the same string just once will save a lot of time.
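For files this size, any approach that re-scans the stored file2 lines (or greps the 8 GB file) for every line of file1 will be slow. One layout that reads each file exactly once is to load file1 and its unique keys into memory first, then stream file2 and keep only the first line that contains each key. A rough sketch along those lines, assuming the key can occur anywhere in a file2 line and the KOG marker sits in column 13 of file1:

# Sketch only: file1 (29 MB) is held in memory together with its unique
# search keys; file2 (8 GB) is streamed once and the first line containing
# each key is remembered, then appended after a tab when file1 is printed.
awk -F'\t' '
NR == FNR {                                  # first pass: file1
    line[FNR] = $0; n = FNR
    if ($13 ~ /KOG/) next                    # KOG lines never need a lookup
    k = $2
    if (k ~ /^(sp|tr)[|]/) { split(k, a, "|"); k = a[2] }
    keyof[FNR] = k
    if (!(k in hit)) hit[k] = ""             # register each distinct key once
    next
}
{                                            # second pass: file2, read once
    for (k in hit)
        if (hit[k] == "" && index($0, k))    # keep only the first match per key
            hit[k] = $0
}
END {
    for (i = 1; i <= n; i++) {
        k = keyof[i]
        if (k != "" && hit[k] != "") print line[i] "\t" hit[k]
        else print line[i]
    }
}' file1_samp.txt file2_samp.txt > result1.txt

Each still-unmatched key is tested against every file2 line until it is found, so the run time depends on how many distinct keys there are, but both files are read only once and repeated values cost nothing extra.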

Still no output. Help, please?

Did you check result1.txt?
Anyway, try this:

#!/bin/bash

# Cache of already-searched keys and the first matching file2 line for each
data[0]=""
key[0]=""
count=0

search_add()
{
  inkey=$1; indata=$2; action=$3
  if [ "$action" == ADD ]; then
    # remember the search result for this key
    key[$count]=$inkey
    data[$count]=$indata
    ((count+=1))
    return 0
  elif [ "$action" == SEARCH ]; then
    # return 0 and set $found if this key was searched before
    found=""
    for((i=0;i<$count;i++))
    do
      if [ "${key[$i]}" == "$inkey" ]
      then
        found=${data[$i]}
        return 0
      fi
    done
  fi
  return 1
}

while read first sec remaining
do
  # strip the sp|/tr| wrapper: sp|P32770|NRP1_YEAST -> P32770
  pat=${sec#*|}; pat=${pat%|*}
  search_add "$pat" "" SEARCH
  if [ $? -ne 0 ]; then
    # not seen before: search file2 once and cache the result
    found=$( grep -m1 "$pat" file2_samp.txt )
    search_add "$pat" "$found" ADD
  fi
  echo -e "$first $sec $remaining\t$found\n" >> result1.txt

done < file1_samp.txt

--ahamed


Try testing with way smaller but consistent sample files.

RudiC, I did test with smaller files and it works fine.

Ahamed, the second code has started producing appropriate output; the first code still hasn't.
Thank you again, I will wait it out. It should be done in 5-6 days, going by the amount of output.