Search a column and return a set of words

Hi

I have two files. One is a text file consisting of sentences, INPUT.txt, and the second is SEARCH.txt, consisting of two or three columns. I need help writing a script that searches the second column of SEARCH.txt for each set of five words (shown in blue for set one, green for set two, red for set three, and so on) of each sentence from INPUT.txt. The search condition is to find a set of five words in the second column of SEARCH.txt that matches at least four words of the five-word set from the input sentence, and to return the matching set from SEARCH.txt whose corresponding value in the first column is the smallest (e.g. -2.927181 is smaller than -2.922845). The search is to be carried out for each set of five words. If fewer than five words remain in the sentence, the search must stop. Assume that the columns of SEARCH.txt are separated by tabs.

Format of INPUT.txt file.

hai wafam cherol makha palli adubu madu ma yaakhidre haikhre tamlakle .
mahak aroiba yaahip tankhi hai machagi matamda saramba gatetu kaikhere mahakkisu aroiba yaahip tankhi hai  haikhre .

Format of SEARCH.txt file.

-0.9725326      arna thamlamba nongchup santhong gani -0.014587925
-0.9777407      tainaba amanba yamna uningdraba  -0.014587925
-0.9700631      aeroplane adu indira parktara ama     -0.014587925
-1.2438936      mahakki aroiba yaahip tankhi hai -0.014587925
-0.97742474     aroiba yaahip tankhi hai hairi    -0.014587925
-1.391722       hai wafam cherolna makha palli     -0.6328273
-2.922845       hai wafam cherolduna makha palli -0.1190167
-2.915667       hai wafam cherolsina makha palli  -0.5702463
-2.927181       hai wafam paochena makha palli  -0.1963889
-2.925497       hai wafam khangnaduna   -0.6328273
-2.855543       hai wafam ngasigi 
-2.926619       hai wafam thamkharabani
-1.635051       hai wafam thamlamle    -0.4567362
-1.078001       hai wafam thamlamli    -0.8960688
-1.023442       adubu madu makhada yaakhidre haikhre -0.1234433
-1.432234       adubu madu makha yaakhidre haikhre  -0.5432345
-1.1278934      changangei air fieldda hongdok pikhraga   -0.014587925
-0.9567379      nupa machagi matamda saramba gatetu     -0.014587925
-0.5984392      machagi matamda saramba gatetu kaire       -0.014587925
-1.250842       leiriba aduda santri khara thamkhre        -0.014587925

The expected format of OUTPUT.txt is given below.

hai wafam paochena makha palli adubu madu makha yaakhidre haikhre tamlakle.
mahakki aroiba yaahip tankhi hai nupa machagi matamda saramba gatetu mahakki aroiba yaahip tankhi hai haikhre

Thanks in advance :).

Try

awk  '

NR==FNR {if (5 == split ($2, T, " ")) PAT[$2]=$1
         next
        }

        {for (j=0; j<NF; j+=5)  {TMP = ""
                                 MIN = 1E100
                                 for (p in PAT) {CNT=0
                                                 split(p, X, " ")
                                                 for (i=1+j; i<=5+j; i++)
                                                    for (k=1; k<=5; k++) if ($i == X[k]) CNT++
                                                 if (CNT >= 4 && PAT[p] < MIN)  {MIN=PAT[p]
                                                                                 TMP=p
                                                                                }
                                                }
                                 if (TMP)        printf "%s ", TMP
                                 else            printf "%s %s %s %s %s ", $(j+1), $(j+2), $(j+3), $(j+4), $(j+5)
                                }
         printf "\n"
        }
' FS="\t" OFS="\t" SEARCH.txt  FS=" " INPUT.txt
hai wafam paochena makha palli adubu madu makha yaakhidre haikhre tamlakle .    
mahakki aroiba yaahip tankhi hai nupa machagi matamda saramba gatetu mahakki aroiba yaahip tankhi hai haikhre .    
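The script counts a match whenever a word of the input group equals a word of the candidate phrase, regardless of position. The sketch below isolates that counting step so it can be tried on single pairs; `count_common` is a hypothetical helper name, not part of the script above.

```shell
# Hypothetical helper: $1 and $2 are space-separated word groups.
# Counts word pairs that are equal, ignoring position, as the script's
# inner i/k loops do.
count_common() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, A, " ")
        nb = split(b, B, " ")
        for (i = 1; i <= na; i++)
            for (k = 1; k <= nb; k++)
                if (A[i] == B[k]) cnt++
        print cnt + 0
    }'
}

count_common "mahak aroiba yaahip tankhi hai" "aroiba yaahip tankhi hai hairi"    # prints 4
count_common "mahak aroiba yaahip tankhi hai" "mahakki aroiba yaahip tankhi hai"  # prints 4
```

Both phrases reach the threshold of four, so the one with the smaller first-column value (-1.2438936, "mahakki aroiba yaahip tankhi hai") is the one that appears in the output.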

Hi

I tried running this awk script. It worked fine for a small SEARCH.txt. But with a large SEARCH.txt of 10 million lines (tuples), I am unable to get any output. Please advise me on how to go ahead. Thanks in advance :)
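One possible reason for the slowdown is that the script above walks through every stored phrase, and splits it again, for every five-word group of every sentence. A minimal sketch of one way to reduce that work, assuming the search data still fits in memory, is an inverted index from each word to the phrases containing it, so that only phrases sharing at least one word with the group are counted. The function name `best_match` and the array names are hypothetical; this is not the original poster's method and has not been tried on a 10-million-line file.

```shell
# Hypothetical helper: $1 is the tab-separated search file,
# the sentences arrive on standard input.
best_match() {
    awk '
    NR == FNR {                                   # first operand: the search file
        if (split($2, W, " ") == 5) {
            n++
            PHRASE[n] = $2
            SCORE[n]  = $1 + 0                    # force numeric comparison later
            for (k = 1; k <= 5; k++)              # inverted index: word -> phrase ids
                IDX[W[k]] = IDX[W[k]] " " n
        }
        next
    }
    {
        out = ""
        for (j = 0; j + 5 <= NF; j += 5) {        # complete five-word groups only
            split("", CNT)                        # portable way to clear an array
            for (i = j + 1; i <= j + 5; i++)
                if ($i in IDX) {
                    m = split(IDX[$i], ID, " ")
                    for (c = 1; c <= m; c++) CNT[ID[c]]++
                }
            best = ""
            for (id in CNT)
                if (CNT[id] >= 4 && (best == "" || SCORE[id] < SCORE[best]))
                    best = id
            if (best != "") {
                out = out PHRASE[best] " "
            } else {
                for (i = j + 1; i <= j + 5; i++) out = out $i " "
            }
        }
        for (i = j + 1; i <= NF; i++) out = out $i " "   # leftover words, unchanged
        sub(/ $/, "", out)
        print out
    }' FS='\t' "$1" FS=' ' -
}

# Usage with a few rows taken from the SEARCH.txt sample above.
search_tmp=$(mktemp)
printf '%s\t%s\t%s\n' \
    '-1.391722' 'hai wafam cherolna makha palli'       '-0.6328273' \
    '-2.927181' 'hai wafam paochena makha palli'       '-0.1963889' \
    '-1.023442' 'adubu madu makhada yaakhidre haikhre' '-0.1234433' \
    '-1.432234' 'adubu madu makha yaakhidre haikhre'   '-0.5432345' > "$search_tmp"
printf '%s\n' 'hai wafam cherol makha palli adubu madu ma yaakhidre haikhre tamlakle .' |
    best_match "$search_tmp"
rm -f "$search_tmp"
```

The per-phrase count is the same position-independent count as in the original script. Two differences are deliberate: a trailing group of fewer than five words is copied unchanged instead of being searched, and repeated phrases are kept as separate rows rather than overwriting each other. Very frequent words still produce long candidate lists, so the gain depends on the data.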

Hi

I need help writing the regular expression for the case where the column separator between the first and second columns takes one of two forms:

(1) one blank space followed by a tab, in some cases, and
(2) a tab followed by one blank space, in other cases.

Thanks in advance :)

How about adding sub (/ \t|\t /, "\t"); before splitting $2 in SEARCH.txt? (The two alternatives in the regular expression are <space><TAB> and <TAB><space>.)


Hi

I did

awk ' NR==FNR  {sub (/| /, "\t");  if (5 == split ($2, T, " ")) PAT[$2]=$1
         next
}

Please correct me if I am wrong.

Did you use <space><TAB>|<TAB><space> in the sub call?

That said, I found that although this replaces your field separator patterns with single <TAB>s and fixes fields 1 and/or 2 by removing the spaces, it doesn't change the operation of the script dramatically.
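For reference, a small sketch of the substitution with the tabs written as `\t` escapes, which avoids typing literal tab characters inside the regular expression. The function name `normalize_sep` is hypothetical and the two sample rows are adapted from SEARCH.txt.

```shell
# Hypothetical helper: reads rows on standard input and replaces the first
# <space><TAB> or <TAB><space> on each line with a single <TAB>.
normalize_sep() {
    awk '{ sub(/ \t|\t /, "\t"); print }'
}

# Row 1 uses <space><TAB> after column 1, row 2 uses <TAB><space>.
{
    printf '%s \t%s\t%s\n' '-1.391722' 'hai wafam cherolna makha palli' '-0.6328273'
    printf '%s\t %s\t%s\n' '-2.927181' 'hai wafam paochena makha palli' '-0.1963889'
} | normalize_sep | awk -F'\t' '{ print $1 "|" $2 "|" $3 }'
# -1.391722|hai wafam cherolna makha palli|-0.6328273
# -2.927181|hai wafam paochena makha palli|-0.1963889
```

Since `sub` changes only the first match on a line, a row whose first separator is already a plain tab will instead have a later <space><TAB> or <TAB><space> normalized, if one exists.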
