Search for text in a file and return the matching text

Hi

I have a set of input strings in the pattern given below:

string1 string2 string3 string4 string5

I need to search a file for this sequence of strings such that the first two strings (string1 and string2) and the last two strings (string4 and string5) match the strings in the SECOND column of a three-column text file. The matching lines are then compared by the numbers in the FIRST and THIRD columns.

So, the script will search a big text file called TEXTFILE.TXT for the entries that match string1, string2, string4 and string5. Then, it will return the string that has the biggest number (a negative number) in the FIRST column and the string that has the biggest negative number in the THIRD column.

A sample of the TEXTFILE.TXT file format is given below. The FIRST and SECOND columns are separated by a tab, and the SECOND and THIRD columns are separated by a tab and a space. The strings in the SECOND column are separated by a space. Each entry in the SECOND column may consist of anywhere from a single string up to five strings.

For example, my input is

string1  string2  string3  string4  string5
hai      wafam    cherol   makha    palli

There are four entries in the text file that match the input, so the output will be these two string sets:

hai wafam cherolna makha palli
hai wafam cherolduna makha palli

File format of TEXTFILE.TXT

-1.391722       hai wafam cherolna makha palli     -0.6328273
-2.922845       hai wafam cherolduna makha palli -0.1190167
-2.915667       hai wafam cherolsina makha palli  -0.5702463
-2.927181       hai wafam paochena makha palli  -0.1963889
-2.925497       hai wafam khangnaduna   -0.6328273
-2.855543       hai wafam ngasigi 
-2.926619       hai wafam thamkharabani
-1.635051       hai wafam thamlamle    -0.4567362
-1.078001       hai wafam thamlamli    -0.8960688
-1.023442       adubu madu makhada yaakhidre haikhre -0.1234433
-1.432234       adubu madu ma yaakhidre haikhre  -0.5432345

I need help writing a script to perform the above task. Thanks in advance.

What operating system and shell are you using?

What tools do you want to use? (Does your forum user name mean you only want to use Perl?)

What have you tried? Please show us what you have tried (in CODE tags). If you had shown us what you have tried, some of the questions below would already have been answered.

You say that the 2nd and 3rd fields in your file are separated by a tab and a space, and that the 1st and 2nd fields are separated by a tab. But there aren't any tabs in your file, and even if you had converted your tabs to spaces when you pasted your sample TEXTFILE.TXT here, the 3rd field is not aligned as it would be if there had been a tab between fields.
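One quick way to check whether a file really contains tabs is to count the lines that split into more than one field when tab is the field separator. This is only a minimal sketch: the helper name count_tab_lines and the two demo lines are made up for illustration, one written with real tabs and one with spaces only.

```shell
# Hypothetical helper: print the number of lines containing at least one tab.
count_tab_lines() {
  awk -F'\t' 'NF > 1 { n++ } END { print n + 0 }' "$1"
}

# Demo file: the first line uses real tabs, the second uses only spaces.
demo=$(mktemp)
printf '%s\t%s\t %s\n' -1.391722 'hai wafam cherolna makha palli' -0.6328273 > "$demo"
printf '%s\n' '-2.922845       hai wafam cherolduna makha palli -0.1190167' >> "$demo"

count_tab_lines "$demo"
```

For this demo file the command prints 1, because only the first line contains tabs. A result of 0 on the real file would mean that a tab-based field separator cannot work on it.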

How are the five strings found by your code? Are they operands passed to your script? Are they in another file? If so, what is the format of that file? Can any of the strings contain spaces? Can any of the strings contain any characters that are special in an extended regular expression, or do the strings just consist of alphanumeric characters?

You say that string1 and string2 and string4 and string5 should match field 2 in your file. Do they need to be in sequence within field 2 in the file (as they are in your example), or does each of those four strings just have to appear somewhere within field 2? Is string 3 also supposed to match (as it does in your example), or is your code just supposed to ignore string 3? Can the matches overlap, or does each of the four (or five) strings need to match unique substrings in field 2 (as they do in your example)?

Does the above quote mean that you want the two matching fields that match all four strings and then from all of the fields matching that criteria choose the single matching field that also has the most negative value in field 1 and choose the single matching field that also has the most negative value in field 3?

Or, are you looking for the field matching string 1 and string 2 that has the most negative value in field 1 and for the field matching string 4 and string 5 that has the most negative value in field 3?

Are lines with non-negative values in fields 1 and/or 3 supposed to be ignored when matching the associated strings?

Why does

-2.927181       hai wafam paochena makha palli  -0.1963889 

not show up in your sample output? It has the most negative value of the four matches in the first field.
In what order should the output lines appear?

Yes,

-1.391722 hai wafam cherolna makha palli -0.6328273
-2.927181 hai wafam paochena makha palli -0.1963889

We assume that the negative number -1.391722 is bigger than -2.927181.

Thanks a lot.

Try

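awk -v S1="$string1" -v S2="$string2" -v S4="$string4" -v S5="$string5" -F'\t' '
# Start both maxima far below any value expected in the file.
BEGIN   {MX1 = MX3 = -1E100}
# Field 2 must start with S1 S2 and end with S4 S5; whatever lies between is ignored.
$2 ~ "^" S1 " " S2 ".*" S4 " " S5 "$"   {
        if ($1 > MX1) {MX1 = $1; T1 = $2}       # largest value so far in field 1
        if ($3 > MX3) {MX3 = $3; T3 = $2}       # largest value so far in field 3
}
END     {print T1
         print T3}
' file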
hai wafam cherolna makha palli
hai wafam cherolduna makha palli
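To try this without the original data, the search can be wrapped in a function and run against a small tab-separated sample. This is a sketch under assumptions: the function name find_best and the temporary sample file are hypothetical, the sample rows are taken from the post but written with real tabs (tab between fields 1 and 2, tab plus space between fields 2 and 3), and `$1 + 0` is used to force a numeric comparison.

```shell
# Hypothetical wrapper: find_best FILE S1 S2 S4 S5
find_best() {
  awk -F'\t' -v S1="$2" -v S2="$3" -v S4="$4" -v S5="$5" '
    BEGIN { pat = "^" S1 " " S2 ".*" S4 " " S5 "$" }
    $2 ~ pat {
      # Keep the entry with the largest (closest to zero) value per column.
      if (!seen || $1 + 0 > max1) { max1 = $1 + 0; t1 = $2 }
      if (!seen || $3 + 0 > max3) { max3 = $3 + 0; t3 = $2 }
      seen = 1
    }
    END { if (seen) { print t1; print t3 } }
  ' "$1"
}

# Sample data with real tabs, taken from the rows shown in the question.
sample=$(mktemp)
printf '%s\t%s\t %s\n' \
  -1.391722 'hai wafam cherolna makha palli'   -0.6328273 \
  -2.922845 'hai wafam cherolduna makha palli' -0.1190167 \
  -2.915667 'hai wafam cherolsina makha palli' -0.5702463 \
  -2.927181 'hai wafam paochena makha palli'   -0.1963889 \
  -1.078001 'hai wafam thamlamli'              -0.8960688 > "$sample"

find_best "$sample" hai wafam makha palli
```

On this sample the call prints "hai wafam cherolna makha palli" followed by "hai wafam cherolduna makha palli". Unlike the one-liner, the sketch prints nothing when no line matches.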

Hi

I was trying to change the script so that it reads the input sentences one after another from a text file called INPUT.txt and performs the above search operation on each.
The INPUT.txt file format is:

hai wafam cherol makha palli adubu madu makha yaakhidre haikhre tamlakle.
hairiba waridu cherol makhada pallina adubu madu makha yaakhidre haikhre tamlaklenasu hairi.
adubu madui saruk amuk hanna khannanaba wafam thangaatkhre hairi.
.....
.....

I need help with this part: if no match is found at all in the first search operation, continue searching from the second string (string2) to the sixth string (string6) of the sentence, considering five strings at a time. If a match is found and retrieved, I want to modify the script so that the same search operation is repeated for the next set of five strings, starting from the sixth string, since the first five input strings (string1, string2, string3, string4 and string5) are already done. The search operation continues in this way until the end of the sentence.

Thanks in advance.

You lost me. Please explain in smaller steps. What I understand is instead of five strings

string1  string2  string3  string4  string5
hai      wafam    cherol   makha    palli

and dropping the third you'll have "sentences" consisting of between 9 and 12 strings, of which none should be dropped for the comparison, and you want to compare 1 to 5, if no match then 2 to 6, and, if either matches, compare 6 to 10. What if there are only 9 strings in a sentence? If there are more, what should be done with string11 and string12?
And what should be printed for either match?

In the case of only 9 strings, the search will cover strings 1 to 5 and 5 to 9, so that each search and match uses five strings. In the case of 11 or 12 strings, the search will cover the 7th to 11th and the 8th to 12th strings, so that five strings are used for the match in all cases. Thanks a lot.

Consider the following sentence as input:

hai wafam cherol makha palli adubu madu makha yaakhidre haikhre tamlakle.

The expected output will be

hai wafam cherolna makha palli adubu madu makhada yaakhidre haikhre tamlakle.

The last word [tamlakle] is appended unchanged at the end, on the assumption that searching the last five words will not be beneficial, since the third word has already been replaced both for words 1-5 and for words 6-10. Please let me know if I missed anything.

I use Ubuntu 14.04 LTS and bash shell.
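One possible reading of the sliding search described above can be sketched in awk, which is available on Ubuntu with bash. Everything here is an assumption rather than a confirmed requirement: the function name rewrite_sentences is hypothetical, only five-word entries in field 2 are considered, the entry with the largest field 1 is chosen when several match, a matched window advances by five words, an unmatched word is copied and the window advances by one, and trailing words that cannot fill a window are copied unchanged. The overlapping 5-to-9 window mentioned for nine-word sentences is not handled.

```shell
# Hypothetical sketch: rewrite_sentences TEXTFILE INPUT
rewrite_sentences() {
  awk -F'\t' '
    # Pass 1 (TEXTFILE): for every five-word entry, keep the one with the
    # largest field 1 under the key word1, word2, word4, word5.
    NR == FNR {
      if (split($2, p, " ") == 5) {
        key = p[1] SUBSEP p[2] SUBSEP p[4] SUBSEP p[5]
        if (!(key in best) || $1 + 0 > score[key]) {
          score[key] = $1 + 0
          best[key] = p[1] " " p[2] " " p[3] " " p[4] " " p[5]
        }
      }
      next
    }
    # Pass 2 (INPUT): slide a five-word window over each sentence.
    {
      n = split($0, w, " ")
      out = ""
      i = 1
      while (i <= n) {
        piece = w[i]; step = 1
        if (i + 4 <= n) {
          key = w[i] SUBSEP w[i+1] SUBSEP w[i+3] SUBSEP w[i+4]
          if (key in best) { piece = best[key]; step = 5 }
        }
        out = (out == "" ? piece : out " " piece)
        i += step
      }
      print out
    }
  ' "$1" "$2"
}

# Demo with rows from the question (real tabs) and the sample sentence.
data=$(mktemp); input=$(mktemp)
printf '%s\t%s\t %s\n' \
  -1.391722 'hai wafam cherolna makha palli'       -0.6328273 \
  -2.922845 'hai wafam cherolduna makha palli'     -0.1190167 \
  -1.023442 'adubu madu makhada yaakhidre haikhre' -0.1234433 \
  -1.432234 'adubu madu ma yaakhidre haikhre'      -0.5432345 > "$data"
printf '%s\n' 'hai wafam cherol makha palli adubu madu makha yaakhidre haikhre tamlakle.' > "$input"

rewrite_sentences "$data" "$input"
```

For the demo sentence this prints "hai wafam cherolna makha palli adubu madu makhada yaakhidre haikhre tamlakle.", which agrees with the expected output given above. Because of the NR == FNR idiom, the sketch assumes that TEXTFILE is not empty.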