I have a set of input strings in the pattern given below:
string1 string2 string3 string4 string5
I need to search for this sequence of strings in a file such that the first two strings (string1 and string2) and the last two strings (string4 and string5) match the strings in the SECOND column of a three-column text file, with the result then chosen by comparing the numbers in the other columns.
So, the script will search a big text file called TEXTFILE.TXT for the entries that match string1, string2, string4 and string5. Then it will return the entry which has the biggest (negative) number in the FIRST column and the entry which has the biggest negative number in the THIRD column.
A sample of the TEXTFILE.TXT format is given below. The FIRST and SECOND columns are separated by a tab, and the SECOND and THIRD columns are separated by a tab and a space. The strings within the SECOND column are separated by a space. An entry in the SECOND column may consist of anything from a single string up to five strings.
For example, my input is
string1 string2 string3 string4 string5
hai wafam cherol makha palli
Now there are four entries in the text file which match the input. So, the output will be the two string sets:
hai wafam cherolna makha palli
hai wafam cherolduna makha palli
File format of TEXTFILE.TXT
-1.391722 hai wafam cherolna makha palli -0.6328273
-2.922845 hai wafam cherolduna makha palli -0.1190167
-2.915667 hai wafam cherolsina makha palli -0.5702463
-2.927181 hai wafam paochena makha palli -0.1963889
-2.925497 hai wafam khangnaduna -0.6328273
-2.855543 hai wafam ngasigi
-2.926619 hai wafam thamkharabani
-1.635051 hai wafam thamlamle -0.4567362
-1.078001 hai wafam thamlamli -0.8960688
-1.023442 adubu madu makhada yaakhidre haikhre -0.1234433
-1.432234 adubu madu ma yaakhidre haikhre -0.5432345
I need help writing a script to perform the above task. Thanks in advance.
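A minimal sketch of the search described above, in Python, under several assumptions that the post leaves open. It assumes the fields can be recovered by splitting on whitespace (the first token is the FIRST column, a trailing numeric token is the THIRD column, everything in between is the SECOND column), that string1, string2, string4 and string5 must match the corresponding words of a five-word entry exactly while string3 is ignored, and that "biggest" means numerically largest (closest to zero), which is the reading consistent with the sample output. The names `parse_line`, `find_matches` and `pick_best` are hypothetical.

```python
# A few lines of the sample TEXTFILE.TXT, embedded so the sketch is self-contained.
SAMPLE = """\
-1.391722 hai wafam cherolna makha palli -0.6328273
-2.922845 hai wafam cherolduna makha palli -0.1190167
-2.915667 hai wafam cherolsina makha palli -0.5702463
-2.927181 hai wafam paochena makha palli -0.1963889
-2.925497 hai wafam khangnaduna -0.6328273
-2.855543 hai wafam ngasigi
-1.023442 adubu madu makhada yaakhidre haikhre -0.1234433
"""


def parse_line(line):
    """Return (first_number, words, third_number or None) for one line."""
    tokens = line.split()
    first = float(tokens[0])
    words = tokens[1:]
    third = None
    if words:
        try:
            # Assumption: a trailing token that parses as a number is column 3.
            third = float(words[-1])
            words = words[:-1]
        except ValueError:
            pass  # this line has no third column
    return first, words, third


def find_matches(query, entries):
    """Entries whose words 1, 2, 4 and 5 equal those of the query (word 3 ignored)."""
    s1, s2, _, s4, s5 = query
    return [
        e for e in entries
        if len(e[1]) == 5
        and e[1][0] == s1 and e[1][1] == s2
        and e[1][3] == s4 and e[1][4] == s5
    ]


def pick_best(matches):
    """Pick the entry with the largest column 1 and the one with the largest column 3."""
    best_first = max(matches, key=lambda e: e[0])
    with_third = [e for e in matches if e[2] is not None]
    best_third = max(with_third, key=lambda e: e[2]) if with_third else None
    return best_first, best_third


entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
matches = find_matches("hai wafam cherol makha palli".split(), entries)
best_first, best_third = pick_best(matches)
print(" ".join(best_first[1]))
print(" ".join(best_third[1]))
```

With the sample lines above, this prints the two string sets given as the expected output. If "biggest negative number" is instead meant as most negative, swapping `max` for `min` in `pick_best` gives that reading.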
What tools do you want to use? (Does your forum user name mean you only want to use perl?)
What have you tried? Please show us what you have tried (in CODE tags). If you had shown us what you have tried, some of the questions below would already have been answered.
You say that the 2nd and 3rd fields in your file are separated by a tab and a space, and that the 1st and 2nd fields are separated by a tab. But there aren't any tabs in your file, and even if you had converted your tabs to spaces when you pasted your sample TEXTFILE.TXT here, the 3rd field is not aligned as it would be if there had been a tab between the fields.
How are the five strings found by your code? Are they operands passed to your script? Are they in another file? If so, what is the format of that file? Can any of the strings contain spaces? Can any of the strings contain any characters that are special in an extended regular expression, or do the strings just consist of alphanumeric characters?
You say that string1 and string2 and string4 and string5 should match field 2 in your file. Do they need to be in sequence within field 2 in the file (as they are in your example), or does each of those four strings just have to appear somewhere within field 2? Is string 3 also supposed to match (as it does in your example), or is your code just supposed to ignore string 3? Can the matches overlap, or does each of the four (or five) strings need to match unique substrings in field 2 (as they do in your example)?
Does the above quote mean that you want the fields that match all four strings, and then from all of the fields matching those criteria choose the single matching field that also has the most negative value in field 1 and the single matching field that also has the most negative value in field 3?
Or, are you looking for the field matching string 1 and string 2 that has the most negative value in field 1 and for the field matching string 4 and string 5 that has the most negative value in field 3?
Are lines with non-negative values in fields 1 and/or 3 supposed to be ignored when matching the associated strings?
Why does the line
-2.927181 hai wafam paochena makha palli -0.1963889
not show up in your sample output? It has the most negative value of the four matches in the first field.
In what order should the output lines appear?
I was trying to change the script so that it reads the input sentences one after another from a text file called INPUT.txt and performs the above search operation on each.
The format of INPUT.txt is:
hai wafam cherol makha palli adubu madu makha yaakhidre haikhre tamlakle.
hairiba waridu cherol makhada pallina adubu madu makha yaakhidre haikhre tamlaklenasu hairi.
adubu madui saruk amuk hanna khannanaba wafam thangaatkhre hairi.
.....
.....
I need help with this part: if no match is found at all in the first search operation, then continue searching from the second string (string2) to the sixth string (string6) of the sentence, considering five strings at a time. If a match is found and retrieved, I want to modify the script so that the same search operation is repeated for the next set of five strings, starting from the sixth string, since the first five input strings (string1, string2, string3, string4 and string5) are already done. This search operation continues in the same way until the end of the sentence.
You lost me. Please explain in smaller steps. What I understand is instead of five strings
string1 string2 string3 string4 string5
hai wafam cherol makha palli
and dropping the third, you'll have "sentences" consisting of between 9 and 12 strings, of which none should be dropped for the comparison, and you want to compare 1 to 5, if no match then 2 to 6, and, if either matches, compare 6 to 10. What if there are only 9 strings in a sentence? If there are more, what should be done with string11 and string12?
And, what to print for either match?
In the case of only 9 strings, the search will be from strings 1 to 5 and 5 to 9, so that each search and match has five strings. In the case of 11 or 12 strings, the search will be from the 7th to the 11th and from the 8th to the 12th respectively, so that there are five strings for the match in all cases. Thanks a lot.
Consider the following sentence as input:
hai wafam cherol makha palli adubu madu makha yaakhidre haikhre tamlakle.
The expected output will be
hai wafam cherolna makha palli adubu madu makhada yaakhidre haikhre tamlakle.
The last word [tamlakle] is appended unchanged, on the assumption that searching the last five words would not be beneficial, since two replacements have already been made: the third word of words 1-5 and the third word of words 6-10. Please let me know if I have missed anything.
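The sentence-level procedure can be sketched in Python as follows. This is only one reading of the description, with assumptions: a window of five words is tried at the current position; on a match the third word is replaced by that of the matching entry with the numerically largest FIRST column (the choice consistent with the expected output above) and the window jumps ahead by five words; on no match the window slides by one word; and any words left over at the end are copied unchanged. It does not implement the overlapping end-of-sentence windows mentioned for 9, 11 or 12 strings. The names `parse_line`, `best_third_word` and `rewrite_sentence` are hypothetical.

```python
# Matching entries from the sample TEXTFILE.TXT, embedded to keep this self-contained.
SAMPLE = """\
-1.391722 hai wafam cherolna makha palli -0.6328273
-2.922845 hai wafam cherolduna makha palli -0.1190167
-2.915667 hai wafam cherolsina makha palli -0.5702463
-2.927181 hai wafam paochena makha palli -0.1963889
-1.023442 adubu madu makhada yaakhidre haikhre -0.1234433
-1.432234 adubu madu ma yaakhidre haikhre -0.5432345
"""


def parse_line(line):
    """Return (first_number, words); a trailing numeric token is dropped as column 3."""
    tokens = line.split()
    first = float(tokens[0])
    words = tokens[1:]
    if words:
        try:
            float(words[-1])
            words = words[:-1]
        except ValueError:
            pass  # no third column on this line
    return first, words


def best_third_word(window, entries):
    """Third word of the best entry matching words 1, 2, 4, 5 of the window, or None."""
    s1, s2, _, s4, s5 = window
    candidates = [
        (first, words) for first, words in entries
        if len(words) == 5
        and words[0] == s1 and words[1] == s2
        and words[3] == s4 and words[4] == s5
    ]
    if not candidates:
        return None
    # Assumption: "best" means the numerically largest first-column value.
    return max(candidates, key=lambda c: c[0])[1][2]


def rewrite_sentence(sentence, entries):
    """Slide a five-word window over the sentence, replacing third words on a match."""
    words = sentence.split()
    out = []
    i = 0
    while i + 5 <= len(words):
        window = words[i:i + 5]
        replacement = best_third_word(window, entries)
        if replacement is None:
            out.append(words[i])  # no match: keep this word, slide by one
            i += 1
        else:
            out.extend(window[:2] + [replacement] + window[3:])
            i += 5  # match: continue with the next five words
    out.extend(words[i:])  # fewer than five words left: copy unchanged
    return " ".join(out)


entries = [parse_line(line) for line in SAMPLE.splitlines() if line.strip()]
sentence = "hai wafam cherol makha palli adubu madu makha yaakhidre haikhre tamlakle."
result = rewrite_sentence(sentence, entries)
print(result)
```

For the input sentence above, this reproduces the expected output. Note that punctuation stays attached to its word, so a window whose fourth or fifth word carries a trailing full stop will not match unless the punctuation is stripped first.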