I am trying to search for a given text in a file and find the index of its last occurrence. The task is to append that index to each matching line of the same file, as a separate column. I have only accomplished the task partially and am looking for a solution.
Following is the detailed description:
names_file.txt
111|lkasjdfaaa|555|tarun trehan
111|aaa|55765|vikram batra
111|aaa|555|allzhere blog
65876111|aaa|555|allzhere android apps
111|hhhaaa|555|allzhere on facebook
111|aaa|555|contact updater utility
111|aaa|444555|help me
Assume I am searching for the last occurrence index of "n" in the name, i.e. the last column of the file. The output file should then be as follows, where the appended column holds the last index of "n" in the name column (indexes are 1-based; lines whose name contains no "n" are dropped):
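111|lkasjdfaaa|555|tarun trehan|12
65876111|aaa|555|allzhere android apps|11
111|hhhaaa|555|allzhere on facebook|11
111|aaa|555|contact updater utility|3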
I am getting a syntax error around n = match ( $i, /n[^n]*$/ ) .. maybe I am not pasting the characters correctly.
Use nawk instead if you are using SunOS or Solaris.
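In full, the command looks like this:

nawk -F\| '
{
    for ( i = 1; i <= NF; i++ )
        n = match ( $i, /n[^n]*$/ )    # match() returns the position of the last "n" in the field
    if ( n )
        print $0 OFS n
} ' OFS=\| names_file.txt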
This code scans all fields, and I would like to search only the 4th column. I think I can do that by adding an if clause for i == 4.
Yes: remove the for loop and use $4 instead: n = match ( $4, /n[^n]*$/ )
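So the whole command reduces to:

nawk -F\| '{ n = match( $4, /n[^n]*$/ ); if ( n ) print $0 OFS n }' OFS=\| names_file.txt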
I am text mining a huge file set, i.e. 10 million records. Is line-by-line scanning a viable option?
Yes.
Thanks for the inputs. nawk works, but I have another challenge here:
The string to search for is dynamic and hence needs to be passed to awk as a variable.
I tried the following code, though I was not confident about it. It doesn't work as expected. Can you please provide your inputs here:
export SRCH_PARAM="an"
nawk -v X="$SRCH_PARAM" -F\| '
{
    for ( i = 1; i <= NF; i++ )
    {
        n = match ( $4, /n[^X]*$/ )
    }
    if ( n )
        print $0 OFS n
} ' OFS=\| names_file.txt
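The problem is that an awk /.../ regex literal is taken verbatim: the X inside /n[^X]*$/ is a literal character X, not the variable, and the leading n is still hard-coded. The pattern has to be built as a string instead. A minimal sketch of one way to do it, assuming the keyword contains no regex metacharacters: prepend a greedy ".*" so the keyword matches at its last occurrence, then derive the index from RSTART/RLENGTH:

nawk -v X="$SRCH_PARAM" -F\| '
{
    # ".*" is greedy, so X matches at its last occurrence in $4;
    # the match runs from RSTART to the end of that occurrence
    if ( match( $4, ".*" X ) )
        print $0 OFS (RSTART + RLENGTH - length(X))
} ' OFS=\| names_file.txt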
The project involves searching for a list of keywords in a master input file:
master_file.txt, containing the strings. Record count: 20,000,000 string records.
keywords.txt, containing the keywords. Record count: 200,000 unique keywords.
I run a shell script which reads one keyword at a time and runs the above-mentioned command against master_file.txt to append the desired output.
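In outline, the driver loop is something like this sketch (output_file.txt is just an illustrative name):

while read -r SRCH_PARAM
do
    # one nawk process per keyword, re-reading the whole master file each time
    nawk -v X="$SRCH_PARAM" -F\| '
        { if ( match( $4, ".*" X ) ) print $0 OFS (RSTART + RLENGTH - length(X)) }
    ' OFS=\| master_file.txt >> output_file.txt
done < keywords.txt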
The end result meets our expectations.
However, I am concerned about the performance and response time of this utility.
I tried 16,000 keywords against 20,000 master records, and the process took around 25 minutes.
I am looking to reduce this number and have considered the following:
Split the file into n parts, run the searches in parallel, and then collate the results?
Possible tweaking of the commands?
Is text mining in shell sound from a design and feasibility perspective?
You shouldn't run one command (= one process) per keyword; that's far too inefficient. On the other hand, the numbers you indicated may be too high for grep or awk. There are programs, applications and databases out there designed for text mining; I'm sure they'd be more appropriate for your task.
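For example, instead of one awk process per keyword, a single nawk process could load every keyword up front and then scan the master file once. A rough sketch, assuming keywords.txt holds one plain-text keyword per line and the keywords contain no regex metacharacters:

nawk -F\| '
NR == FNR { kw[$0]; next }          # first file: collect the keywords
{
    # second file: test every keyword against the name column
    for ( k in kw )
        if ( match( $4, ".*" k ) )
            print $0 OFS k OFS (RSTART + RLENGTH - length(k))
} ' OFS=\| keywords.txt master_file.txt

This removes the per-keyword process startup and the repeated reads of the master file, although with 200,000 keywords the inner loop is still a lot of work per record.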
Running in parallel is certainly one option, and I am exploring it.
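A rough sketch of that idea, splitting the master file into fixed-size chunks and searching each chunk in a background job (the chunk size and file names are illustrative):

split -l 5000000 master_file.txt part_
for f in part_*
do
    nawk -v X="$SRCH_PARAM" -F\| '
        { if ( match( $4, ".*" X ) ) print $0 OFS (RSTART + RLENGTH - length(X)) }
    ' OFS=\| "$f" > "$f.out" &
done
wait    # block until every background search finishes
cat part_*.out > collated_output.txt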
I certainly agree there are specialized programs designed for this, but I wanted to invest some time in finding a solution at the ground level.
I find shell scripting plus awk/sed/grep a great way to prototype a concept, but it's worth knowing the limitations of the tools you are using. I'd suggest sticking with a smaller subset to finalize your prototype and, in the background, starting to research text mining tools, relational databases, etc.