awk : search last index in specific column

I am trying to search for a given text in a file and find the index of its last occurrence. The task is to append that index to the same file as a separate column. I have managed part of the task and am looking for a complete solution.

Following is the detailed description:

names_file.txt

111|lkasjdfaaa|555|tarun trehan
111|aaa|55765|vikram batra
111|aaa|555|allzhere blog
65876111|aaa|555|allzhere android apps
111|hhhaaa|555|allzhere on facebook
111|aaa|555|contact updater utility
111|aaa|444555|help me

Suppose I am searching for the last occurrence of "n" in the name, i.e. the last column of the file. The output file should then look as follows, where the appended column holds the last index of "n" in the name column:

Output File :

111|lkasjdfaaa|555|tarun trehan|12
65876111|aaa|555|allzhere android apps|11
111|hhhaaa|555|allzhere on facebook|11
111|aaa|555|contact updater utility|3

I tried the following command:

cat names_file.txt | awk -F"|" '{print $4}' | awk -F"n" 'NF>1{print $0"|"length($0) - length($NF)}'

It gives me the following results:

tarun trehan|12
allzhere android apps|11
allzhere on facebook|11
contact updater utility|3

How can I keep the other columns from the first awk while still scanning only the 4th column? I want the desired output described above.

Using awk match function:

awk -F\| '
        {
                for ( i = 1; i <= NF; i++ )
                {
                        n = match ( $i, /n[^n]*$/ )
                }
                if ( n )
                        print $0, n
        }
' OFS=\| names_file.txt

Excellent, Yoda, I am a big fan of yours. Could you please explain the code, in particular the match function?

Thanks,
R. Singh

Syntax:

match(string, regexp)

The match function searches the string for the longest, leftmost substring matched by the regexp. It returns the character position (index) where that substring begins, or 0 if there is no match.

Since the match function finds the leftmost match, I used the regexp /n[^n]*$/ to anchor on the last occurrence of the character n.

Refer: GNU awk string functions

Yoda,

Thanks for your reply. Appreciate your inputs.
I have the following queries:

  1. I am getting a syntax error around

n = match ( $i, /n[^n]*$/ )

...maybe I am not pasting the characters correctly.

  2. This code scans all columns and I would like to search only the 4th column. I think I can do that by adding an if clause for i==4.

  3. I am doing text mining on a huge file set, i.e. 10 million records. Is line-by-line scanning a viable option?

  1. I am getting a syntax error around the n = match ( $i, /n[^n]*$/ ) ...maybe I am not pasting the characters correctly.
    Use nawk instead if you are using SunOS or Solaris.

  2. This code scans all columns and I would like to search only the 4th column. I think I can do that by adding an if clause for i==4.
    Yes, remove the for loop and use $4 instead: n = match ( $4, /n[^n]*$/ )

  3. I am doing text mining on a huge file set, i.e. 10 million records. Is line-by-line scanning a viable option?
    Yes.
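Dropping the loop and testing only the fourth column, the whole script might collapse to a one-liner like this (a sketch against the sample data; substitute nawk on SunOS/Solaris):

```shell
# Find the last "n" in column 4 and append its 1-based index.
# Rows whose name contains no "n" get match() == 0 and are skipped,
# which matches the expected output shown earlier in the thread.
printf '%s\n' '111|lkasjdfaaa|555|tarun trehan' '111|aaa|555|allzhere blog' |
awk -F\| '{ n = match($4, /n[^n]*$/) } n { print $0, n }' OFS=\|
```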

Hi Yoda,

Sorry to bother you. I have tried researching the match expression you used in the code but was not able to understand it. Could you please explain it?

n = match ( $i, /n[^n]*$/ )

Thanks,
R. Singh

match also sets the RSTART and RLENGTH variables, so your proposal could be written as

awk -F\| 'match ($NF, /n[^n]*$/) {print $0, RSTART}' OFS=\| file
111|lkasjdfaaa|555|tarun trehan|12
65876111|aaa|555|allzhere android apps|11
111|hhhaaa|555|allzhere on facebook|11
111|aaa|555|contact updater utility|3

The regexp /n[^n]*$/ means: search for the character n, followed by zero or more occurrences of any character other than n ([^n]*), anchored at the end of the string ($).

So for the string RavinderSingh13, this regexp matches the last n and returns its index (character position):

$ echo "RavinderSingh13" | awk '{ print match ( $0, /n[^n]*$/ ) }'
11

You could also use:

awk -F\| '
{ for(F=0; match(substr($4,F+1),"n"); F+=RSTART);
  $5=F
}
F' OFS=\| names_file.txt

Edit: RudiC's solution #8 is similar but better implemented!

However, this is still a valid method when the match item is more than one character long (or even an RE in its own right).
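To illustrate that point, the same loop can chase a multi-character string such as "an" through the name column, advancing past each hit until match() fails, leaving F at the start of the last occurrence (a sketch using one of the sample rows):

```shell
# Repeatedly match "an" in the remainder of column 4; each hit advances
# F by RSTART, so when the loop ends F holds the index of the last "an".
echo '65876111|aaa|555|allzhere android apps' |
awk -F\| '
{ for (F = 0; match(substr($4, F+1), "an"); F += RSTART);
  $5 = F
}
F' OFS=\|
```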


Yoda,

Thanks for the inputs. nawk works, but I have another challenge here:
The string to search for is dynamic and hence needs to be passed as a variable to awk.

I tried the following code, though I was not confident about it.
It doesn't work as expected. Can you please provide your inputs here:

export SRCH_PARAM="an"
nawk -v X="$SRCH_PARAM" -F\| '
{
	for ( i = 1; i <= NF; i++ )
	{
			n = match ( $4, /n[^X]*$/ )
	}
	if ( n )
	print $0 OFS n
} ' OFS=\| names_file.txt
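One reason this attempt fails: inside a /.../ regex literal, awk treats X as the literal character X, never as the variable passed via -v. To use a variable, the pattern must be given as a string (a dynamic regexp). A minimal demonstration of the difference:

```shell
# /X/ looks for a literal "X" (no match in "banana", so 0);
# the bare X in string context is the variable, so match() finds "an" at 2.
echo 'banana' | awk -v X="an" '{ print match($0, /X/), match($0, X) }'
```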
awk -F\| -v X="an" 'match ($NF, ".*"X) {print $0, RLENGTH-length(X)+1}' OFS=\| file
111|lkasjdfaaa|555|tanrun trehan|12
65876111|aaa|555|allzhere android apps|10

Try this:

export SRCH_PARAM="an"
nawk -v X="$SRCH_PARAM" -F\| '
{ for(F=0; match(substr($4,F+1),X); F+=RSTART);
  $5=F
}
F' OFS=\| names_file.txt

All,

Thanks for your inputs.
The suggested solution works for me:

awk -F\| -v X="an" 'match ($NF, ".*"X) {print $0, RLENGTH-length(X)+1}' OFS=\| file

The project involves searching a list of keywords in a master input file.

  1. master_file.txt containing strings. Record count: 20,000,000 string records.
  2. keywords.txt containing keywords. Record count: 200,000 unique keywords.

I run a shell script that reads a keyword and runs the above command against master_file.txt, appending the desired output.

The end result is as expected.
However, I am concerned about the performance and response time of this utility.
With 16,000 keywords and 20,000 master records, the process took around 25 minutes.

I am looking to reduce this number and have considered the following:

  1. Split the file into n parts, run the searches in parallel, and then collate the results?
  2. Possible tweaks to the commands?
  3. Is text mining in shell sensible from a design and feasibility perspective?

Please provide your inputs.

You shouldn't run one command (= one process) per keyword; that's definitely too inefficient. On the other hand, the numbers you indicated may be too high for grep or awk. There are programs/applications/databases out there designed for text mining - I'm sure they'd be more appropriate for your task.
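As a rough sketch of the single-process idea (the filenames and the "print keyword plus last index" output format are assumptions, and it still loops over every keyword per record, so it only removes the per-keyword process overhead, not the algorithmic cost), the keywords could be loaded once and every master record scanned in one awk run:

```shell
# Load all keywords into an array on the first pass (NR == FNR), then
# scan column 4 of each master record against every keyword in a single
# process. Note: keywords are used as regexps here, so regex metacharacters
# in keywords.txt would need escaping.
awk -F\| '
NR == FNR { kw[$0]; next }                  # first file: collect keywords
{
    for (k in kw) {                         # second file: test column 4
        for (F = 0; match(substr($4, F+1), k); F += RSTART);
        if (F) print $0, k, F               # last-occurrence index of k
    }
}' OFS=\| keywords.txt master_file.txt
```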

Thanks Rudi.

Running in parallel is certainly one option, and I am exploring it.
I certainly agree there are specific programs designed for this, but I wanted to invest some time finding out what can be done at the ground level.

I find shell script plus awk/sed/grep a great way to prototype a concept, but it's worth knowing the limitations of the tools you are using. I'd suggest sticking with a smaller subset, finalizing your prototype, and in the background starting to research text mining tools, relational databases, etc.