Get nth occurrence of a string from a file

I have a file in which the data looks like this:

01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test1,41203016,,/      
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test2,41203017,,/     
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test3,41203018,,/  

I do not know in advance what values will appear in place of test1, test2 and test3 in my file.

I wrote a command to get these values from the file,

grep "^03,.*,.*$" file.dat | cut -d"," -f2

this returns,

test1
test2
test3

Is there a way I can get a specific occurrence (e.g. the 5th or 9th) of this string? Let's say, in this case, can I get only test2?

Appreciate your help.

Assuming your code works correctly for your data:

recnum=4
grep "^03,.*,.*$" file.dat | cut -d"," -f2 |  awk -v r_num=$recnum 'NR==r_num'

The reason for my caveat is your regex - looks unusual, but could be just right for your dataset.

Please use code tags for all code, data, and output.

awk is a powerful language for this sort of thing. It can match strings like grep, but it can also count lines and columns, split lines on a separator you choose, and use variables. You'll find it on any UNIX or UNIX-like system.

$ awk -F"," -v X="03" -v N=2 '($1 == X) && ((++L)>=(N+0)) { print $2 ; exit }' inputfile

test2

$

-F"," tells awk to treat , as the column separator.

-v X="03" and -v N=2 preset the X and N variables inside awk. awk has its own variables independent of the shell's.

($1 == X) && ((++L)>=(N+0)) { ... } means "run the following code when the first column equals the variable X and L has reached N". Because the action ends with exit, it runs only once, the first time L reaches N. $ means column in awk, not variable: $1 would be 01, $2 would be 0000000, etc. "++L" adds one to L every time ($1 == X) is true.

I did (N+0) to make sure awk used N as a number, not a string.

{ print $2 ; exit } means "print the second column, then stop reading this file".

So, a combined grep, wc, and cut.
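
The one-liner above can be wrapped in a small shell function so that the record type and the occurrence number become arguments. This is only a sketch: the function name nth_field and the temporary sample file are made up for illustration, and it assumes a POSIX shell and awk.

```shell
# Hypothetical helper: print field 2 of the Nth line whose first field is X.
# Usage: nth_field RECORD_TYPE N FILE
nth_field() {
    awk -F, -v X="$1" -v N="$2" '
        $1 == X && ++L == N+0 { print $2; exit }   # stop at the Nth match
    ' "$3"
}

# Sample data in the same layout as the question (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
01,0000000,xxxxxxx/
02,xxxxxxxx,yyyyyy/
03,test1,41203016,,/
01,0000000,xxxxxxx/
02,xxxxxxxx,yyyyyy/
03,test2,41203017,,/
01,0000000,xxxxxxx/
02,xxxxxxxx,yyyyyy/
03,test3,41203018,,/
EOF

second=$(nth_field 03 2 "$sample")
echo "$second"    # prints test2 for the sample above
```

If the file has fewer than N matching lines, the function prints nothing, so the caller can test for an empty result.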

Thanks a lot for the command and a beautiful explanation.

---------- Post updated 04-10-15 at 02:34 PM ---------- Previous update was 04-09-15 at 02:53 PM ----------

I have an additional requirement for this,

I need to replace this value with some other value.
Is there a way I can do this by using a command in conjunction with this awk command?

$ awk -F"," -v X="03" -v N=2 '($1 == X) && ((++L)>=(N+0)) { print $2 ; exit }' inputfile

In this test2 should be replaced with replace2.

Please help.

how about this?

{ $2="replace2"; print $2 ; exit }'

Thanks for your reply.

I tried this..

awk -F"," -v X="03" -v N=6 '($1 == X) && ((++L)>=(N+0)) { $2="replace2"; print $2 ; exit }' myfile.txt > temp.txt

but this replaces the whole content of my file with replace2.

care to give an example of what you are looking for? exact input and exact output samples would help.

My txt file has data something like this,

01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test1,41203016,,/      
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test2,41203017,,/     
01,0000000,xxxxxxx/     
04,xxxxx,00000,test2                   
02,xxxxxxxx,yyyyyy/                                    
03,test3,41203018,,/

I need to get the value in the 2nd column of each line that starts with 03.
In this case that means test1, test2 and test3. The same string may exist in other lines too (for example, test2 also appears in the line starting with 04, which I don't want to replace).

So this is what I am expecting:
get the value 'test1', pass it to a program that returns the replacement string 'replace1', and then replace 'test1' with 'replace1' in the file.
Similarly for test2 and test3.

I used this to get each occurrence's value, put it in a variable and use sed to replace it. But this replaces the 'test2' string in both the line starting with 03 and the line starting with 04.

testvalue=`awk -F"," -v X="03" -v N=2 '($1 == X) && ((++L)>=(N+0)) { print $2 ; exit }' "$1"`

replacevalue='replace2'

sed "s/$testvalue/$replacevalue/" "$1" > temp.txt

like this

01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,replace1,41203016,,/      
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,replace2,41203017,,/     
01,0000000,xxxxxxx/     
04,xxxxx,00000,replace2                   
02,xxxxxxxx,yyyyyy/                                    
03,replace3,41203018,,/

So I am looking for a single command which gets the string and also replaces it.

Try

awk -F, '$1=="03"{gsub(/test/,"replace",$2)}1' OFS="," file

It works when I hard-code the values, but when I use variables like this it does not replace anything:

awk -F, '$1=="03"{gsub(/$l_test/,$l_replace,$2)}1' OFS="," file.txt > temp.txt

You haven't defined any variables in this awk script except FS and OFS . In an ERE, $l_test is looking for end of field 2 followed by the string l_test in field 2 (which it will NEVER find) and for each time that it is found, will replace it with the entire input line (since undefined awk variables expand to 0 or an empty string depending on context, and in awk $ followed by a field number refers to the contents of that field). If you are saying that you have defined shell variables in your shell script:

l_test="test"
l_replace="replace"

and you want to use those variables as the extended regular expression search pattern and substitution replacement specification, respectively, inside your awk script, that could be done with something like:

l_test="test"
l_replace="replace"
awk -F, -v ERE="$l_test" -v rep="$l_replace" '$1=="03"{gsub(ERE,rep,$2)}1' OFS="," file.txt > temp.txt

Note that if your search pattern occurs multiple times in field 2 in your input, each occurrence will be replaced. If you just want to replace the first occurrence of the search pattern, you should change:

gsub(ERE,rep,$2)

to:

sub(ERE,rep,$2)

Thanks a lot. That was really helpful.
But even though I changed it to the line below, it is still replacing multiple occurrences. Am I missing something?

sub(ERE,rep,$2)

Using gsub(ERE,rep,$2) replaces every string matched by ERE in field 2 on each line with the replacement string indicated by rep .

Using sub(ERE,rep,$2) replaces the 1st string matched by ERE in field 2 on each line with the replacement string indicated by rep .

With the sample input you provided and the sample output you said you wanted, both of these do exactly what you said you wanted. You did not provide any sample input with the string test appearing two or more times in field 2 on a line where the 1st field on that line is the string 03 .
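
To make the difference visible, here is a small sketch using a made-up line (not from the original data) in which test appears twice in field 2. The variable names first_only and every_one are only for illustration.

```shell
# Illustrative line: "test" appears twice in field 2.
line='03,test_test,41203016,,/'

# sub() replaces only the first match in field 2.
first_only=$(printf '%s\n' "$line" |
    awk -F, -v OFS=, -v ERE="test" -v rep="replace" '$1=="03"{sub(ERE,rep,$2)}1')

# gsub() replaces every match in field 2.
every_one=$(printf '%s\n' "$line" |
    awk -F, -v OFS=, -v ERE="test" -v rep="replace" '$1=="03"{gsub(ERE,rep,$2)}1')

echo "$first_only"   # 03,replace_test,41203016,,/
echo "$every_one"    # 03,replace_replace,41203016,,/
```

Both functions work per line, so neither one limits the change to a single line of the file.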

You are right. But is there a way to replace only the first occurrence of the string in a file, on a line starting with 03?

What in "replaces the 1st string matched" is unclear? Or do you mean the first occurrence in the entire file only?

Untested, but this should work...

l_test="test"
l_replace="replace"
awk -F, -v ERE="$l_test" -v rep="$l_replace" '$1=="03" && x==0{x=sub(ERE,rep,$2)}1' OFS="," file.txt > temp.txt

This Worked. Thank you Don.
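
For the earlier goal of giving each 03 line its own replacement, one option is to count the 03 lines and overwrite field 2 outright instead of pattern-matching its contents. The following is a sketch under that assumption; the variable names N, rep and c and the shortened sample data are illustrative, and the replacement value is assumed to be known already.

```shell
# Shortened sample in the layout from the thread (illustrative only).
sample=$(mktemp)
cat > "$sample" <<'EOF'
01,0000000,xxxxxxx/
03,test1,41203016,,/
03,test2,41203017,,/
04,xxxxx,00000,test2
03,test3,41203018,,/
EOF

# Overwrite field 2 on the 2nd line whose first field is 03; print all lines.
result=$(awk -F, -v OFS=, -v N=2 -v rep="replace2" '
    $1 == "03" && ++c == N+0 { $2 = rep }   # only the Nth 03 line is changed
    1                                        # print every line, changed or not
' "$sample")

echo "$result"
rm -f "$sample"
```

Because the match is on the line's position rather than on the text test2, the 04 line that also contains test2 is left alone.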