Get nth occurrence of string from a file

r111 · April 9, 2015, 5:12pm

I have file in which the data looks like this,

01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test1,41203016,,/      
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test2,41203017,,/     
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test3,41203018,,/

I would not know what values will be there in place of test1, test2 and test3 in my file.

I wrote a command to get these values from the file,

grep "^03,.*,.*$" file.dat | cut -d"," -f2

this returns,

test1
test2
test3

Is there a way I can get a specific occurrence (e.g. the 5th or 9th) of this string? In this case, say, can I get only test2?

Appreciate your help.

jim_mcnamara · April 9, 2015, 5:19pm

Assuming your code works correctly for your data:

recnum=4
grep "^03,.*,.*$" file.dat | cut -d"," -f2 |  awk -v r_num=$recnum 'NR==r_num'

The reason for my caveat is your regex - looks unusual, but could be just right for your dataset.

Corona688 · April 9, 2015, 5:30pm

Please use code tags for all code, data, and output.

awk is a powerful language for this sort of thing. You can match strings like grep, but you can also count lines and columns, tell it how columns are separated, and use variables. You'll find it on any UNIX or UNIX-like system.

$ awk -F"," -v X="03" -v N=2 '($1 == X) && ((++L)>=(N+0)) { print $2 ; exit }' inputfile

test2

$

-F"," tells it to consider , as column splitters.

-v X="03" and -v N=2 preset the X and N variables inside awk. awk has its own variables independent of the shell's.

($1 == X) && ((++L)>=(N+0)) { ... } means "run the following code when the first column equals the variable X, and L has reached N". $ means column in awk, not variable. $1 would be 01, $2 would be 0000000, etc. "++L" adds one to L every time ($1 == X) is true. Because the block ends with exit, >= behaves the same as == here.

I did (N+0) to make sure awk used N as a number, not a string.

{ print $2 ; exit } means "print the second column, then stop reading this file".

So, a combined grep, wc, and cut.
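
A minimal sketch of the same idea wrapped in a shell function, in case it is handy to reuse. The function name nth_field and the temporary sample file are made up for illustration; the awk program is the one explained above, with -v values passed in from the function arguments.

```shell
# Hypothetical helper: print field 2 of the Nth line whose first field is KEY.
# Usage: nth_field KEY N FILE
nth_field() {
    awk -F, -v X="$1" -v N="$2" '($1 == X) && ((++L) >= (N+0)) { print $2; exit }' "$3"
}

# Sample data from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
01,0000000,xxxxxxx/
02,xxxxxxxx,yyyyyy/
03,test1,41203016,,/
01,0000000,xxxxxxx/
02,xxxxxxxx,yyyyyy/
03,test2,41203017,,/
01,0000000,xxxxxxx/
02,xxxxxxxx,yyyyyy/
03,test3,41203018,,/
EOF

# Second line starting with 03: prints test2
nth_field 03 2 "$sample"
```

If N is larger than the number of matching lines, nothing is printed.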

r111 · April 10, 2015, 5:34pm

Thanks a lot for the command and a beautiful explanation.

I have an additional requirement for this,

I need to replace this value with some other value.
Is there a way I can do this by using a command in conjunction with this awk command.

$ awk -F"," -v X="03" -v N=2 '($1 == X) && ((++L)>=(N+0)) { print $2 ; exit }' inputfile

In this test2 should be replaced with replace2.

Please help.

senhia83 · April 10, 2015, 5:56pm

how about this?

{ $2="replace2"; print $2 ; exit }'

r111 · April 10, 2015, 6:04pm

Thanks for your reply.

I tried this..

awk -F"," -v X="03" -v N=6 '($1 == X) && ((++L)>=(N+0)) { $2="replace2"; print $2 ; exit }' myfile.txt > temp.txt

but this is replacing whole content of my file with replace2.

senhia83 · April 10, 2015, 7:33pm

care to give an example of what you are looking for? exact input and exact output samples would help.

r111 · April 10, 2015, 7:54pm

My txt file has data something like this,

01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test1,41203016,,/      
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,test2,41203017,,/     
01,0000000,xxxxxxx/     
04,xxxxx,00000,test2                   
02,xxxxxxxx,yyyyyy/                                    
03,test3,41203018,,/

I need to get the value of the 2nd column on lines starting with 03.
In this case that is test1, test2 and test3. The same string may exist on other lines too (for example, test2 also appears on the line starting with 04, which I don't want to replace).

So this is what I am expecting:
get the value 'test1', pass it to a program that returns the replacement string 'replace1', and then replace test1 with replace1 in the file.
Similarly for test2 and test3.

I used this to get each occurrence's value, put it in a variable and use sed to replace it. But this replaces the 'test2' string on both the line starting with 03 and the line starting with 04.

testvalue=`awk -F"," -v X="03" -v N=2 '($1 == X) && ((++L)>=(N+0)) { print $2 ; exit }' "$1"`

replacevalue='replace2'

sed "s/$testvalue/$replacevalue/" "$1" > temp.txt

like this

01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,replace1,41203016,,/      
01,0000000,xxxxxxx/                        
02,xxxxxxxx,yyyyyy/                                    
03,replace2,41203017,,/     
01,0000000,xxxxxxx/     
04,xxxxx,00000,replace2                   
02,xxxxxxxx,yyyyyy/                                    
03,replace3,41203018,,/

So I am looking for single command which gets the string and also replaces it.

senhia83 · April 10, 2015, 8:02pm

Try

awk -F, '$1=="03"{gsub(/test/,"replace",$2)}1' OFS="," file

r111 · April 10, 2015, 8:28pm

It works when I hard-code the values, but when I use variables like this it does not replace anything.

awk -F, '$1=="03"{gsub(/$l_test/,$l_replace,$2)}1' OFS="," file.txt > temp.txt

Don_Cragun · April 11, 2015, 1:04am

You haven't defined any variables in this awk script except FS and OFS . In an ERE, $l_test is looking for end of field 2 followed by the string l_test in field 2 (which it will NEVER find) and for each time that it is found, will replace it with the entire input line (since undefined awk variables expand to 0 or an empty string depending on context, and in awk $ followed by a field number refers to the contents of that field). If you are saying that you have defined shell variables in your shell script:

l_test="test"
l_replace="replace"

and you want to use those variables as the extended regular expression search pattern and substitution replacement specification, respectively, inside your awk script, that could be done with something like:

l_test="test"
l_replace="replace"
awk -F, -v ERE="$l_test" -v rep="$l_replace" '$1=="03"{gsub(ERE,rep,$2)}1' OFS="," file.txt > temp.txt

Note that if your search pattern occurs multiple times in field 2 in your input, each occurrence will be replaced. If you just want to replace the first occurrence of the search pattern, you should change:

gsub(ERE,rep,$2)

to:

sub(ERE,rep,$2)
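
To see the difference between the two, here is a small sketch on a line where the pattern occurs twice in field 2. The input line 03,testtest,1 and the replacement X are invented for illustration, and OFS is set with -v rather than as a trailing assignment because the input comes from a pipe.

```shell
# Invented input: field 2 contains the pattern "test" twice.
line='03,testtest,1'

# gsub replaces every match in field 2.
all=$(printf '%s\n' "$line" |
    awk -F, -v OFS=, -v ERE="test" -v rep="X" '$1=="03"{gsub(ERE,rep,$2)}1')

# sub replaces only the first match in field 2.
first=$(printf '%s\n' "$line" |
    awk -F, -v OFS=, -v ERE="test" -v rep="X" '$1=="03"{sub(ERE,rep,$2)}1')

echo "$all"     # 03,XX,1
echo "$first"   # 03,Xtest,1
```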

r111 · April 11, 2015, 1:34am

Thanks a lot. That was really helpful.
But even though I changed it to the following, it is still replacing multiple occurrences. Am I missing something?

sub(ERE,rep,$2)

Don_Cragun · April 11, 2015, 3:31am

Using gsub(ERE,rep,$2) replaces every string matched by ERE in field 2 on each line with the replacement string indicated by rep .

Using sub(ERE,rep,$2) replaces the 1st string matched by ERE in field 2 on each line with the replacement string indicated by rep .

With the sample input you provided and the sample output you said you wanted, both of these do exactly what you said you wanted. You did not provide any sample input with the string test appearing two or more times in field 2 on a line where the 1st field on that line is the string 03 .

r111 · April 11, 2015, 10:57am

You are right. But is there a way to replace only the first occurrence of the string in the file, on lines starting with 03?

RudiC · April 11, 2015, 2:18pm

What in "replaces the 1st string matched" is unclear? Or do you mean the first occurrence in the entire file only?

Don_Cragun · April 11, 2015, 3:03pm

Untested, but this should work...

l_test="test"
l_replace="replace"
awk -F, -v ERE="$l_test" -v rep="$l_replace" '$1=="03" && x==0{x=sub(ERE,rep,$2)}1' OFS="," file.txt > temp.txt
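
If the goal is instead to replace field 2 on the Nth line starting with 03 (the original "nth occurrence" question combined with the replacement), a sketch along these lines might work. The function name replace_nth and the shortened sample data are hypothetical, and it overwrites the whole field rather than applying a regex. Note that awk interprets backslash escapes in values passed with -v.

```shell
# Hypothetical helper: set field 2 of the Nth line whose first field is KEY to NEW.
# Usage: replace_nth KEY N NEW FILE   (result goes to stdout)
replace_nth() {
    awk -F, -v OFS=, -v X="$1" -v N="$2" -v new="$3" '
        ($1 == X) && (++L == N+0) { $2 = new }   # change only the Nth matching line
        1                                         # print every line, changed or not
    ' "$4"
}

# Shortened sample, including a 04 line that also contains test2.
data=$(mktemp)
cat > "$data" <<'EOF'
01,0000000,xxxxxxx/
03,test1,41203016,,/
03,test2,41203017,,/
04,xxxxx,00000,test2
03,test3,41203018,,/
EOF

out=$(replace_nth 03 2 replace2 "$data")
echo "$out"
```

Only the second 03 line changes; the 04 line keeps its test2 because its first field never matches.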

r111 · April 11, 2015, 6:05pm

This worked. Thank you, Don.