Finding Strings between 2 characters in a file

rtagarra · March 2, 2013, 2:57pm

Hi All,

Assume I have a file test.dat with the following contents:

   Unix = abc def fgt jug
            111 2222 3333

   Linux = gggg pppp qqq

   C# = ccc ffff llll

I would like to traverse the file, find the first occurrence of "=", and extract the string contained between the first and second occurrences of "=" so that I can validate it.
The same logic then has to continue for the string between the second and third occurrences of "=". Could you please let me know if this is possible? Please note that my file may contain any number of "=", so the logic has to be generic. Since I am new to Unix, it would be great if you could guide me with the complete code.

bartus11 · March 2, 2013, 3:00pm

What would be the desired output for this sample data?

rtagarra · March 2, 2013, 3:05pm

Thanks a ton for the reply. Basically, I want a shell script that reads the file and does some validations. For example, given

Unix = abc def fgt jug
111 2222 3333

Linux = gggg pppp qqq

if I get the string between occurrences of "=", it would be (

abc def fgt jug
111 2222 3333

). Now, since they follow the word "Unix =", I want to add a validation that all these words have an extension of .sh, and similarly for the other types. I hope I am clear.

Yoda · March 2, 2013, 7:30pm

Your requirement is still not clear.

By the way, in your previous thread we provided you with an approach to read strings separated by =.

So what have you tried so far to solve your current requirement? If you are stuck at some point, then give us the details.

RudiC · March 3, 2013, 4:06am

Please use code tags as required by forum rules!

Your requirement is inconsistent with your desired output. The first string between occurrences of "=" is

 abc def fgt jug
             111 2222 3333

   Linux

On top of bipinajith's reasonable questions, please post a consistent request.

shamrock · March 3, 2013, 1:48pm

Try this awk script...

awk -F"[=]" 'BEGIN{RS=""} {print $2}' test.dat

What exactly do you mean by the above statement? For validation you would need to initialise an associative array with all known OSes and languages.
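A minimal sketch of the associative-array idea, assuming the type names and extensions from the C/perl/C++ sample file posted later in the thread. The function name validate and the contents of the ext table are hypothetical. It works line by line, so it does not depend on how the blocks are separated, but it does assume that "=" is a separate word.

```shell
# validate: read the data file on stdin and report each file name as OK, BAD or UNKNOWN.
validate() {
  awk '
    BEGIN {
      # Hypothetical lookup table: type name -> required extension.
      ext["C"] = ".c"; ext["perl"] = ".pl"; ext["C++"] = ".cpp"
    }
    {
      first = 1                                   # continuation line: all words are files
      if ($2 == "=") { type = $1; first = 3 }     # new block: files start at word 3
      if (type == "") next                        # nothing seen yet (header text)
      for (i = first; i <= NF; i++) {
        if (!(type in ext)) {
          status = "UNKNOWN"
        } else {
          want = ext[type]
          # Compare the end of the word with the required extension.
          tail = substr($i, length($i) - length(want) + 1)
          status = (tail == want) ? "OK" : "BAD"
        }
        print status, type, $i
      }
    }'
}
# Usage: validate < test.dat
```

Each output line has the form "status type filename", so a calling script can grep for BAD or UNKNOWN.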

rtagarra · March 10, 2013, 5:51am

Hi Shamrock,

Sincere apologies for the delay in responding. Your solution works perfectly fine. The only issue is that it gets me all the details in one go. Let me state my requirement again. Below is my sample file:

C = test1.c test2.c test3.c
      test4.c test5.c

perl = perl1.pl perl2.pl

C++ = test1.cpp test2.cpp

Now my requirement is to read this file from a shell script and get the words (test1.c test2.c test3.c,test4.c test5.c). Since these files follow "C =", I need to check that they have the file extension ".c".
Similarly, I next need to get (perl1.pl perl2.pl), and since these two words follow "perl =", I need to check that they have ".pl" as their extension. The same applies to the last line, C++.

Please note that the data file can have any number of lines. I hope I am clear about the requirement this time.

The solution you provided gets all the details in one go, i.e.

test1.c test2.c test3.c,test4.c test5.c
perl1.pl perl2.pl
test1.cpp test2.cpp

Many thanks to everyone for their help.

Scrutinizer · March 10, 2013, 6:19am

One way might be:

awk '{$1=$1; sub($1 FS $2 FS,x)}1' RS= file

But only if the empty lines are completely empty (contain no whitespace).
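If whitespace-only separator lines are a concern, one possible workaround is to blank them out before awk sees the file. The sketch below is a variant, not the command above: the function name values_per_block is hypothetical, and it uses index() and substr() instead of sub() so that a type name such as "C++" is never interpreted as a regular expression. It still assumes that "=" is a separate word.

```shell
# values_per_block: read the data file on stdin, print one line of values per block.
values_per_block() {
  sed 's/^[[:space:]]*$//' |     # turn whitespace-only lines into truly empty lines
  awk 'BEGIN { RS = "" }         # paragraph mode: each blank-line-separated block is a record
       {
         $1 = $1                                  # rebuild $0 with single spaces between words
         print substr($0, index($0, "=") + 2)     # keep what follows the first "= "
       }'
}
# Usage: values_per_block < file
```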

rtagarra · March 10, 2013, 7:01am

Thanks for the reply. I am not sure whether the empty lines will be completely empty. Can we do something that does not need any such assumption? I am fine with closing the data at the end with ";", so is there any way to get the string that starts with "=" and ends with ";"?
Your solution again gets me all the lines of the file in one go. Please let me know if I can get them one by one, as explained in my post. Once again, thanks for spending time on helping me.

---------- Post updated at 06:01 AM ---------- Previous update was at 05:34 AM ----------

Also, please note that it is not a fixed-length file, meaning the number of words is not fixed.

Scrutinizer · March 10, 2013, 7:20am

One thing that is not clear is where the script gets the information that "C" files need the extension ".c", C++ files need ".cpp", and so on.

rtagarra · March 10, 2013, 7:22am

That information will not come from anywhere. Once I am able to extract the strings, I will write a function to do the validation, but in the first place I don't know how to extract the string. I hope I am clear. Is there any command in Unix like Oracle's INSTR? If there is, I can simply get the position of ";" and then get the substring from there using substr. Any advice, please?
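For reference, awk's index() and substr() play roughly the role of INSTR and SUBSTR, and the shell can do the same cut with parameter expansion. The sketch below shows both on a single string, assuming the data really is terminated with ";" as proposed earlier. The function names between and between_awk are hypothetical.

```shell
# between STRING: print the text after the first "=" and before the first following ";".
between() {
  local s=$1
  s=${s#*=}       # drop everything up to and including the first "="
  s=${s%%;*}      # drop everything from the first remaining ";" onward
  printf '%s\n' "$s"
}

# between_awk: the same for each line on stdin, using positions as INSTR/SUBSTR would.
between_awk() {
  awk '{
    p = index($0, "=")                      # position of the first "=" (0 if absent)
    q = index($0, ";")                      # position of the first ";" (0 if absent)
    if (p && q > p) print substr($0, p + 1, q - p - 1)
  }'
}

between "C = a.c b.c ; perl = x.pl ;"
echo "C = a.c b.c ; perl = x.pl ;" | between_awk
```

Both calls print the text between the first "=" and the first ";", including the surrounding blanks.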

Scrutinizer · March 10, 2013, 7:42am

Perhaps something like this is more what you are looking for:

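#!/bin/bash
# Read each line into an array of whitespace-separated words.
while read -r -a line
do
  # A line whose second word is "=" starts a new type.
  if [[ ${line[1]} == "=" ]]; then
    ptype=${line[0]}
    unset 'line[0]' 'line[1]'   # keep only the file names
  fi
  # The remaining words are files of the current type.
  for file in "${line[@]}"
  do
    echo "Check if \"$file\" belongs to type \"$ptype\""
  done
done < infile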

rtagarra · March 10, 2013, 9:50am

Thank you. This is what I was looking for. If possible, could you please confirm the following point: does line[1] mean that the "=" in my file has to be the second word on each line, without any space?

Scrutinizer · March 10, 2013, 9:51am

Yes, that is correct.
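To illustrate why the "=" has to stand alone as the second word: read -a splits on whitespace only, so an "=" glued to a neighbouring word is not recognised. A small sketch, in which the helper name second_word is hypothetical:

```shell
#!/bin/bash
# second_word LINE: split LINE the way "read -a" does and print array element 1.
second_word() {
  local -a line
  read -r -a line <<< "$1"
  printf '%s\n' "${line[1]:-}"
}

second_word "C = test1.c test2.c"    # prints: =
second_word "C= test1.c test2.c"     # prints: test1.c
second_word "C=test1.c test2.c"      # prints: test2.c
```

Only the first form passes the test [[ ${line[1]} == "=" ]].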

rtagarra · March 10, 2013, 9:58am

I am sorry, but maybe I was not clear in my requirement. I think the script you provided expects the file to contain only lines with "=", but my file may have a header section, such as comments telling the user how to use the data file. At the moment that header text is also picked up for validation. Is there any way to skip to the first occurrence of "=" in the file? From there, I believe the script you provided should work fine.

---------- Post updated at 08:58 AM ---------- Previous update was at 08:53 AM ----------

I think I can handle it by adding one more condition: if ptype is null, do not validate. So basically only those lines that have "=" will be validated. Thanks a lot for your help.

Scrutinizer · March 10, 2013, 10:01am

You could use a "#" to start the header lines and then test inside the loop:

[[ ${line[0]} == "#" ]] && continue

Or you could set a boolean when the first = is encountered and test for that later on. You decide the format of the file and can test for it, so it is up to you :).
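A minimal sketch of the boolean variant, assuming that no header line happens to have "=" as its second word. The function name list_checks and the flag name seen_eq are hypothetical.

```shell
#!/bin/bash
# list_checks: read the data file on stdin and ignore everything before the first "=" line.
list_checks() {
  local -a line
  local ptype="" seen_eq=0 file
  while read -r -a line; do
    if [[ ${line[1]:-} == "=" ]]; then
      seen_eq=1                  # the first (or a later) type line has been reached
      ptype=${line[0]}
      line=("${line[@]:2}")      # drop the type name and the "="
    fi
    (( seen_eq )) || continue    # still in the header: skip this line
    for file in "${line[@]}"; do
      echo "Check if \"$file\" belongs to type \"$ptype\""
    done
  done
}
# Usage: list_checks < infile
```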

--edit--
our posts crossed.. you're welcome..

rtagarra · March 10, 2013, 10:10am

Thanks a lot for your help. I got a one-off piece of work on shell scripting and hence I am struggling. Thanks a lot again and have a great day.

Yoda · March 10, 2013, 1:03pm

Here is another bash solution:

#!/bin/bash

while read line
do
        [[ "$line" =~ "=" ]] && ptype="${line%% *}"

        fname="${line#*=}"

        for F in ${fname}
        do
                if [ "${ptype}" = "C" ]
                then
                        [[ "$F" =~ \.c$ ]] && c_F="${c_F} ${F}"
                fi

                if [ "${ptype}" = "perl" ]
                then
                        [[ "$F" =~ \.pl$ ]] && pl_F="${pl_F} ${F}"
                fi

                if [ "${ptype}" = "C++" ]
                then
                        [[ "$F" =~ \.cpp$ ]] && cpp_F="${cpp_F} ${F}"
                fi
        done

done < file

echo -e "C\t${c_F}"
echo -e "Perl\t${pl_F}"
echo -e "C++\t${cpp_F}"

Producing output:

C        test1.c test2.c test3.c test4.c test5.c
Perl     perl1.pl perl2.pl
C++      test1.cpp test2.cpp

rtagarra · March 10, 2013, 1:12pm

Hi Bipin,

Thanks for the reply. I am sorry, but I could not understand much of the code.

        [[ "$line" =~ "=" ]] && ptype="${line%% *}"
(What does the above line do?)

        fname="${line#*=}"  (I think this will return the file extension)

        for F in ${fname}
        do
                if [ "${ptype}" = "C" ] 
                then
                        [[ "$F" =~ \.c$ ]] && c_F="${c_F} ${F}"
      (No clue what the above line does)
                fi

Could you please explain them?

Yoda · March 10, 2013, 1:31pm

These are shell parameter expansions, plus the =~ regexp operator.

The line below uses the regexp comparison operator =~ to check whether the line contains the pattern "=". If it does, %% * removes everything from the first blank space onward, and the result is assigned to the variable ptype:

[[ "$line" =~ "=" ]] && ptype="${line%% *}"

The line below uses #*= to remove everything up to and including the first = sign:

fname="${line#*=}"

The line below checks whether the extension is .c; if it is, the file name is appended to the value of the variable c_F:

[[ "$F" =~ \.c$ ]] && c_F="${c_F} ${F}"
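The three constructs can be tried in isolation. In this small sketch the helper names split_line and has_c_ext are hypothetical and exist only to show what each expansion leaves behind:

```shell
#!/bin/bash
# split_line LINE: print the type and the file list as the two expansions derive them.
split_line() {
  local line=$1
  local ptype=${line%% *}   # delete the longest suffix matching " *" (first space onward)
  local fname=${line#*=}    # delete the shortest prefix matching "*=" (through the first "=")
  printf 'type=[%s] files=[%s]\n' "$ptype" "$fname"
}

# has_c_ext NAME: succeed if NAME ends in ".c" (the regex \.c$ is anchored at the end).
has_c_ext() {
  [[ $1 =~ \.c$ ]]
}

split_line "C = test1.c test2.c"     # prints: type=[C] files=[ test1.c test2.c]
split_line "test4.c test5.c"         # no "=": the #*= expansion leaves the line unchanged
has_c_ext test1.c   && echo "test1.c matches"
has_c_ext test1.cpp || echo "test1.cpp does not match"
```

Because a continuation line contains no "=", fname is simply the whole line there, which is why the loop still picks up test4.c and test5.c under the previously stored ptype.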