Help - Search for string, then do string operation on line

deepaksinbox · August 17, 2009, 4:52am

Hi,

I wish to find all lines that contain a specific search word, and then do a few string operations on each such line. The idea is to "fix" a file which has been moved from Windows to Unix.

Using unix - Sun Solaris

Test input ("t2.sas")

 
statement1
statement2
libname  yahoo          '/analytics/CODE';
statement 3
libname google  "/analytics/india/CODE/test_DIR" ;
libname msn '/analytics/month end docs for sas/Wcode';
statement 4

Required Actions:

Find the lines which contain "libname". This is not necessarily at the start of the line (e.g. there may be leading spaces), but it is always the first word on the line.
Convert the entire libname line to lowercase
Replace spaces with underscores, but ONLY inside the path (delimited by quotes, which can be single or double).

Ideal output will be:

 
statement1
statement2
libname  yahoo          '/analytics/code';
statement 3
libname google  "/analytics/india/code/test_dir" ;
libname msn '/analytics/month_end_docs_for_sas/wcode';
statement 4

I can use sed and awk (no perl). Also, I don't have the GNU versions.

Thanks.

bakunin · August 17, 2009, 5:56am

This seems to be homework. There is a separate forum with specific rules for homework questions, so it won't be (and shouldn't be) answered here.

-closed-

/edit: deepaksinbox explained that it was NOT homework at all and I was wrong. My apologies, the thread is open again.

-reopened-

bakunin

Franklin52 · August 17, 2009, 6:22am

Try this:

awk -F"'" '/libname/{gsub(" ","_",$2); $0=tolower($0)}1' OFS="'" file |
awk -F'"' '/libname/{gsub(" ","_",$2)}1' OFS='"'

Use nawk or /usr/xpg4/bin/awk on Solaris.

bakunin · August 17, 2009, 7:07am

The trick to do this is to use a "range" construct with sed, i.e. an address followed by a block of commands in curly braces. It looks like this:

/expression/ {
            command1
            command2
            ....
            }

All the commands inside the curly braces are executed only on the lines that match "/expression/". Think of it as a sort of "if ... endif" construct: if the expression matches the current line, all the commands in the block are applied to it; otherwise the block is skipped.
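
As a minimal sketch of this construct (the sample lines and the two substitutions are made up for illustration and are not part of the original problem), only the line matching the address is changed, and both commands in the block are applied to it:

```shell
# Two sample lines; only the one starting with "edit" matches the address.
result=$(printf '%s\n' 'keep foo' 'edit foo' | sed '/^edit/ {
s/foo/bar/
s/$/ !/
}')

# The first line passes through untouched, the second gets both edits.
printf '%s\n' "$result"
```

The commands are written one per line inside the braces, which is the form that older non-GNU sed versions are most likely to accept.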

Applying this to your problem:

My first question would be whether "libname" itself could be mixed case too. Your (otherwise well stated) requirements are not completely clear on this. I suppose for now that "libname" itself is always lower case. Let's start with your point 1 (in the following scripts non-printing characters are written symbolically: replace "<spc>" with a literal space and "<tab>" with a literal tab character):

sed -n '/^[<spc><tab>]*libname/ {
              p
              }' t2.sas

This prints only the matched lines (the "-n" suppresses the default output and "p" prints explicitly), which gives us a hint whether our regexp is correct so far. Analyze the output and ask yourself:

Are all the lines I want to match matched?
Are lines matched that I do not want to be matched?

If either check fails, the regexp has to be refined. Suppose the test succeeded. On to your requirement 2: the "y" command replaces each character of one list with the corresponding character of another list. We use it with the whole alphabet as the list:

sed -n '/^[<spc><tab>]*libname/ {
              y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
              p
              }' t2.sas

Again: let this run, analyze the output, and check that it still is what you really want it to be. Don't worry that only the lines we work on are printed for now; this is just to make the checks easier.

On to requirement 3: I take it that by "space" you mean only space characters, not "white space" in general (which would include tab characters too). This is a tricky one, because we have to set up a sort of loop to reapply the same substitution over and over until all space characters inside the quotes are replaced. Basically we again use a "range" construct (see above) and inside it a "branch" command ("b"), which jumps back to the label at the beginning of the loop after each replaced space. Once all the spaces are replaced, the range expression no longer matches and execution continues after the block:

sed -n '/^[<spc><tab>]*libname/ {
              y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
              :loopstart
              /["'][^"']* [^"']*["']/ {
                     s/\(["'][^"']*\) /\1_/
                     b loopstart
                     }
              p
              }' t2.sas
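
To watch the loop in isolation, here is a reduced sketch on a made-up sample line. It assumes only double-quoted strings, so the script can sit inside single quotes without any quoting trouble, and it uses the shorter label "loop". Because "[^"]*" is greedy, the spaces inside the quotes are replaced from right to left, one per pass:

```shell
# Sample line: spaces inside the quotes must change, those outside must not.
result=$(printf '%s\n' 'x "a b c" y z' | sed ':loop
/"[^"]* [^"]*"/ {
s/\("[^"]*\) /\1_/
b loop
}')

# Pass 1 gives "a b_c", pass 2 gives "a_b_c", then the address stops matching.
printf '%s\n' "$result"
```

The guarding address matters: without it, the substitution alone would also match from the closing quote to the next space and turn the space after the quoted string into an underscore.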

Again, test extensively. This is a complicated regexp, and errors are easy to make and sometimes difficult to spot. Once you are satisfied with the results, we make one last modification: we drop the "-n" and the "p", so that all the other lines, which we have filtered out so far for clarity, pass through unchanged:

sed '/^[<spc><tab>]*libname/ {
              y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
              :loopstart
              /["'][^"']* [^"']*["']/ {
                     s/\(["'][^"']*\) /\1_/
                     b loopstart
                     }
              }' t2.sas

I hope this helps.

bakunin

deepaksinbox · August 18, 2009, 9:23am

Hi Bakunin,

First up, thanks a TON for the excellent solution and time taken to provide the explanation.

I have tested your solution, it works fine - in principle right now. I am facing a problem with one of the regex below, not sure whether it is related to the awk/platform I use:

This is not working:

 
/["'][^"']* [^"']*["']/

Here is the full output:

 
$ uname -a
SunOS sasuat1 5.10 Generic_138888-08 sun4u sparc SUNW,Sun-Fire-V890
$
$ cat t2.sas
statement1
statement2
libname  yahoo          '/analytics/CODE';
statement 3
libname google  "/analytics/india/CODE/test_DIR" ;
libname msn '/analytics/month end docs for sas/Wcode';
statement 4
$
$ sed '/^[<spc><tab>]*libname/ {
> y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
> :lpstrt
> /["'][^"']* [^"']*["']/ {
']* [^]*["]/: not found
$ sed: command garbled: /["][

Something unmatched ?

Thanks

---------- Post updated at 06:53 PM ---------- Previous update was at 03:14 PM ----------

Additional info:

 
$ which awk
/usr/xpg4/bin/awk
$

bakunin · August 18, 2009, 5:59pm

deepaksinbox:

I have tested your solution, it works fine - in principle right now. I am facing a problem with one of the regex below, not sure whether it is related to the awk/platform I use:

This is not working:
 
/["'][^"']* [^"']*["']/ 

No, it is not platform-related, at least not in a simple manner. The problem is that the whole script is single-quoted, and the single quotes inside the regexp confuse the shell: the first one ends the quoted string prematurely. sed complains because the shell never passes it the script as written. The problem can be reduced to this:

sed 's/something/other/' -> works
sed 's/'/something/' -> will not work

because the shell parsing this sees one quoted string "'s/'", then an unquoted string "/something/", and then the beginning of another quoted string whose closing quote is missing. This leads to an error.

Probably there is a more elegant solution to this, but this is the simplest one which came to me: first replace all the problematic characters with something unlikely to occur in the text, and only at the end swap the quotation marks back in. As you can see, a single quote can be used inside a double-quoted string without problems, but inside a single-quoted string it cannot be escaped.
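
The quoting rules can be tried out on a tiny made-up input. This sketch shows two ways of getting a literal single quote into a sed script: double-quoting the script, or leaving the single-quoted string, adding a backslash-escaped quote, and re-entering it (written '\''). The variable names are only for illustration:

```shell
# 1. Double-quoted script: a single quote needs no escaping here.
a=$(printf '%s\n' "it's" | sed "s/'/@@/")

# 2. Single-quoted script: close the quotes, add an escaped quote, reopen.
b=$(printf '%s\n' "it's" | sed 's/'\''/@@/')

# In both cases sed receives the script s/'/@@/ and prints the same result.
printf '%s\n%s\n' "$a" "$b"
```

The second form gets hard to read quickly when a regexp contains several quotes, which is one reason to prefer placeholders or a script file for longer scripts.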

Here is the new version, which runs perfectly against your test data on my Linux (Ubuntu) system (a double quote temporarily becomes "@@@", a single quote "@@"):

sed -e "s/'/@@/g;s/\"/@@@/g" \
    -e '/^[<spc><tab>]*libname/ {
              y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
              :loopstart
              /@@[^@]* [^@]*@@/ {
                     s/\(@@[^@]*\) /\1_/
                     b loopstart
                     }
              }' \
    -e "s/@@@/\"/g;s/@@/'/g" t2.sas
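
One possible alternative to the placeholders, sketched here as a suggestion rather than something from the posts above, is to keep the sed script in a file and run it with "sed -f", so the shell never has to parse the quotes in the regexp. The scratch file names are hypothetical, the literal tab is produced with printf instead of being typed, and a POSIX shell (for example ksh) is assumed because of the $(...) syntax:

```shell
tab=$(printf '\t')            # literal tab for the bracket expression
script=/tmp/fixlib.$$.sed     # hypothetical scratch file for the sed script
input=/tmp/t2.$$.sas          # copy of the test input from the first post

cat > "$input" <<'EOF'
statement1
statement2
libname  yahoo          '/analytics/CODE';
statement 3
libname google  "/analytics/india/CODE/test_DIR" ;
libname msn '/analytics/month end docs for sas/Wcode';
statement 4
EOF

# Unquoted here-document: ${tab} is expanded, while the quotes and the
# backslashes in front of ( ) and 1 are passed through literally.
cat > "$script" <<EOF
/^[ ${tab}]*libname/ {
y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/
:loop
/["'][^"']* [^"']*["']/ {
s/\(["'][^"']*\) /\1_/
b loop
}
}
EOF

result=$(sed -f "$script" "$input")
printf '%s\n' "$result"
rm -f "$script" "$input"
```

Unlike the placeholder version, this does not depend on "@" being absent from the input, at the cost of one extra file.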

I hope this helps.

bakunin

deepaksinbox · August 19, 2009, 10:33am

Bakunin,

Works perfectly fine now, and solves my problem. Brilliant stuff, thanks a ton again.

Cheers