String matching question

Katkota · October 20, 2007, 3:43pm

Folks;
I need help with this:
I have a text file with a lot of lines; each line is a string consisting of a tree of directories. I would like to ignore any lines starting with "#", then grep for an exact match of a string, then if I find a matching string with a child directory, print it out. The details are below:

The text file looks something like:

/new/tree/xxx/yyy/zzz
#/new/free/opt/yyy
/aaa/bbb/ccc/
/aaa/bbb/ccc/ddd/eee
/aaa/bbb/ccc/ddd

First, I want to ignore any line that starts with "#".

Second, I want to do the following for EACH line, starting with the first:

Look for an exact string matching that line,
then:

If the matching string has any extra children ("directories under the string"), ignore it.
If there is no child directory under the string, print out the string, add the phrase "/hello/every/one" to it, and redirect the output to a new text file.
This process should be repeated for each line in the original text file.

Thanks in advance

bakunin · October 20, 2007, 7:56pm

Usually the recipe for writing good regular expressions is to phrase the problem correctly; most of the time this alone provides the solution.

In your case you were *almost* there already, so this is simple:

First we filter out all the lines starting with "#". This is done with a special regexp device: "^", if used at the beginning of a regexp, means "start of line". That is, "^#" doesn't mean a caret followed by an octothorpe, but an octothorpe as the first character of a line. Here is the script with some sample text; only the line starting with "#" is filtered out:

sed '/^#/d' file > file.changed

this line goes through
# this line is blocked
this line goes through even if it has a # in it

Now for the next problem: match a line with an exact content and print it (to a file). Your problem with the child directories could be stated as "match a line with a content and no additional content". We achieve this by using a similar device as above: the "$" at the end of a regexp means "end of line". That is: "x$" means not "x followed by a dollar sign", but "x as the last character of a line".

By the way, as we are just searching for specific lines and ignoring all the others, we could simply skip filtering out the lines starting with an octothorpe ("#"), as we won't find them anyway. We can simply turn off sed's default output (the -n option) and explicitly print only the lines we find. I left the filter for the comment lines in there, but it is redundant.

Here it is with a sample text; only the last line is printed out:

sed -n '/^#/d;/^this is my text to find$/p' file > file.changed

# this line is blocked by rule 1
this is my text to find but with additional text
this is my text to find

Now for the last part, adding the additional text: we simply change rule 2, which finds and prints the text, into a substitution. Here we use sed's ability to reuse the matched part of the text in the output. The "&" in the replacement stands for whatever the search expression actually matched:

sed -n '/^#/d;s/^this is my text to find$/& with added text/p' file > file.changed

# this line is blocked by rule 1
this is my text to find but with additional text
this is my text to find

The content of file.changed should be a single line "this is my text to find with added text".
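As a quick check, the command above can be run on the sample text without creating any files. This is only a sketch: the input is piped in with printf instead of being read from "file", and the variable name "result" is made up for the illustration.

```shell
# Feed the three sample lines through the same sed program.
result=$(printf '%s\n' \
    '# this line is blocked by rule 1' \
    'this is my text to find but with additional text' \
    'this is my text to find' |
  sed -n '/^#/d;s/^this is my text to find$/& with added text/p')

# Only the exact match survives, with the extra text appended.
printf '%s\n' "$result"
```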

We get back to your problem again: in your text there are slash-characters and as "/" is a part of the sed-syntax too you will have to "escape" it by putting a "\" in front of it: to match "/usr/bin" use the expression "\/usr\/bin".
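If all the backslashes get hard to read, sed also accepts another delimiter: the "s" command takes any character (other than backslash or newline) in place of "/", and an address can be written as "\cREc" with a delimiter character c of your choice. A small sketch, with "|" chosen arbitrarily as the delimiter and made-up sample paths:

```shell
# "|" replaces "/" as the delimiter, so the slashes in the paths need no escaping.
result=$(printf '%s\n' '/usr/bin' '#/usr/bin' '/usr/bin/X11' |
  sed -n '\|^#|d
          s|^/usr/bin$|&/hello/every/one|p')

printf '%s\n' "$result"
```

This only works as long as the chosen delimiter does not itself appear in the paths.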

Furthermore, it is usually a good idea to clear any unnecessary whitespace from a line before matching it. Most of the time we do NOT want leading or trailing blanks, tabs, etc. to get in the way, and "match" and "<tab><blank>match" are quite the same. So I would write it this way ("<spc>" is a literal space, "<tab>" is a TAB character):

sed -n 's/^[<tab><spc>]*//
        s/[<tab><spc>]*$//
        /^#/d
        s/^\/the\/directory\/to\/find$/&\/hello\/every\/one/p' file > file.changed
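For trying this out, here is a runnable sketch of the same four rules applied to a few lines modelled on the sample from the question. Two things are assumptions for the sake of the example: the POSIX class "[[:space:]]" stands in for the literal "<tab><spc>" placeholders, and "/aaa/bbb/ccc/ddd" is picked as the directory to find.

```shell
# One of the input lines carries leading and trailing blanks on purpose.
result=$(printf '%s\n' \
    '/new/tree/xxx/yyy/zzz' \
    '#/new/free/opt/yyy' \
    '  /aaa/bbb/ccc/ddd  ' \
    '/aaa/bbb/ccc/ddd/eee' |
  sed -n 's/^[[:space:]]*//
          s/[[:space:]]*$//
          /^#/d
          s/^\/aaa\/bbb\/ccc\/ddd$/&\/hello\/every\/one/p')

# The longer path /aaa/bbb/ccc/ddd/eee does not match because of the "$" anchor.
printf '%s\n' "$result"
```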

Here is a last tip: when you prepare regexps, test them against short texts. Prepare the most difficult examples you can think of. Notice four kinds of lines and try to provide one in each category:

The ones that are matched and should be matched;
The ones that are matched but shouldn't be matched;
The ones that are not matched but should be matched;
The ones that are not matched and correctly so.
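The four categories can be checked mechanically. The following is only a sketch of one way to do it with grep; the pattern, the sample lines, and the variable names are all invented for the illustration.

```shell
# Hypothetical pattern under test, plus hand-picked sample lines.
pattern='^/aaa/bbb/ccc/ddd$'
should_match='/aaa/bbb/ccc/ddd'
should_not_match='/aaa/bbb/ccc/ddd/eee
#/aaa/bbb/ccc/ddd
/aaa/bbb/ccc'

# Lines that should be matched but are not.
# "|| true" keeps a no-output grep from aborting a script run with "set -e".
missed=$(printf '%s\n' "$should_match" | grep -v -e "$pattern" || true)

# Lines that are matched but should not be.
wrong=$(printf '%s\n' "$should_not_match" | grep -e "$pattern" || true)

printf 'missed: [%s]\nwrong: [%s]\n' "$missed" "$wrong"
```

Both lists come out empty when the pattern behaves as intended; anything printed between the brackets is a line worth looking at.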

bakunin

aigles · October 21, 2007, 1:22pm

A possible solution with sort and awk:

sort katkota.dat | \
awk '

   function print_if_no_child(curr_path) {
      if (match(curr_path, "^" path "/") == 0)
         print path, "/hello/every/one";
      else print path, "directories under the string";
   }

   /^#/         { next }
   path_cnt > 0 { print_if_no_child($0) }
                { gsub(/\/*$/, ""); path = $0 ; path_cnt++ }
   END          { print_if_no_child("") }
'

Input File:

/new/tree/xxx/yyy/zzz
#/new/free/opt/yyy
/aaa/bbb/ccc/
/aaa/bbb/ccc/ddd/eee
/aaa/bbb/ccc/ddd
/aaa/xxx
/aaa/xxx/yyy/zzz
#end of datas

Output:

/aaa/bbb/ccc directories under the string
/aaa/bbb/ccc/ddd directories under the string
/aaa/bbb/ccc/ddd/eee /hello/every/one
/aaa/xxx directories under the string
/aaa/xxx/yyy/zzz /hello/every/one
/new/tree/xxx/yyy/zzz /hello/every/one

Jean-Pierre.

Katkota · October 21, 2007, 5:05pm

Folks;
I very much appreciate your help, but there have been some changes to the requirements (I apologize for the confusion), and I would appreciate some help with them:

Now I need to look at each line (each line consists of a directory tree); then, for each tree, I need to search throughout the file to find the shortest one ("the tree with no children"), append a text phrase to it, and redirect the output to a new text file.
In detail:

Let's say the text file looks like:

/aa/bb/cc/dd/ee
/xxx/yyy/zzz
/aa/bb/cc
/xxx/yyy/zzz/fff/nnn
/aa/bb/cc/dd
/mm/uu/ss/tt/rr
/mm/uu/ss/tt

For this sample, I should take the first line, then find similar trees and keep looking until I find the shortest one, which in this example is "/aa/bb/cc". It has only three directories, while the other two lines in the file have longer paths (one is /aa/bb/cc/dd/ee and the other is /aa/bb/cc/dd).
After I extract the shortest, "/aa/bb/cc", I append a phrase or another folder such as "plus" so it looks like "/aa/bb/cc/plus", then redirect this result to a new text file.
Then I go to the second line and do the same thing.

I hope I explained it well.

Once again, i appreciate the help.

radoulov · October 22, 2007, 4:33am

If I understand correctly and sorting is acceptable:

sort file|awk '!x[$2]++&&$0=$0"/plus"' FS="/">new_text_file

Use nawk or /usr/xpg4/bin/awk on Solaris.
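A longer spelling of the same idea may make it easier to follow. It is only a sketch, not a replacement: the array name "seen" and the use of printf to supply the sample are choices made for this illustration. Note that it keys on the first path component only, so it assumes each top-level directory holds exactly one tree, which is true for the sample in the question.

```shell
# Sample paths from the revised question, sorted so that the shortest
# path of each tree comes first.
result=$(printf '%s\n' \
    /aa/bb/cc/dd/ee /xxx/yyy/zzz /aa/bb/cc /xxx/yyy/zzz/fff/nnn \
    /aa/bb/cc/dd /mm/uu/ss/tt/rr /mm/uu/ss/tt |
  LC_ALL=C sort |
  awk -F/ '
    # $1 is empty because every path starts with "/", so $2 is the
    # top-level directory. Print only the first line seen for each one.
    !seen[$2]++ { print $0 "/plus" }
  ')

printf '%s\n' "$result"
```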

aigles · October 22, 2007, 7:28am

Try and adapt the following script:

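sort katkota.dat | \
awk '
   # Skip comment lines.
   /^#/ { next }
   # Strip trailing slashes and whitespace; the first path seen becomes the root.
   { gsub(/\/*[[:space:]]*$/, ""); if (! root) root=$0 }
   # If the current path is neither the root itself nor below it, the root
   # is complete: print it and start a new root from the current path.
   root { if (match($0 "/", "^" root "/")==0) {
        print root "/file"
        root = $0
     }
   }
   # Print the last root, which no later line has flushed.
   END { if (root) print root "/file" }
'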

Input:

/a
/b
/c
/usr
/new/tree
/new/tree/xxx/yyy/zzz
#/new/free/opt/yyy
/aaa/bbb/ccc/
/aaa/bbb/ccc/ddd/eee
/aaa/bbb/ccc/ddd
/aaa/xxx
/aaa/xxx/yyy/zzz
#end of datas

Output:

k2.sh
/a/file
/aaa/bbb/ccc/file
/aaa/xxx/file
/b/file
/c/file
/new/tree/file
/usr/file

Jean-Pierre.

moe2266 · October 22, 2007, 12:28pm

Thanks a lot.
But Aigles, could you please explain your code to me? I'm a little puzzled by it.