Extract file names from a file

jricks · August 13, 2012, 1:32pm

I'm trying to extract a list of each .cfg file name mentioned in a file. I've made some progress using sed, but I'm still not there. Any help would be appreciated.

My input looks like this:

07:00:00.000  spn  redo       [4, 00:53:00, d:/cfg/apple1.cfg, MARY, d:/cfg/apple2.cfg, d:/cfg/pear.cfg, TRUE, FALSE, TRUE]
08:04:36.200  CMD  OBJ_INIT   [JOHN, d:/cfg/apple3.cfg]
08:04:37.200  CMD  OBJ_INIT   [JOE, d:/cfg/pear2.cfg]
07:53:26.200  CMD  OBJ_INIT   [SUE, d:/cfg/apple4.cfg]
06:27:49.717  CMD  OBJ_INIT   [BOB, d:/cfg/pear3.cfg]
06:12:51.717  CMD  OBJ_INIT   [SAM, d:/cfg/orange.cfg]
06:27:50.717  CMD  OBJ_INIT   [SAM, d:/cfg/orange2.cfg]
06:13:10.017  CMD  OBJ_INIT   [TONY, d:/cfg/grape.cfg]
07:00:00.000  spn  redo       [4, 00:53:00, d:/cfg/apple5.cfg, MARY, d:/cfg/apple.cfg, d:/cfg/pear4.cfg, TRUE, FALSE, TRUE]
08:04:36.200  CMD  OBJ_INIT   [JOHN, d:/cfg/apple6.cfg]

And I've been using

sed 's/^.*\(d:\)/\1/'

which has given me this:

d:/cfg/pear.cfg, TRUE, FALSE, TRUE]
d:/cfg/apple3.cfg]
d:/cfg/pear2.cfg]
d:/cfg/apple4.cfg]
d:/cfg/pear3.cfg]
d:/cfg/orange.cfg]
d:/cfg/orange2.cfg]
d:/cfg/grape.cfg]
d:/cfg/pear4.cfg, TRUE, FALSE, TRUE]
d:/cfg/apple6.cfg]

But I don't get all of the filenames when several are on the same row, and I can't figure out how to drop the text after the .cfg. What I want is this:

d:/cfg/apple1.cfg
d:/cfg/apple2.cfg
d:/cfg/pear.cfg
d:/cfg/apple3.cfg
d:/cfg/pear2.cfg
d:/cfg/apple4.cfg
d:/cfg/pear3.cfg
d:/cfg/orange.cfg
d:/cfg/orange2.cfg
d:/cfg/grape.cfg
d:/cfg/apple5.cfg
d:/cfg/apple.cfg
d:/cfg/pear4.cfg
d:/cfg/apple6.cfg

victorbrca · August 13, 2012, 1:44pm

I'm sure this is not the best way... but it's a way...

sed 's/\.cfg/\.cfg\n/g' | grep '.cfg' | awk -F"/" '{print $3}'

jricks · August 13, 2012, 1:59pm

Well, that got me the first entry of multi-entry lines instead of the last, but I still don't get all entries. It also chopped off the d:/cfg/ part of the string, which I need to keep. Here's the output of

sed 's/\.cfg/\.cfg\n/g' | grep '.cfg' | awk -F"/" '{print $3}'

apple1.cfg, MARY, d:
apple3.cfg]
pear2.cfg]
apple4.cfg]
pear3.cfg]
orange.cfg]
orange2.cfg]
grape.cfg]
apple5.cfg, MARY, d:
apple6.cfg]

victorbrca · August 13, 2012, 2:13pm

This works for me:

sed 's/\.cfg/\.cfg\n/g' | grep '\.cfg' | sed 's/.*\(d:.*\.cfg\)/\1/'

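Depending on the sed you are using (this comes down to the sed implementation rather than the shell), you might have issues with the newline part, because \n in the replacement is a GNU extension that other seds may not honor: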

sed 's/\.cfg/\.cfg\n/g'

Also note that this assumes that "d:" will never change and that all files end with ".cfg".
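If \n in the replacement does not work with your sed, one possible workaround is to let tr do the splitting instead. This is only a sketch: extract_cfg is a made-up helper name, and it additionally assumes that file names never contain a comma, so that splitting on commas leaves at most one path per piece.

```shell
# Hypothetical helper: reads log lines on stdin, prints one d:...cfg path per line.
# Assumptions: paths start with "d:", end with ".cfg", and contain no commas.
extract_cfg() {
    # 1. tr turns every comma into a newline, so each piece holds at most one path.
    # 2. sed keeps only the pieces that contain d:...cfg and strips the text around the path.
    tr ',' '\n' | sed -n 's/.*\(d:.*\.cfg\).*/\1/p'
}

# Two sample lines in the format shown in the question.
printf '%s\n' \
    '07:00:00.000  spn  redo       [4, 00:53:00, d:/cfg/apple1.cfg, MARY, d:/cfg/apple2.cfg, d:/cfg/pear.cfg, TRUE, FALSE, TRUE]' \
    '08:04:36.200  CMD  OBJ_INIT   [JOHN, d:/cfg/apple3.cfg]' |
    extract_cfg
```

On the two sample lines this prints d:/cfg/apple1.cfg, d:/cfg/apple2.cfg, d:/cfg/pear.cfg and d:/cfg/apple3.cfg, one per line. To run it on a real file, something like extract_cfg < inputfile should do.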

RudiC · August 13, 2012, 2:38pm

This is close to victorbrca's proposal, works under the assumptions he mentioned, and may benefit from some polishing:

sed  's/\.cfg/\.cfg\n/g' inputfile |sed -n '/d:.*\.cfg/ s/.*\(d:.*cfg\)/\1/p'
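To see what each half of that pipeline does, it may help to run the two stages separately on one of the sample lines. The sketch below assumes GNU sed (so that \n in the replacement really is a newline); the variable names line, stage1 and stage2 are only for illustration.

```shell
# One multi-path line from the question.
line='07:00:00.000  spn  redo       [4, 00:53:00, d:/cfg/apple1.cfg, MARY, d:/cfg/apple2.cfg, d:/cfg/pear.cfg, TRUE, FALSE, TRUE]'

# Stage 1: break the line after every ".cfg", so no resulting line
# holds more than one path (assumes GNU sed for the \n).
stage1=$(printf '%s\n' "$line" | sed 's/\.cfg/\.cfg\n/g')

# Stage 2: -n suppresses normal output; on lines that contain d:...cfg,
# everything before "d:" is dropped and the result is printed (the p flag).
# The leftover piece ", TRUE, FALSE, TRUE]" has no path and is never printed.
stage2=$(printf '%s\n' "$stage1" | sed -n '/d:.*\.cfg/ s/.*\(d:.*cfg\)/\1/p')

printf '%s\n' "$stage2"
```

Printing "$stage1" instead of "$stage2" shows the intermediate split, which is usually the quickest way to check whether your sed understands \n.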

jricks · August 13, 2012, 2:55pm

Thanks RudiC - I don't know how this works, but it works!!

Don_Cragun · August 13, 2012, 4:04pm

Another approach (using awk instead of sed) that seems to work is:

awk ' { for(i=1; i<= NF; i++) {
                if (sub("[.]cfg.*",".cfg",$i) == 1)
                        print $i
        }
}' file

RudiC · August 14, 2012, 4:04am

For GNU sed version 4.2.1 this works using just one sed command:

sed  -n 's/d:/\n&/g; s/\.cfg/&\n/g; s/^\n*//g; /^d:/ P; D'

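It surrounds each filename (d:...cfg) with <newline> chars and strips any <newline>s from the beginning of the pattern space. If what is left starts with "d:", P prints it up to the first <newline>. D then deletes that first part and restarts the script on the remainder without reading a new input line, so every filename on the line gets its turn.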
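The same script can be spread over several lines with a comment per command, which may make the P/D loop easier to follow. This is just the one-liner above reformatted, still assuming GNU sed; extract_cfg_gnu is a hypothetical wrapper name.

```shell
# Hypothetical wrapper around the one-liner; reads stdin, needs GNU sed.
extract_cfg_gnu() {
    sed -n '
        # Put a newline in front of every "d:" ...
        s/d:/\n&/g
        # ... and another one after every ".cfg".
        s/\.cfg/&\n/g
        # Drop any newlines now sitting at the start of the pattern space.
        s/^\n*//
        # If the pattern space starts with "d:", print it up to the first newline.
        /^d:/P
        # Delete up to the first newline and rerun the script on what is left;
        # once no newline remains, the rest is discarded and the next line is read.
        D
    '
}

printf '%s\n' \
    '07:00:00.000  spn  redo       [4, 00:53:00, d:/cfg/apple5.cfg, MARY, d:/cfg/apple.cfg, d:/cfg/pear4.cfg, TRUE, FALSE, TRUE]' |
    extract_cfg_gnu
```

Because the script is rerun on the remainder, the first two substitutions add extra newlines on each pass; the third command is what keeps those from getting in the way.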

elixir_sinari · August 14, 2012, 4:42am

Using any sed:

sed -n 'h
:again
g
s/\([[:alpha:]]:[^.]*\.cfg\).*/\1/
s/.*\([[:alpha:]]:[^.]*\.cfg\)/\1/p
g
s/[[:alpha:]]:[^.]*\.cfg//
h
t again' infile

And with awk:

awk '{while(match($0,/[[:alpha:]]:[^.]*\.cfg/))
{
 print substr($0,RSTART,RLENGTH)
 sub(/[[:alpha:]]:[^.]*\.cfg/,"")
}
}' infile