(g)awk: how to preserve white space (FS characters) or read the right-hand part of $0?

shri_nath · July 28, 2009, 1:49am

Hi,
I am using gawk (--posix) for extracting some information from something like the following lines (in a text file):

sms_snath_hp_C/CORE BUILD PREREQUISITE:
total 1556
drwxrwxrwx 2 sn sn 4096 2008-06-27 08:31 ./
drwxrwxrwx 13 sn sn 4096 2009-07-22 14:48 ../
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eight spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*

Using regular expressions (pattern matching), I ignore every line except the long-listing entries that are NOT directories.

So I consider only:
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eighteen spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*

The question is: how do I get the file names while preserving the white space inside them?
Note that if the file name has no embedded FS character, it is simply $8 and the problem is solved. If the file name contains runs of FS characters, I do not want to just concatenate $8 FS $9 FS $10 (and so on, in a loop), because I also want the number of FS characters between the words preserved.
(Something like the shell's "read v1 v2 v3 v4 v5 v6 v7 fileName" would do.)

Thanks.

-sn

ryandegreat25 · July 28, 2009, 1:55am

you can try to use cut

echo $line | cut -d' ' -f8-

shri_nath · July 28, 2009, 2:20am

I am sorry the line

-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eighteen spaces.jar*

in the original had multiple spaces in the name of the file (on the html posting here on the forum those got collapsed into single spaces :))

It is something like this (with X standing for a space):
ThisXisXXaXXXfileXXXXnameXXXXwithXXXeighteen spaces.jar*

---------- Post updated at 11:20 PM ---------- Previous update was at 10:58 PM ----------

Hi ryandegreat25

Your quick and correct answer is appreciated. Yes I could use "cut", "read" etc., but all of these are shell (external/internal) commands.

But could we solve this inside the gawk script itself (I mean without calling other shell commands/scripts) ? I already have a gawk script in place that does some other things too. If it cannot be done, then I will have to do a "surgery" on the script and split it into possibly many scripts with "read" or "cut" piped in between.

Thanks.

-sn

ryandegreat25 · July 28, 2009, 2:23am

I see. Well, maybe you could try squeezing the repeated spaces with tr first. I'm sure there are better suggestions out there.

echo $x | tr -s " " | cut -d' ' -f8-

I see, let's wait for replies from others.

---------- Post updated at 02:23 PM ---------- Previous update was at 02:20 PM ----------

can you show us your script?

danmero · July 28, 2009, 2:53am

You know how to use [size] and [color] tags; now you need to learn to use [code] tags when you post sample data, so that your multiple spaces will not "collapse into a single space".

edidataguy · July 28, 2009, 4:34am

All the above commands will work.
Only thing is you will have to quote the echo.
Eg:

echo "$x" | cut -d' ' -f8-
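A small sketch of why the quotes matter (the sample line and the variable names are made up for illustration). Unquoted, the shell splits $x into words and echo joins them again with single spaces; quoted, the spacing is passed through. Note that cut -d' ' treats every single space as a delimiter, so this only works as long as the first seven columns are separated by exactly one space each.

```shell
# Hypothetical sample line; the file name contains runs of spaces.
x='-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file    name with      eight spaces.jar*'

set -f   # turn off globbing so the unquoted "*" below cannot expand
unquoted=$(echo $x | cut -d' ' -f8-)    # word splitting collapses the spaces
set +f
quoted=$(echo "$x" | cut -d' ' -f8-)    # spacing is kept as it was

printf '[%s]\n' "$unquoted" "$quoted"
```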

Have you tried:

ls -l +d
and
ls +d

Newer versions accept them.

Maybe this will also help you:

xx='-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name     with  eighteen    spaces.jar*'
echo "$xx"  | sed 's/^.*[0-9] \(.*\)\*$/\1/'
Output:
This is a file name     with  eighteen    spaces.jar

I have also removed the * for you.

---------- Post updated at 03:34 AM ---------- Previous update was at 03:20 AM ----------

You can get rid of most of your coding with this:

ls -ltr | sed -n '/^-/ s/^.*[0-9] \(.*\)$/\1/p'
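One caveat, shown as a sketch on a made-up entry: the leading .* is greedy, so the match runs up to the last "digit followed by a space" on the line. If the file name itself contains such a pair, part of the name is cut off. Counting the seven leading fields instead avoids that; the second sed command below is only an illustration and assumes the columns are separated by spaces, not tabs.

```shell
# Hypothetical entry whose name contains a digit followed by a space ("v2 ").
entry='-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 report v2 final   copy.jar'

# Greedy ".*" runs up to the LAST "digit + space", so the start of the name is lost.
greedy=$(printf '%s\n' "$entry" | sed -n '/^-/ s/^.*[0-9] \(.*\)$/\1/p')

# Removing six fields with their separators, then the 7th field plus one space,
# keeps the whole name and its inner spacing.
counted=$(printf '%s\n' "$entry" | sed -n '/^-/ s/^\([^ ]\{1,\} \{1,\}\)\{6\}[^ ]\{1,\} //p')

printf '[%s]\n' "$greedy" "$counted"
```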

radoulov · July 28, 2009, 4:47am

You know that the filename will be after the 7th field,
so you could do something like this:

gawk --posix 'NR > 2 && !/\/$/ {
  sub(/([^ \t]+[ \t]+){7}/,"")
  print
  }' infile

Which produces:

zsh-4.3.10[t]% cat infile
sms_snath_hp_C/CORE BUILD PREREQUISITE:
total 1556
drwxrwxrwx 2 sn sn 4096 2008-06-27 08:31 ./
drwxrwxrwx 13 sn sn 4096 2009-07-22 14:48 ../
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file    name with      eight spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*
zsh-4.3.10[t]% gawk --posix 'NR > 2 && !/\/$/ {
  sub(/([^ \t]+[ \t]+){7}/,"")
  print
  }' infile
This is a file name with seven spaces.jar*
This is a file    name with      eight spaces.jar*
ibmjcefw.jar*
ibmjceprovider.jar*

If you don't want to modify the current record you can save it in a variable and then manipulate the saved record:

gawk --posix 'END {
  # print the filenames
  while (++i <= c) print fn[i]
  }
{
  # build an array to hold the filenames
  if (NR > 2 && !/\/$/) {
    rec = $0; sub(/([^ \t]+[ \t]+){7}/,"", rec)
    fn[++c] = rec
    }
  }' infile
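For an awk where interval expressions such as {7} are not available, the same idea can be sketched with a loop around match(). This is only an illustration: rest_after is a made-up helper name and the listing is sample data in the same layout as infile. It removes exactly one separator after the seventh field, so a name that starts with white space keeps it, and $0 is left unmodified.

```shell
# Sample listing (made-up data, same layout as infile above).
listing='total 1556
drwxrwxrwx 2 sn sn 4096 2008-06-27 08:31 ./
-rwxrwxrwx 1 sn sn  22395 2007-05-11 08:37 This is a file    name with      eight spaces.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*'

names=$(printf '%s\n' "$listing" | awk '
# rest_after(s, n) is a hypothetical helper: it drops the first n
# whitespace-separated fields of s plus ONE separator character after the
# n-th field, and returns what is left with its spacing untouched.
function rest_after(s, n,    i) {
  for (i = 1; i <= n; i++) {
    if (!match(s, /^[ \t]*[^ \t]+[ \t]/)) return ""   # fewer than n fields
    s = substr(s, RLENGTH + 1)
  }
  return s
}
/^-/ { print rest_after($0, 7) }   # regular files only
')

printf '%s\n' "$names"
```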

thanhdat · July 28, 2009, 5:31am

Maybe something like this could work for you:

ls -ltr | awk 'BEGIN{FS=" [[:digit:]][[:digit:]]:[[:digit:]][[:digit:]] "}{print $2}'

radoulov · July 28, 2009, 5:37am

Assuming the filenames do not contain the pattern you use as a field separator ...

thanhdat · July 28, 2009, 6:02am

Yeah, you're right, the filename must not contain the FS.
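To make the caveat concrete, here is a sketch with a made-up entry whose name contains something that looks like a time. The FS is written with [0-9] instead of [[:digit:]] purely for illustration. Since the name is split too, $2 holds only its first part; checking NF > 2 is one way to spot such lines.

```shell
# Hypothetical entry whose file name itself contains " 12:30 ".
entry='-rw-r--r-- 1 sn sn 10 2009-07-28 05:31 backup 12:30 copy.txt'

# With the time as FS the name is split as well, so $2 is only its first part.
second=$(printf '%s\n' "$entry" | awk 'BEGIN { FS = " [0-9][0-9]:[0-9][0-9] " } { print $2 }')
count=$(printf '%s\n' "$entry" | awk 'BEGIN { FS = " [0-9][0-9]:[0-9][0-9] " } { print NF }')

printf 'second field: [%s], fields: %s\n' "$second" "$count"
```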

rubin · July 28, 2009, 7:15am

Alternatively, you can rely on the fact that the timestamp is the first HH:MM pattern on the line, comes right before the filename, and has a fixed length:

awk '/^-[rwx-]/{ print substr($0,match($0,/[0-2][0-9]:[0-5][0-9]/)+6) }' file
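The + 6 comes from the length of the time (5 characters) plus the single space after it. An equivalent spelling, sketched here on a made-up entry, puts the space into the pattern and lets RSTART and RLENGTH (both set by match()) supply the offset. Because match() finds the leftmost occurrence, a later HH:MM inside the name does no harm.

```shell
# Hypothetical entry; the name contains a run of spaces.
entry='-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file    name.jar*'

name=$(printf '%s\n' "$entry" | awk '
/^-/ && match($0, /[0-2][0-9]:[0-5][0-9] /) {
  # RSTART is where "HH:MM " begins, RLENGTH is its length (6),
  # so the file name starts at RSTART + RLENGTH.
  print substr($0, RSTART + RLENGTH)
}')

printf '[%s]\n' "$name"
```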

shri_nath · July 28, 2009, 11:44am

Hi radoulov,

This is the best answer. I had arrived at the same pattern myself, albeit written out separately for each of the first seven fields. Your answer is even better. I am going to change it a little bit as follows:

     gsub(/^([^ \t]+[ \t]+){6}[^ \t]+ /, "", fileName);

for the obvious reason that fileName itself could start with white space!
Could you also suggest a pattern that removes the very last character when it is one of these six (so zero or one character is removed): / > * | @ =

I tried

    gsub(/^([^ \t]+[ \t]+){6}[^ \t]+ (.+)[\/\>\*\|\@\=]{0,1}$/, "\\1", fileName);

but that did not work, so I am first removing the leftmost seven fields as above and then using a second gsub to remove the last character (zero or one of those listed).
Thanks.

-sn
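One likely reason the single gsub did not work: in awk, sub() and gsub() have no backreferences, so "\\1" is not replaced by the parenthesized group. The only special sequence in the replacement text is & (the whole match). A small sketch with made-up names shows & at work, and that removing "zero or one" trailing indicator needs nothing more than an anchored sub() with an empty replacement.

```shell
name='ibmjcefw.jar*'

# "&" stands for the whole match; sub()/gsub() have no "\\1".
marked=$(printf '%s\n' "$name" | awk '{ sub(/[\/>*|@=]$/, "<&>"); print }')

# An anchored sub() with an empty replacement removes the indicator if present
# and leaves names without one as they are.
stripped=$(printf '%s\n%s\n' "$name" 'plain.txt' | awk '{ sub(/[\/>*|@=]$/, ""); print }')

printf '%s\n' "$marked" "$stripped"
```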

edidataguy · July 28, 2009, 2:05pm

Did you test this code? Does it not work for you?
It even takes care of the last bit, the *.

ls -ltr | sed -n '/^-/ s/^.*[0-9] \(.*\)$/\1/p'

summer_cherry · July 28, 2009, 10:48pm

I think Perl can help you here:

while (<DATA>) {
	# Split on white space into at most 8 fields; the 8th field keeps the
	# rest of the line (the file name) with its inner spacing intact.
	my @tmp = split(" ", $_, 8);
	print $tmp[7];
}
__DATA__
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eighteen spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*

radoulov · July 29, 2009, 4:35am

shri_nath:

[...]
Could you suggest me a pattern that would also get rid of the very last (one or zero) characters from these: />*|@= ?

I tried
   gsub(/^([^ \t]+[ \t]+){6}[^ \t]+ (.+)[\/\>\*\|\@\=]{0,1}$/, "\\1", fileName);
but that did not work, so I am first removing the leftmost seven fields as above and then using a second gsub to remove the last character (zero or one of those listed).
[...]

Well,
if that were "exactly one" or "one or more" rather than "zero or one" (with "zero or one", the greedy (.*) would swallow the trailing character), you could use the GNU awk extension gensub and write something like this:

gawk --re-interval 'NR > 2 && !/\/$/ {
  print gensub(/([^ \t]+[ \t]+){6}[^ \t]+ (.*)[\/>*|@=]$/, "\\2", 1)
   }' infile

Notice that I'm using the --re-interval option: gensub is a GNU extension and is disabled in compatibility (--posix) mode, but we still need the interval expressions.

In your case I would do it in two steps:

gawk --posix 'NR > 2 && !/\/$/ {
  sub(/([^ \t]+[ \t]+){6}[^ \t]+ /, "")
  sub(/[\/>*|@=]$/, "")
  print
   }' infile