(g)awk: how to preserve white space (FS characters) or read a right subpart of $0?

I am using gawk (--posix) for extracting some information from something like the following lines (in a text file):

total 1556
drwxrwxrwx 2 sn sn 4096 2008-06-27 08:31 ./
drwxrwxrwx 13 sn sn 4096 2009-07-22 14:48 ../
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eight spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*

Using regular expressions (pattern matching), I ignore every line except the long-listing entries that are NOT directories.

So I consider only:
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eighteen spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*

The question is: how do I get the file names while preserving the white space inside them?
Note that if the file name has no embedded FS character, then it is just $8 and the problem is over. If the file name has multiple embedded FS characters, I do not want to just concatenate $8 FS $9 FS $10 (etc. in a loop), because I also want the multiplicity of the FS characters preserved.
(Something like the shell's "read v1 v2 v3 v4 v5 v6 v7 fileName" would do.)



You can try to use cut:

echo $line | cut -d' ' -f8-

I am sorry, the line

-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eighteen spaces.jar*

in the original had multiple spaces in the name of the file (on the html posting here on the forum those got collapsed into single spaces :))

It is something like:
ThisXisXXaXXXfileXXXXnameXXXXwithXXXeighteen spaces.jar*

---------- Post updated at 11:20 PM ---------- Previous update was at 10:58 PM ----------

Hi ryandegreat25

Your quick and correct answer is appreciated. Yes I could use "cut", "read" etc., but all of these are shell (external/internal) commands.

But could we solve this inside the gawk script itself (I mean without calling other shell commands/scripts) ? I already have a gawk script in place that does some other things too. If it cannot be done, then I will have to do a "surgery" on the script and split it into possibly many scripts with "read" or "cut" piped in between.



I see. Well, maybe you could try squeezing the runs of spaces with tr -s. I'm sure there are better suggestions out there.

echo $x | tr -s " " | cut -d' ' -f8-

I see, let's wait for replies from others :)

---------- Post updated at 02:23 PM ---------- Previous update was at 02:20 PM ----------

can you show us your script?

You know how to use [size] and [color] tags; now you have to learn to use [code] tags when you post sample data. That way your multiple spaces will not "collapse into a single space" ;)

All the above commands will work.
The only thing is that you will have to quote the variable passed to echo:

echo "$x" | cut -d' ' -f8-
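To see why the quotes matter, here is a small sketch that assumes a POSIX sh or bash (zsh does not word-split unquoted variables by default). The sample line and the variable names are made up for illustration.

```shell
# Hypothetical sample line; the runs of spaces inside the name are deliberate.
line='-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is  a   file.jar*'

# Unquoted: the shell splits $line into words and echo rejoins them with
# single spaces, so the spacing is lost before cut ever sees it.
# set -f keeps the trailing * from being glob-expanded in this demo.
unquoted=$(set -f; echo $line | cut -d' ' -f8-)

# Quoted: the line reaches cut unchanged, and cut keeps the empty fields
# between consecutive spaces when it prints fields 8 onward.
quoted=$(echo "$line" | cut -d' ' -f8-)

printf '[%s]\n[%s]\n' "$unquoted" "$quoted"
```

Note that cut counts every single space as a delimiter, so this only works while the first seven columns are separated by exactly one space each.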

Have you tried:

ls -l +d
ls +d

Newer versions accept them.

Maybe this will also help you:

xx='-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name     with  eighteen    spaces.jar*'
echo "$xx"  | sed 's/^.*[0-9] \(.*\)\*$/\1/'
This is a file name     with  eighteen    spaces.jar

I have removed the * also for you.

---------- Post updated at 03:34 AM ---------- Previous update was at 03:20 AM ----------

You can get rid of most of your coding with this:

ls -ltr | sed -n '/^-/ s/^.*[0-9] \(.*\)$/\1/p'

You know that the filename will be after the 7th field,
so you could do something like this:

gawk --posix 'NR > 2 && !/\/$/ {
  sub(/([^ \t]+[ \t]+){7}/,"")
  print
  }' infile

Which produces:

zsh-4.3.10[t]% cat infile
total 1556
drwxrwxrwx 2 sn sn 4096 2008-06-27 08:31 ./
drwxrwxrwx 13 sn sn 4096 2009-07-22 14:48 ../
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file    name with      eight spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*
zsh-4.3.10[t]% gawk --posix 'NR > 2 && !/\/$/ {
  sub(/([^ \t]+[ \t]+){7}/,"")
  print
  }' infile
This is a file name with seven spaces.jar*
This is a file    name with      eight spaces.jar*
ibmjcefw.jar*
ibmjceprovider.jar*

If you don't want to modify the current record you can save it in a variable and then manipulate the saved record:

gawk --posix 'NR > 2 && !/\/$/ {
  # build an array to hold the filenames
  rec = $0; sub(/([^ \t]+[ \t]+){7}/,"", rec)
  fn[++c] = rec
  }
END {
  # print the filenames
  while (++i <= c) print fn[i]
  }' infile
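The same "work on a copy" idea can also be sketched as an awk function. Scalars are passed to awk functions by value, so $0 stays untouched. This version deliberately avoids interval expressions such as {7}, on the assumption that it may have to run on an awk without them; strip_fields and rest are hypothetical helper names, not something from the posts above.

```shell
# strip_fields N: print each input line with its first N whitespace-separated
# fields removed. Only ONE separator character after the Nth field is dropped,
# so any further leading blanks of the remainder are preserved.
strip_fields() {
  awk -v n="$1" '
    function rest(rec, count,    i) {
      for (i = 1; i <= count; i++) {
        sub(/^[ \t]*/, "", rec)          # separators in front of field i
        if (!match(rec, /^[^ \t]+/))     # fewer than count fields: nothing left
          return ""
        rec = substr(rec, RLENGTH + 1)   # drop field i itself
      }
      return substr(rec, 2)              # drop the single separator after it
    }
    { print rest($0, n) }
  '
}
```

It could be used as `strip_fields 7 < infile`, after filtering out the lines that are not regular files.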

Maybe something like this could work for you:

ls -ltr | awk 'BEGIN{FS=" [[:digit:]][[:digit:]]:[[:digit:]][[:digit:]] "}{print $2}'

Assuming the filenames do not contain the pattern you use as a field separator ...

Yeah, you're right, the filename must not contain the FS.

Or, making use of the fact that the timestamp is the first such pattern on the line before the file name and that its length is fixed, another alternative would be:

awk '/^-[rwx-]/{ print substr($0,match($0,/[0-2][0-9]:[0-5][0-9]/)+6) }' file
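The +6 above is the five characters of HH:MM plus the one space that follows. A variant that lets awk compute the offset from RSTART and RLENGTH might look like the sketch below; name_after_time is a made-up wrapper name, and the approach still assumes the listing shows a HH:MM time and that nothing earlier on the line looks like one.

```shell
# Print everything after the first "HH:MM " on lines that describe regular files.
name_after_time() {
  awk '/^-/ && match($0, /[0-2][0-9]:[0-5][0-9] /) {
    # match() sets RSTART (where the match begins) and RLENGTH (its length,
    # 6 here), so RSTART + RLENGTH is the first character of the file name.
    print substr($0, RSTART + RLENGTH)
  }'
}
```

Usage would be along the lines of `name_after_time < infile`.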

Hi radoulov,

This is the best answer. I had arrived at a similar pattern myself, albeit written out separately for each of the first seven fields. Your answer is even better. I am going to change it a little bit as follows:

     gsub(/^([^ \t]+[ \t]+){6}[^ \t]+ /, "", fileName);

for the obvious reason that fileName itself could start with white space!
Could you suggest a pattern that would also remove the very last character (zero or one occurrence) if it is one of these: />*|@= ?

I tried

    gsub(/^([^ \t]+[ \t]+){6}[^ \t]+ (.+)[\/\>\*\|\@\=]{0,1}$/, "\\1", fileName);

but that did not work, so I first remove the leftmost seven fields as above and then use a second gsub to remove the (zero or one) trailing character.


Did you test this code? Does it not work for you?
It even takes care of the last bit, the *.

ls -ltr | sed -n '/^-/ s/^.*[0-9] \(.*\)$/\1/p'

I think Perl can help you here:

use strict;
use warnings;
while (<DATA>) {
	# A limit of 8 keeps everything after the 7th field, spaces included, in $tmp[7].
	my @tmp = split(" ", $_, 8);
	print $tmp[7];
}
__DATA__
-rwxrwxrwx 1 sn sn 15348 2007-05-11 08:37 This is a file name with seven spaces.jar*
-rwxrwxrwx 1 sn sn 22395 2007-05-11 08:37 This is a file name with eighteen spaces.jar*
-rwxrwxrwx 1 sn sn 73687 2007-05-11 08:37 ibmjcefw.jar*
-rwxrwxrwx 1 sn sn 767101 2007-05-11 08:37 ibmjceprovider.jar*

If that were "exactly one" or "one or more" rather than "zero or one", you could use the GNU awk extension gensub and write something like this (with "zero or one", the greedy (.*) would swallow the trailing character itself):

gawk --re-interval 'NR > 2 && !/\/$/ {
  print gensub(/([^ \t]+[ \t]+){6}[^ \t]+ (.*)[\/>*|@=]$/, "\\2", 1)
   }' infile

Notice that I'm using the re-interval option, because the gensub extension is disabled in compatibility (posix) mode and we still need the re-interval functionality.

In your case I would do it in two steps:

gawk --posix 'NR > 2 && !/\/$/ {
  sub(/([^ \t]+[ \t]+){6}[^ \t]+ /, "")
  sub(/[\/>*|@=]$/, "")
  print
   }' infile