Problem piping find output to awk, 1st line filename is truncated, other lines are fine.

gencon · August 25, 2013, 11:21am

Today I needed to look through a lot of large backup files, so I wrote the following line to find them, order them by size, and print the file sizes in GB along with the filenames. What happened was odd: the output was all as expected except for the first line, which had the filename heavily truncated. I thought the problem might be with that particular filename, so I reversed the sort order. Again the first filename was heavily truncated, this time for a different file that had been listed correctly before I changed the ordering. [Note: I used '**' as a field separator for awk; none of the filenames contain the string '**'.]

find . -type f -size +500M -printf '%s**%p\n' | sort -n | awk 'FS="**" {gb=$1/(2^30); printf("%f GB\t%s\n", gb, $2)}'

So I wrote the script below to create some dirs and files of different lengths, and simplified the command line. Please have a look at what happens with the different find commands below. Can someone explain why the first line always has the filename truncated? I can't work out why. Thanks.

#!/bin/bash

mkdir "Test Dir 1"
echo "Test File 1 extra chars so diff file lengths" > "Test Dir 1/Test File 1"

mkdir "Test Dir 2"
echo "Test File 2 fewer extra chars" > "Test Dir 2/Test File 2"

mkdir "Test Dir 3"
echo "Test File 3 even fewer" > "Test Dir 3/Test File 3"

mkdir "Test Dir 4"
echo "Test File 4 a few" > "Test Dir 4/Test File 4"

# Test 1 - no piping - All OK:

$ find . -type f -printf '%s**%p\n'
23**./Test Dir 3/Test File 3
30**./Test Dir 2/Test File 2
18**./Test Dir 4/Test File 4
45**./Test Dir 1/Test File 1


# Test 2 - pipe to sort - All OK:

$ find . -type f -printf '%s**%p\n' | sort -n
18**./Test Dir 4/Test File 4
23**./Test Dir 3/Test File 3
30**./Test Dir 2/Test File 2
45**./Test Dir 1/Test File 1


# Test 3 - pipe to awk - First line filename truncated:

$ find . -type f -printf '%s**%p\n' | awk 'FS="**" {printf("%d \t%s\n", $1, $2)}'
23     Dir
30     ./Test Dir 2/Test File 2
18     ./Test Dir 4/Test File 4
45     ./Test Dir 1/Test File 1


# Test 4 - pipe to sort, then to awk - First line filename truncated:

$ find . -type f -printf '%s**%p\n' | sort -n | awk 'FS="**" {printf("%d \t%s\n", $1, $2)}'
18     Dir
23     ./Test Dir 3/Test File 3
30     ./Test Dir 2/Test File 2
45     ./Test Dir 1/Test File 1


# Test 5 - pipe to reverse sort, then to awk - First line filename truncated:

$ find . -type f -printf '%s**%p\n' | sort -nr | awk 'FS="**" {printf("%d \t%s\n", $1, $2)}'
45     Dir
30     ./Test Dir 2/Test File 2
23     ./Test Dir 3/Test File 3
18     ./Test Dir 4/Test File 4

Thanks all.

Scrutinizer · August 25, 2013, 12:03pm

The input field separator in awk needs to be set before the first line is read. It is also probably a good idea to remove the special meaning of the asterisks. Try:

awk -F'[*][*]' '{printf ....
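
To illustrate why only the first line is affected: in a program of the form 'FS="..." {...}' the assignment sits in the pattern position, so it is evaluated once per record, after that record has already been read and split. In POSIX awk a new FS value applies from the next record onwards, so record 1 is still split on whitespace and $2 is the second whitespace-separated word ("Dir"). The following is only a minimal sketch; the sample lines are copied from Test 1, printf stands in for find, and the variable names are made up for the example. Some awk implementations may behave differently.

```shell
# Sample input standing in for the find output (two lines from Test 1).
input='23**./Test Dir 3/Test File 3
30**./Test Dir 2/Test File 2'

# FS is assigned in the pattern position, i.e. after record 1 has already
# been split on whitespace, so $2 of record 1 is just the second word.
first=$(printf '%s\n' "$input" | awk 'FS="[*][*]" {print $2}' | head -n 1)

# With -F, FS is set before any input is read, so record 1 is split
# on the two asterisks like every other record.
fixed=$(printf '%s\n' "$input" | awk -F'[*][*]' '{print $2}' | head -n 1)

printf 'pattern assignment: %s\n' "$first"
printf '-F option:          %s\n' "$fixed"
```

The second and later records come out the same in both variants, which matches the output seen in Tests 3 to 5.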

gencon · August 25, 2013, 12:44pm

Thanks Scrutinizer, that works.

I thought that anything before the awk {} was before the first line. Oops, so obvious once you know.

To anyone who's interested, this also works:

$ find . -type f -printf '%s**%p\n' | sort -nr | awk 'BEGIN {FS="**"} {printf("%d \t%s\n", $1, $2)}'
45     ./Test Dir 1/Test File 1
30     ./Test Dir 2/Test File 2
23     ./Test Dir 3/Test File 3
18     ./Test Dir 4/Test File 4

as does...

$ find . -type f -printf '%s**%p\n' | sort -nr | awk 'BEGIN {FS="[*][*]"} {printf("%d \t%s\n", $1, $2)}'

Scrutinizer · August 25, 2013, 1:07pm

Glad it helps

Note: FS="**" only works in some awks, where it may happen to mean "zero or more asterisks".

However, this is not defined behaviour.

So it would be best to either use:

FS="[*]*"    # for zero or more asterisks

or

FS="[*][*]"  # for exactly two asterisks...
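
As a small illustration of the "exactly two asterisks" form: inside a bracket expression the asterisk is literal, so "[*][*]" only splits where two asterisks appear together. The sample line below is hypothetical (a file name that itself contains one asterisk) and is not taken from the test script above.

```shell
# Hypothetical sample line: the file name itself contains a single asterisk.
line='10**./a*b'

# "[*][*]" matches exactly two literal asterisks, so the lone "*" in the
# name is left alone and the record still has two fields.
name=$(printf '%s\n' "$line" | awk -F'[*][*]' '{print $2}')
count=$(printf '%s\n' "$line" | awk -F'[*][*]' '{print NF}')

printf 'fields: %s, name: %s\n' "$count" "$name"
```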

Regular Expressions: ERE Special Characters

gencon · August 25, 2013, 1:14pm

Thanks again.

I'll take your advice and use that in future. I've used '**' as a field separator with (my system's) awk in the past, so I knew it worked. It's clearly a bad habit, and I will stop.
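
Putting the advice from this thread together, the original size-listing command could be written roughly as follows. This is a sketch rather than a tested replacement: the function name format_sizes and the sample line are invented for the example, and find -printf is assumed to be the GNU find extension used in the original command.

```shell
# Hypothetical helper: reads "size**path" lines on stdin and prints the size
# in GB followed by the path. FS is set in BEGIN, before any input is read,
# and uses bracket expressions so the asterisks are taken literally.
format_sizes() {
    awk 'BEGIN {FS="[*][*]"} {gb=$1/(2^30); printf("%f GB\t%s\n", gb, $2)}'
}

# Made-up sample line (exactly 2^30 bytes) standing in for the find output.
printf '1073741824**./big one.bak\n' | format_sizes
```

With GNU find, the full pipeline would then be: find . -type f -size +500M -printf '%s**%p\n' | sort -n | format_sizes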