sed newbie scripting assistance

Howdy folks,
I'm trying to craft a log-file summarisation tool for an application that creates a lot of duplicate entries that differ only in a suffix indicating the point of execution. I thought I'd gotten close, but I'm clearly missing something.

Here's a genericized version:
A text file (infile_grocery.txt) with these contents:

milk skim fruit apple banana
milk skim fruit orange
milk skim fruit mango
milk skim fruit pomegranate
milk 2 percent fruit cherry tomato
milk 2 percent fruit peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple

What I'm hoping to get is:

milk skim fruit apple banana, orange, mango, pomegranate
milk 2 percent fruit cherry tomato, peach
milk whole fruit pineapple
milk skim fruit strawberry raspberry, strawberry rhubarb
milk whole fruit pineapple

The command line I've cooked up is:

sed -rn "{H;x;s|^(.+) fruit ([^\n]+)\n(.*)\1 fruit (.+)$|\1 fruit \2, \4|;x}; ${x;s/^\n//;p}" infile_grocery.txt

But the results I'm getting are:

milk skim fruit apple banana, mango, strawberry raspberry
milk skim fruit strawberry rhubarb
milk whole fruit pineapple

Clearly I'm skipping chunks of lines somehow but I've been staring at this too long and I can't see it. Anyone have any suggestions for me?
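
I think I can see why chunks vanish. In GNU sed the dot matches an embedded newline in the pattern space, so the greedy (.*) in the middle of your expression can swallow any number of accumulated lines, and since \3 never appears in the replacement those lines are silently discarded. It also means the hold-space scheme happily pairs up non-adjacent records, which is where "apple banana, mango, strawberry raspberry" comes from. A minimal demonstration of the newline-crossing (just two dummy lines, nothing from your data):

printf 'a\nb\n' | sed -n 'N;s/^a.b$/the dot matched the newline/p'

With GNU sed that prints the message, because the lone dot consumed the newline that N embedded.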

If you aren't absolutely set on using sed, I think it's easier with awk:

awk '
    # print the cached prefix and its accumulated item list
    function printlist()
    {
        sub( ", $", "", list );                 # trim the trailing ", "
        printf( "%sfruit %s\n", last, list );   # last already ends with a space
        list = "";
    }

    {
        x = $0;
        sub( "fruit.*", "", x );      # x = everything up to "fruit"
        gsub( ".*fruit ", "", $0 );   # $0 = everything after "fruit "
        if( list && x != last )       # prefix changed: flush the cached record
            printlist();
        list = list $0 ", ";
        last = x;
    }
    END {
        if( list )                    # flush whatever is still cached
            printlist();
    }
' input-file >output-file

Won't this awk solution have a problem with very large files? To my eye, it looks like it's trying to load the whole thing into memory first...

If that's the case, couldn't that be a problem? The log files have already gotten to 500 MB in half a day (which is partly why I'm looking to summarise the duplicate content).

Personally, I was leaning more toward sed just because it's a lighter-weight install for the PC platform and can hopefully be invoked as a single command line (I'm aiming to shell out of notepad++, modify the buffer, and reload).

It's not trying to load the whole file into memory. It caches one copy of the first bits of the current line (everything up to "fruit") and the list of 'items' that follow. When the first bits change, the record, with its summary of items, is written out and the caching starts anew (list = ""). So the only significant amount of "stuff" ever held in memory is the list of items.

Now, if that list is huge, then the programme could be amended to write the items out as it finds them. My assumption was that the list wasn't going to be more than 1 or 2 K.
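
And since you mention wanting to stay with sed for the single command line: the trick is to merge only adjacent lines, appending the next line with N and folding it into the current one whenever the part before "fruit" repeats. A sketch (GNU sed; it assumes "fruit" appears exactly once per line):

sed -r '
:a
$!N
# fold the next line into this one if the text before "fruit" repeats
s|^(.* fruit )(.*)\n\1|\1\2, |
ta
# prefix changed (or end of input): print the merged line, then
# restart the cycle with the leftover line
P
D' infile_grocery.txt

On your sample that produces the five summary lines you listed, and since it never holds more than two lines at a time it should cope with 500 MB files.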

---------- Post updated at 21:55 ---------- Previous update was at 21:42 ----------

Turns out that printing the list as we go makes for a simpler programme; I just didn't see it that way the other night.

awk '
    {
        x = $0;
        sub( "fruit.*", "", x );      # x = everything up to "fruit"
        gsub( ".*fruit ", "", $0 );   # $0 = everything after "fruit "
        if( x != last )    # first bits differ: print newline (if needed) and the current line
            printf( "%s%sfruit %s", NR > 1 ? "\n" : "", x, $0 );
        else               # first bits are the same: print just what follows "fruit"
            printf( "%s%s", NR > 1 ? ", " : "", $0 );
        last = x;
    }
    END { printf( "\n" ); }    # must have a final newline
' input-file
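
One practical note for the notepad++ round trip: standard awk has no in-place editing (gawk has an -i inplace extension), so the usual pattern is to write to a temporary file and rename it over the original, along these lines (file names are just placeholders):

awk -f summarise.awk app.log > app.log.tmp && mv app.log.tmp app.log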