awk reading from named pipe (fifo)

I'm trying to read a fifo using awk and coming across some problems. I'm writing to the fifo from multiple processes invoked by GNU Parallel:

mkfifo my_fifo
awk '{ a[$1] = a[$1] + $2 } END { for (i in a) print i, a[i] }' my_fifo | sort -nk1 > sorted_output &
grep -v '^@' massive_file | parallel --max-procs 16 --pipe -N 2500 a_program -h my_fifo > stdout 2> stderr

Reading from the fifo appears to stop prematurely. If I execute the awk command again, the command writing to the fifo appears to continue. Is this an issue with awk reading from fifos, or with the fact that I'm writing to the fifo from multiple processes invoked through GNU Parallel?

Cheers,
Nathan

The problem seems to be what happens when the program reading from the pipe stops getting data: a fifo reader sees end-of-file as soon as the last writer closes its end, so awk finishes after the first writer closes, even though later processes will open the fifo and write again.
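A minimal demonstration of that EOF behaviour (demo_fifo is just an illustrative name):

mkfifo demo_fifo
awk '{ print "got:", $0 }' demo_fifo &  # reader blocks until a writer opens the fifo
echo one > demo_fifo                    # writer opens, writes, closes: awk prints "got: one" and exits
wait                                    # awk is already gone; a second echo > demo_fifo would block forever
rm demo_fifo

I am not sure how to get around this in awk, but using Perl you could try the following. Create a Perl script "namedpipe.pl":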

#!/usr/bin/perl
use strict;
use warnings;

my $tmax = 5;  # give up after 5 seconds without new data from the pipe
my %a;
open(my $in, '<', 'my_fifo') or die "cannot open my_fifo: $!";  # blocks until a writer opens the fifo
my $tstart = time();
LOOP: while (1) {
  $_ = <$in>;  # returns undef at EOF, i.e. whenever no writer currently has the fifo open
  if (defined $_ && length $_) {
    # line contains something, process it
    chomp;
    my @f = split;
    $a{$f[0]} += $f[1];
    $tstart = time();  # reset the timeout clock
  } else {
    select(undef, undef, undef, 0.1);  # brief pause so we do not busy-wait at EOF
  }
  last LOOP if time() - $tstart > $tmax;  # leave the loop on timeout
}
# final print, numerically sorted by key (the equivalent of sort -nk1)
foreach my $key (sort { $a <=> $b } keys %a) {
  print "$key $a{$key}\n";
}

and then use it like:

mkfifo my_fifo
perl namedpipe.pl > sorted_output &
grep -v '^@' massive_file | parallel --max-procs 16 --pipe -N 2500 a_program -h my_fifo > stdout 2> stderr; wait

If you cannot accept that a few lines may be mixed together, then you need to avoid race conditions like this one:

mkfifo fifo
(echo program1_line1; sleep 2; echo program1_line2) >fifo &
(echo program2_line1; sleep 1; echo program2_line2) >fifo &
cat fifo 

In this small example the lines happen not to mix, but there is no guarantee of that.
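If you need each writer's lines kept together, one option is to serialize the writers with a lock; here is a sketch using flock(1) from util-linux (lockfile is just an illustrative name):

mkfifo fifo
(flock 9; echo program1_line1; sleep 2; echo program1_line2) 9> lockfile > fifo &
(flock 9; echo program2_line1; sleep 1; echo program2_line2) 9> lockfile > fifo &
cat fifo
rm fifo lockfile

Whichever subshell takes the lock first writes both of its lines before the other one starts, so the groups can no longer interleave.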

GNU Parallel guarantees that the output from different jobs will never be mixed up, but that requires that you can get the output onto stdout:

grep -v '^@' massive_file | parallel --max-procs 16 --pipe -N 2500 a_program -h --no-debug-on-stdout - 2> stderr |
  awk '{ a[$1] = a[$1] + $2 } END { for (i in a) print i, a[i] }' | sort -nk1 > sorted_output

If 'a_program' cannot output to stdout, you should be able to do this:

grep -v '^@' massive_file | parallel --max-procs 16 --pipe -N 2500 'mkfifo out_{#}; a_program -h out_{#} > stdout_{#} 2> stderr_{#} & cat out_{#}; rm out_{#}' |
  awk '{ a[$1] = a[$1] + $2 } END { for (i in a) print i, a[i] }' | sort -nk1 > sorted_output

That will create a fifo for each job; 'a_program' saves its output to the fifo while 'cat' reads it back out (and 'rm' cleans the fifo up afterwards). GNU Parallel will then catch the output and send it to awk when the job is done.

I have the feeling we are talking about a lot of data coming into and out of 'a_program', and that you would prefer not to have temporary files (which GNU Parallel will use for buffering the output). In that case, consider putting the awk script in parallel with 'a_program'.

The awk script seems to compute per-key totals of its input, and it should not be too hard to merge several outputs from the awk script; see the sketch below.
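A sketch of that idea, assuming (as above) that 'a_program' can write its output to stdout: each job runs its own copy of the awk script, reducing its slice of the input to a small partial summary, and a final awk merges the summaries:

grep -v '^@' massive_file |
  parallel --max-procs 16 --pipe -N 2500 \
    "a_program -h --no-debug-on-stdout - 2> stderr_{#} | awk '{ a[\$1] += \$2 } END { for (i in a) print i, a[i] }'" |
  awk '{ a[$1] += $2 } END { for (i in a) print i, a[i] }' | sort -nk1 > sorted_output

Only the small per-job summaries pass between the processes, so GNU Parallel has very little output to buffer and no large temporary files are needed.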

Hi Ole,

You are correct, we are talking tens of millions of lines going into 'a_program', so there are going to be 5,000-10,000 intermediary files created. How do I go about putting the awk script in parallel with 'a_program'?

You are correct, the file written by 'a_program' is a two-column file: for each value in the first column I increment the count by the corresponding value in the second column. I then simply do a numerical sort of the lines by the first column.
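For example, with made-up data, an input of:

7 3
2 1
7 2

sums and sorts to:

2 1
7 5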

Cheers!
Nathan