Attach filenames to wc results for a massive number of files

Hello,
I have a massive number of big files, and each needs its total line count (more than 100 million lines per file). I want the file name attached to each count so that names and counts line up nicely.
Doing one file at a time would take hours to finish, so I send the jobs to the background, since I have multiple cores available to get the work done quickly. The problem with my script is that the echo -n $f" " part always finishes first, while the zcat ${f}_R1.fq.gz | wc -l part lags far behind, so the result is not aligned as expected.

Here is my code:

for f in $(cat ${LIST1}); do
    echo -n $f" "  >> raw_reads_count.table1
    zcat ${f}_R1.fq.gz | wc -l >> raw_reads_count.table1 &    # This is the part that lags behind
done
------------------------------------------------------------------------------------------------------
messed-up output:
a      
bb    
ccc   
xyz 
267234214
777234211
937214233
1027254258
------------------------------------------------------------------------------------------------------
 Expected output:
a    267234214
bb   937214233
ccc  777234211
xyz 1027254258

 

How should I improve my script to get what is expected? Thanks a lot!

How about

{ echo -n $f" "; zcat ${f}_R1.fq.gz | wc -l; } >> raw_reads_count.table1 &

Should that fail, write each result to its own file, then, after the loop, concatenate those files.
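A rough sketch of that fallback, assuming the same list file and naming convention as above and that the names in ${LIST1} are plain names in the current directory (the _count.tmp suffix is only illustrative):

: > raw_reads_count.table1
for f in $(cat ${LIST1}); do
    # each background job writes name and count to its own small file,
    # so nothing can interleave inside the final table
    { echo -n $f" "; zcat ${f}_R1.fq.gz | wc -l; } > ${f}_count.tmp &
done
wait                                        # let all counts finish
cat *_count.tmp >> raw_reads_count.table1   # then stitch them together in one go
rm -f *_count.tmp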


@Rudic
No, still the same as the original problem.
I'll do the single files and then concatenate them. Thanks!

How about something more like:

LIST1=/what/ever/you/want
OUTPUT1=raw_reads_count.table1

while read -r f
do	(	linecount=$(zcat ${f}_R1.fq.gz | wc -l)
		printf '%s\t%s\n' "$f" "$linecount" >> "$OUTPUT1"
	)&
done < "$LIST1"
wait
printf '%s: %s is ready.\n' "${0##*/}" "$OUTPUT1"

Actually this is a very interesting problem. It is hard to simulate without actually creating some terabytes of files similar in size to what you have to process, so before I start to do that, I'd like to offer a few theories first which you may verify:

My suspicion is that the problem is the buffered nature of stdout. From time to time this buffer is flushed, and because the output of echo is already available it gets written into the file, while the zcat is still running at that moment and its output is written much later. Maybe the following might help. I used printf instead of echo, but that is not the point: to produce the output, the subshell has to be finished, so each line should get printed completely or not at all. Because the whole process is put in the background, the original order of the filenames will no longer be retained - maybe of no concern to you, but you should be aware of it.

Another point is the number of processes you start: starting an (in principle unlimited) number of background processes at the same time is always a bit of a hazard. The script might work well with 10 or 20 files generating 10 or 20 background processes, but a directory may just as well hold millions of files. No system would survive an attempt to start a million background processes, no matter how small they are and how many processors you have. You may want to implement some logic that keeps only some maximum number of background processes running concurrently (a small sketch of such a throttle follows after the snippet below).

( printf "%s\t%s\n" "$f" "$(zcat ${f}_R1.fq.gz | wc -l)" ) >> raw_reads_count.table1 &
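As for capping the number of concurrent jobs, a coarse throttle in plain shell could look like this - only a sketch, assuming the same list file and naming as above; MAXJOBS and the batch-draining approach are merely illustrative:

MAXJOBS=48                      # illustrative limit; match it to your core count
count=0
while read -r f; do
    ( printf "%s\t%s\n" "$f" "$(zcat ${f}_R1.fq.gz | wc -l)" ) >> raw_reads_count.table1 &
    count=$((count + 1))
    # after every MAXJOBS background jobs, drain the whole batch before launching more
    if [ "$count" -ge "$MAXJOBS" ]; then
        wait
        count=0
    fi
done < "$LIST1"
wait                            # catch the last, partial batch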

I hope this helps.

bakunin


@bakunin @all
Your comments are exactly what I was looking for. I reworked my script with GNU parallel to control the number of processes, but I hit another wall:

parallel -a $LIST1 -j 48 "(printf "%s\t%s\n" {} $(zcat {}_R1.fq.gz | wc -l)) >> raw_reads_count.table1"
------------------------------------------------------
a 0 >> raw_reads_count.table1
bb 0 >> raw_reads_count.table1
ccc 0 >> raw_reads_count.table1
xyz 0 >> raw_reads_count.table1
  

The problem seems to be with the parallel placeholder expansion. Is it because of too many layers of parentheses? I need to get more familiar with quoting in bash.
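If I understand the quoting correctly, the $(zcat {}_R1.fq.gz | wc -l) is probably expanded by my interactive shell before parallel even starts (with {} still literal, so wc sees nothing and prints 0), and the nested double quotes break the command string apart. Something like this, with the whole command in single quotes so the expansion happens in the worker shell, might be closer to what I'm after (untested sketch; GNU parallel groups each job's output, so redirecting parallel itself should keep the lines whole):

parallel -a $LIST1 -j 48 'printf "%s\t%s\n" {} "$(zcat {}_R1.fq.gz | wc -l)"' >> raw_reads_count.table1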
Thanks for any help!

It seems to me this is the final solution:

parallel -a $LIST1 -j 48 "(echo -n {}' '; (zcat ${RAW_DIR1}/{}_R1.fq.gz | wc -l)) > {}_counts.tmp"
cat *_counts.tmp >> raw_reads_count.table1

Thank you all for the help!

Parallel is not a go-faster button for files. Unless your CPU is maxing out, there's no benefit.

GNU parallel is just doing individual files like you were doing anyway. It has to, lacking magic mechanisms to predict future filesize and move things where they belong.

If your CPU is maxing out, pigz may work faster one file at a time than what you were trying to do in parallel.
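A minimal sketch of that one-at-a-time approach, assuming pigz is installed and the same list file and naming as above (pigz -dc decompresses to stdout, like zcat):

# one file at a time; pigz replaces zcat for the decompression step
while read -r f; do
    printf '%s\t%s\n' "$f" "$(pigz -dc "${f}_R1.fq.gz" | wc -l)"
done < "$LIST1" >> raw_reads_count.table1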


Thanks!
The original question was about synchronizing the echo and wc -l output, but this thread has helped more than I expected. I just installed pigz and will give it a try soon.

Except that they are synchronized. The problem was always all your other processes stomping on your file simultaneously.
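A quick way to see that effect in isolation - an illustrative sketch, not from the original thread - is several background jobs sharing one output file, each writing its label immediately and its "result" only after a delay (the sleep stands in for the slow zcat | wc -l):

: > demo.out
for n in 1 2 3; do
    # the label is appended immediately; the result arrives after the sleep,
    # by which time the other jobs have already appended their labels
    { echo -n "job$n "; sleep "$n"; echo "result$n"; } >> demo.out &
done
wait
cat demo.out

The labels bunch up on one line and the results trail afterwards, exactly like the messed-up table at the top of the thread.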