How to keep grep output intact for each matching line?

I have about 80 files (some as big as 30 GB, with more than 1 billion lines!) to grep for a pattern, and I redirect the matches to a single output file. I have a 96-core machine, so each grep job is sent to the background to speed up the search:

file1.tab
chr1A_part1    123241847    123241848
chr1A_part1    123241848    123241849
chr1A_part1    123241849    123241850
chr1A_part1    123241850    123241851
......

The input files uniformly have 3 fields per row, and so should the output file:

for file in $(cat files.list); do
    grep -F chr1A ${file} >> subset_chr1A.tab &
done

but I found that some of the matching lines are broken and the output file is a mess!

subset_chr1A.tab
chr1A_part1    123241847    123241848
chr1A_part1    123241848    123241849
chr1A_part1    1232
41849    123241850
ch1
chr1A_part1    12
3241850    
chr1A_part1    123441848    123441849
123541851
...

It seems to me the problem comes from the concurrent writes, as 80 grep jobs for 80 files are appending to the same output file. By default grep prints matching lines, so I assumed each row would be written out as a whole, but that did not happen in my case.

What is wrong here?

Buffering will make a mess of this: each grep bundles its output into arbitrary blocks and writes them out, and those blocks pay no attention to where lines begin and end. A long enough line could even be split across more than one write!

If you have GNU grep, --line-buffered may help, but it comes with a big performance cost.
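
For example, a minimal sketch of the same loop with line buffering (assuming GNU grep, which provides --line-buffered; each matching line is then flushed as its own write, which is much less likely to interleave mid-line, though still not formally guaranteed to be atomic for regular files):

for file in $(cat files.list); do
    # flush every matching line immediately instead of in large buffered blocks
    grep --line-buffered -F chr1A "${file}" >> subset_chr1A.tab &
done
wait    # let all background greps finish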

You could also send the output to separate files and cat them together later.
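
A minimal sketch of that approach, assuming the entries in files.list are plain paths without embedded whitespace (the per-file output names are just an illustration):

# one output file per grep job, so no two jobs share a destination
for file in $(cat files.list); do
    grep -F chr1A "${file}" > "subset_chr1A.$(basename "${file}")" &
done
wait    # let all background jobs finish

# stitch the per-file results into the final table
# (the glob does not match the final output name itself)
cat subset_chr1A.*.tab > subset_chr1A.tab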


I will go with the second suggestion. Thanks!

Why not forgo the loop?

grep -F chr1A file*.tab > subset_chr1A.tab
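
One detail: when grep is given more than one file it prefixes each match with the file name, which would break the three-field layout. The -h option (available in GNU and BSD grep) suppresses that prefix:

grep -h -F chr1A file*.tab > subset_chr1A.tab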

True, the limiting factor is likely to be disk I/O, not CPU.

Thanks Rudic!
Before I try your method, does this grep -F chr1A file*.tab swallow all 80 files (~2400 GB!) into memory first?

I don't think it consumes much memory: grep reads the files line by line, tests each line against the pattern, and either drops or outputs it.
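
If you want to check for yourself, something like GNU time can report the peak memory a single grep actually uses (a sketch; the -v report and the /usr/bin/time path assume GNU time on Linux):

# run one grep on one big file and report its resource usage
/usr/bin/time -v grep -F chr1A file1.tab > /dev/null
# look for "Maximum resident set size" in the report;
# it should stay small even for a 30 GB input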