Hi.
Apologies for the length of this and for the late posting. I am always skeptical of shell solutions once we get to sizable files, 1M lines or more, because of the time involved. I focused only on the time to read and split the data, creating a test file of 1M lines whose only keys are scaffold1 and scaffold2. Here is the script:
#!/usr/bin/env bash
# @(#) s1 Demonstrate schemes to split a file based on content.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
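# Disable debugging by overriding db with a no-op; comment out the next line to re-enable it.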
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C awk perl gate mmsplit
inxi -c0 -C
FILE=data1
FILE_tmp=/tmp/data1$$
trap 'rm -f $FILE_tmp' 0
trap 'exit 1' 1 2 15
rm -f file* scaffold*
# Create data file if it does not yet exist.
if [ ! -f $FILE ]
then
./create2
fi
pl " Input data file $FILE:"
specimen 2:2:2 -n $FILE
# Sample line:
# scaffold1 928 929 C/T +
pl " Results, shell, unsorted:"
time while read col1 rest; do echo "$col1 $rest" >> ${col1}.txt; done < $FILE
pe
wc scaffold*
rm scaffold*
pl " Results, awk, unsorted:"
time awk '!($1 in a){a[$1]="file"++c".txt"}{print $0 >>a[$1]; close(a[$1])}' $FILE
pe
wc file*
rm file*
pl " Results, sort the file:"
time sort -o $FILE_tmp $FILE
pe
specimen 2:2:2 -n $FILE_tmp
pl " Results, awk sorted:"
time awk '$1 != prev{if(f)close(f);f="file"++c".txt"; prev=$1}{print > f}END{if(f)close(f)}' $FILE_tmp
pe
wc file*
rm file*
pl " Results, gate, sorted:"
time gate -f=1 -s=" " $FILE_tmp
pe
wc scaffold*
rm scaffold*
pl " Results, mmsplit, sorted:"
time mmsplit --fix=every --body=body --grep='/^scaffold(\d+)/' -i=$FILE_tmp
pe
wc body*
rm body*
exit 0
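The generator create2 is a local script and is not shown. Here is a minimal hypothetical sketch that builds an equivalent alternating file; the exact field separators of the real data1 are not recoverable from this post, so single spaces are assumed and the byte counts would differ:

#!/usr/bin/env bash
# create2 (sketch) -- write 1M lines alternating between the two keys.
awk 'BEGIN { for (i = 1; i <= 500000; i++) {
               print "scaffold1 928 929 C/T +"
               print "scaffold2 928 929 C/T +"
             } }' > data1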
Running the script produces:
$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution : Debian 8.4 (jessie)
bash GNU bash 4.3.30
awk GNU Awk 4.1.1, API: 1.1 (GNU MPFR 3.1.2-p3, GNU MP 6.0.0)
perl 5.20.2
gate (local) 1.10
mmsplit (local) 2.0
CPU: Triple core AMD FX-6350 Six-Core (-MCP-) cache: 6144 KB
clock speeds: max: 3915 MHz 1: 3915 MHz 2: 3915 MHz 3: 3915 MHz
-----
Input data file data1:
Edges: 2:2:2 of 1000000 lines in file "data1"
1 scaffold1 928 929 C/T +
2 scaffold2 928 929 C/T +
---
500001 scaffold1 928 929 C/T +
500002 scaffold2 928 929 C/T +
---
999999 scaffold1 928 929 C/T +
1000000 scaffold2 928 929 C/T +
-----
Results, shell, unsorted:
real 0m26.607s
user 0m17.868s
sys 0m8.624s
500000 2500000 18000000 scaffold1.txt
500000 2500000 18000000 scaffold2.txt
1000000 5000000 36000000 total
-----
Results, awk, unsorted:
real 0m19.304s
user 0m5.892s
sys 0m13.308s
500000 2500000 21000000 file1.txt
500000 2500000 21000000 file2.txt
1000000 5000000 42000000 total
-----
Results, sort the file:
real 0m0.424s
user 0m0.416s
sys 0m0.176s
Edges: 2:2:2 of 1000000 lines in file "/tmp/data110702"
1 scaffold1 928 929 C/T +
2 scaffold1 928 929 C/T +
---
500001 scaffold2 928 929 C/T +
500002 scaffold2 928 929 C/T +
---
999999 scaffold2 928 929 C/T +
1000000 scaffold2 928 929 C/T +
-----
Results, awk sorted:
real 0m0.515s
user 0m0.420s
sys 0m0.092s
500000 2500000 21000000 file1.txt
500000 2500000 21000000 file2.txt
1000000 5000000 42000000 total
-----
Results, gate, sorted:
real 0m6.238s
user 0m6.144s
sys 0m0.092s
500000 2500000 21000000 scaffold1
500000 2500000 21000000 scaffold2
1000000 5000000 42000000 total
-----
Results, mmsplit, sorted:
real 0m2.918s
user 0m2.796s
sys 0m0.120s
500000 2500000 21000000 body.1
500000 2500000 21000000 body.2
1000000 5000000 42000000 total
Comments:
This isn't just a simple split, it's a split-and-group problem. A code like csplit might be considered at first glance, but it keys off a unique header-like value and then transfers lines until the next occurrence of a header. We need to create multiple output files, each gathering the lines that share a key value.
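For example, this hypothetical invocation of csplit on the unsorted file would cut a new piece at every matching line, producing on the order of 500000 tiny files rather than two grouped ones:

# Cuts at each scaffold2 line; -n 6 allows enough suffix digits.
csplit --quiet -n 6 data1 '/^scaffold2/' '{*}'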
I like the shell code because it is simple to understand, but it takes a long time.
The awk unsorted version also takes a long time, and I think that is because of the large number of close calls.
The awk sorted version is very speedy and, even when the time for the sort is added in, seems like the best solution.
Our local perl codes gate and mmsplit are run for comparison. The gate code is slower, but is very simple to call. The mmsplit code is faster than gate, but has a more complicated calling sequence.
So I would choose the awk sorted code from Akshay Hegde, but precede it with a sort. The total real time, 0.424 + 0.515 -> 0.939 seconds, is better than the other solutions.
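Combined, that is a single pipeline. A sketch, here naming the output files after the key rather than file1.txt and file2.txt:

sort data1 |
awk '$1 != prev { if (f) close(f); f = $1 ".txt"; prev = $1 }
     { print > f }
     END { if (f) close(f) }'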
The awk unsorted version could be improved by holding lines in memory until one had, say, 1000 of them for a given key, then writing them out and closing the file. That would cut down the time, but increase the complexity.
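An untested sketch of that buffering idea (the 1000-line threshold is arbitrary, and stale fileN.txt are assumed removed first, as in s1):

awk '
  !($1 in name) { name[$1] = "file" ++c ".txt" }
  {
    buf[$1] = buf[$1] $0 "\n"              # accumulate lines per key
    if (++cnt[$1] >= 1000) {               # buffer full: append it ...
      printf "%s", buf[$1] >> name[$1]
      close(name[$1])                      # ... and release the descriptor
      buf[$1] = ""; cnt[$1] = 0
    }
  }
  END {                                    # flush the partial buffers
    for (k in buf)
      if (buf[k] != "") {
        printf "%s", buf[k] >> name[k]
        close(name[k])
      }
  }' data1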
The issue of the maximum number of open files might be a problem, although less so for the shell than for the other scripting solutions. Solutions using the sorted file, which only ever hold one output file open at a time, would probably be best for a large number of possible group values.
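The current limit can be checked, and raised within the hard limit, from the shell:

ulimit -n        # soft limit on open file descriptors
ulimit -Hn       # hard limit
ulimit -n 4096   # raise the soft limit for this shell, if permitted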
Best wishes ... cheers, drl