Hello,
I want to split a big file into smaller ones by a certain "count" of records. I am aware this type of job has been asked about quite often, but I am posting again because I came across csplit, which may offer a simpler solution to the problem.
Input file (fasta format):
I want the output split by sequence count (say 2, i.e. each output file will contain 2 sequences). Typically the input file contains hundreds of entries, each with rows of different lengths and a different number of rows, but always with the same ">" delimiter. Recent fastq files can contain millions of entries (short reads, with "@" as the delimiter, for those not familiar with DNA sequence formats).
OUTFILE1
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
AWK can do this job easily, and there are many perl/bioperl and python/biopython scripts for it as well, but it seems to me csplit could be simpler, as a single command.
I gave it a few tries like the following, but they did not work out correctly:
The problem here is that ARG1 for csplit is a number of lines, but in my case each sequence may span a different number of lines, so the count I want is not proportional to the line number. ARG2 is the number of output files, or the repeat count for csplit. Thanks a bunch!
yifang
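For comparison, count-based splitting can also be sketched directly in awk, opening a new output file at every n-th header line. In this minimal sketch, the sample input, the group size of 2, and the OUTFILE prefix are all illustrative choices:

```shell
# Minimal sketch: split a fasta file into groups of n sequences with awk alone.
# The sample input written to data1, n=2, and the OUTFILE prefix are illustrative.
printf '>seq1\nagtcagtc\nag\n>seq2\naaaa\n>seq3\ncccc\n' > data1

awk -v n=2 -v prefix=OUTFILE '
    /^>/ {                      # a ">" header line starts a new sequence
        if (seq++ % n == 0) {   # at every n-th sequence, start a new file
            if (out) close(out)
            out = prefix (++part)
        }
    }
    { print > out }             # all lines go to the current output file
' data1
```

With the sample input above, OUTFILE1 receives seq1 and seq2, and OUTFILE2 receives seq3.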
Here is a demonstration of a gathering technique that might be useful here:
#!/usr/bin/env bash
# @(#) s1 Demonstrate gather of lines, split into groups, expand.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }    # redefine db as a no-op to disable the debug output above
C=$HOME/bin/context && [ -f "$C" ] && "$C" awk tail split
FILE=${1-data1}
pl " Input data file $FILE:"
cat "$FILE"
# Remove debris.
rm -f xa*
pl " Script \"gather\":"
cat gather
# Bunch lines of each sequence together on a single line.
./gather "$FILE" |
tail --lines=+2 |
split --lines=2
pl " Files xa created by split:"
ls xa*
# Expand file to temporary, replace original.
for file in xa*
do
awk '
{ gsub(/=/,"\n") ; printf("%s",$0) }
' "$file" > t1
mv t1 $file
done
pl " Sample: file xab expanded:"
cat xab
exit 0
producing
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
tail (GNU coreutils) 6.10
split (GNU coreutils) 6.10
-----
Input data file data1:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc
-----
Script "gather":
#!/usr/bin/env bash
# @(#) gather Substitution of newline.
FILE=${1-data1}
awk '
BEGIN { RS = ">" }
{ gsub(/\n/,"=") ; printf("%s%s\n", RS,$0) }
' "$FILE"
exit 0
-----
Files xa created by split:
xaa xab xac
-----
Sample: file xab expanded:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
The short awk script in the file gather collects the lines belonging to one sequence into a single "super line". The newlines are replaced by some character not present in the data; here I used "=".
Then the super lines are split into groups of 2.
Another awk script expands the super lines by replacing each "=" with a real newline and rewrites the files.
Thanks drl!
Your script seems quite different from the usual ones. Could you please explain it in more detail to educate me a bit? Thanks a lot!
yt
There is some preliminary coding to display my environment. The core of the solution is:
bunch input file into super lines -> split into separate files of n (=2 in this demo) -> expand the individual files
The bunching is done with an awk script, gather. Each "bunch" starts with (is separated by) ">", which is set as RS, the Record Separator. All the newlines within a bunch are then replaced with "=". That creates the "super lines".
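To see the bunching in isolation, here is a tiny sketch of gather's awk on a made-up two-sequence input. Note the empty first record that RS=">" produces before the first delimiter, which is exactly why s1 pipes through tail --lines=+2:

```shell
# Sketch: the gather step on a tiny two-sequence input.
printf '>seq1\naa\nbb\n>seq2\ncc\n' |
awk 'BEGIN { RS = ">" } { gsub(/\n/,"=") ; printf("%s%s\n", RS,$0) }'
# Prints three lines:
#   >              (the empty record before the first ">")
#   >seq1=aa=bb=
#   >seq2=cc=
```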
split works on lines, so it does not know that each super line really contains many lines. For each group of n (=2) super lines, split writes a new file: xaa, xab, etc.
Then, for each of those files xa*, "=" is replaced with newlines again, and the files are rewritten.
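The expansion step can likewise be sketched on a single made-up super line:

```shell
# Sketch: the expand step reverses the "=" substitution, restoring
# the original multi-line record (the trailing "=" becomes the final newline).
printf '>seq1=aa=bb=\n' |
awk '{ gsub(/=/,"\n") ; printf("%s",$0) }'
# Prints:
#   >seq1
#   aa
#   bb
```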
Note that this is a demonstration of a technique. There are certainly other solutions.
The demonstration script assumes groups of 2; it is not generalized. You would need to change the split command, or modify the script to accept another parameter in addition to the input file name, to obtain a different number of sequences per output file.
However, the script as it stands will accept (but ignore) parameters beyond the file name. For example, we can copy the default input file, data1, to another file, my-input-test, and run that:
% cp data1 my-input-test
% ./s1 my-input-test 324
producing the output:
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
tail (GNU coreutils) 6.10
split (GNU coreutils) 6.10
-----
Input data file my-input-test:
>seq1
( ... lines deleted )
-----
Sample: file xab expanded:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
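Putting the pieces together, the generalization mentioned above (taking the group size as a second argument) might look like the following sketch. The demo input written to data1 and the variable names are illustrative, not part of the original s1:

```shell
#!/usr/bin/env bash
# Sketch: the gather/split/expand technique with the group size as a parameter.
# The demo input below stands in for the thread's data1.
printf '>seq1\naa\nbb\n>seq2\ncc\n>seq3\ndd\n' > data1

FILE=${1-data1}
N=${2-2}                        # sequences per output file

# Remove debris from earlier runs.
rm -f xa*

# Bunch each ">"-delimited record into one super line; NR > 1 skips the
# empty first record, replacing the tail --lines=+2 step of s1.
awk 'BEGIN { RS = ">" } NR > 1 { gsub(/\n/,"=") ; printf(">%s\n", $0) }' "$FILE" |
split --lines="$N"

# Expand each piece back into real lines and rewrite it.
for f in xa*
do
  awk '{ gsub(/=/,"\n") ; printf("%s",$0) }' "$f" > t1
  mv t1 "$f"
done
```

With the demo input and N=2, file xaa holds seq1 and seq2, and xab holds seq3.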