Hello,
I want to split a big file into smaller ones by a certain "count" of records. I am aware this type of job has been asked about quite often, but I am posting again because I came across csplit, which may offer a simpler solution to the problem.
Input file (fasta format):
I want the output split by sequence count (say 2, i.e. each output file will contain 2 sequences). Typically the input file contains hundreds of entries, each with rows of different lengths and a different number of rows, but always with the same ">" delimiter. Recent fastq files can contain millions of entries (short reads, with "@" as the delimiter, for those not familiar with DNA sequence formats).
OUTFILE1
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
AWK can do this job easily, and there are many perl/bioperl and python/biopython scripts for it as well, but it seems to me csplit could be simpler, as a single command.
I gave it a few tries like the following, but they did not work out correctly:
The problem here is that ARG1 for csplit is a number of lines, but in my case each sequence may span a different number of lines, so the count I want is not proportional to the line number. ARG2 is the number of output files, or the repeat count for csplit. Thanks a bunch!
yifang
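For comparison, count-based splitting can also be sketched directly in awk, opening a new output file at every n-th header line. In this minimal sketch, the sample input, the group size of 2, and the OUTFILE prefix are all illustrative choices:

```shell
# Minimal sketch: split a fasta file into groups of n sequences with awk alone.
# The sample input written to data1, n=2, and the OUTFILE prefix are illustrative.
printf '>seq1\nagtcagtc\nag\n>seq2\naaaa\n>seq3\ncccc\n' > data1

awk -v n=2 -v prefix=OUTFILE '
    /^>/ {                      # a ">" header line starts a new sequence
        if (seq++ % n == 0) {   # at every n-th sequence, start a new file
            if (out) close(out)
            out = prefix (++part)
        }
    }
    { print > out }             # all lines go to the current output file
' data1
```

With the sample input above, OUTFILE1 receives seq1 and seq2, and OUTFILE2 receives seq3.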
Here is a demonstration of a gathering technique that might be useful here:
#!/usr/bin/env bash
# @(#) s1 Demonstrate gather of lines, split into groups, expand.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }    # redefine db as a no-op to disable the debug output above
C=$HOME/bin/context && [ -f "$C" ] && "$C" awk tail split
FILE=${1-data1}
pl " Input data file $FILE:"
cat "$FILE"
# Remove debris.
rm -f xa*
pl " Script \"gather\":"
cat gather
# Bunch lines of each sequence together on a single line.
./gather "$FILE" |
tail --lines=+2 |
split --lines=2
pl " Files xa created by split:"
ls xa*
# Expand file to temporary, replace original.
for file in xa*
do
awk '
{ gsub(/=/,"\n") ; printf("%s",$0) }
' "$file" > t1
mv t1 $file
done
pl " Sample: file xab expanded:"
cat xab
exit 0
producing
% ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
tail (GNU coreutils) 6.10
split (GNU coreutils) 6.10
-----
Input data file data1:
>seq1
agtcagtc
agtcagtc
ag
>seq2
agtcagtcagtc
agtcagtcagtc
agtcagtc
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
>seq5
agtcagtc
>seq6
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtcagtcagtcagtc
agtcagtcagtcagtcagtcagtc
-----
Script "gather":
#!/usr/bin/env bash
# @(#) gather Substitution of newline.
FILE=${1-data1}
awk '
BEGIN { RS = ">" }
{ gsub(/\n/,"=") ; printf("%s%s\n", RS,$0) }
' "$FILE"
exit 0
-----
Files xa created by split:
xaa xab xac
-----
Sample: file xab expanded:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
The short awk script in the file gather collects the lines belonging to one sequence into a single "super line". The newlines are replaced by some character not present in the data; here I used "=".
Then the super lines are split into groups of 2.
Another awk script expands the super lines by replacing each "=" with a real newline and rewrites the files.
Thanks drl!
Your script seems quite different from the usual ones. Could you please explain it in more detail to educate me a bit? Thanks a lot!
yt
There is some preliminary coding to display my environment. The core of the solution is:
bunch input file into super lines -> split into separate files of n (=2 in this demo) -> expand the individual files
The bunching is done with an awk script, gather. Each "bunch" starts with (is separated by) ">", which is set as RS, the Record Separator. All the newlines within a bunch are then replaced with "=". That creates the "super lines".
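To see the bunching in isolation, here is a tiny sketch of gather's awk on a made-up two-sequence input. Note the empty first record that RS=">" produces before the first delimiter, which is exactly why s1 pipes through tail --lines=+2:

```shell
# Sketch: the gather step on a tiny two-sequence input.
printf '>seq1\naa\nbb\n>seq2\ncc\n' |
awk 'BEGIN { RS = ">" } { gsub(/\n/,"=") ; printf("%s%s\n", RS,$0) }'
# Prints three lines:
#   >              (the empty record before the first ">")
#   >seq1=aa=bb=
#   >seq2=cc=
```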
split works on lines, so it does not know that each super line really contains many lines. For each group of n (=2) super lines, split writes a new file: xaa, xab, etc.
Then, for each of those files xa*, "=" is replaced with newlines again, and the files are rewritten.
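The expansion step can likewise be sketched on a single made-up super line:

```shell
# Sketch: the expand step reverses the "=" substitution, restoring
# the original multi-line record (the trailing "=" becomes the final newline).
printf '>seq1=aa=bb=\n' |
awk '{ gsub(/=/,"\n") ; printf("%s",$0) }'
# Prints:
#   >seq1
#   aa
#   bb
```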
Note that this is a demonstration of a technique. There are certainly other solutions.
The demonstration script assumes groups of 2; it is not generalized. You would need to change the split command, or modify the script to accept another parameter in addition to the input file name, to obtain a different number of sequences per output file.
However, the script as it stands will accept (but ignore) parameters beyond the file name. For example, we can copy the default input file, data1, to another file, my-input-test, and run that:
% cp data1 my-input-test
% ./s1 my-input-test 324
producing the output:
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian GNU/Linux 5.0.8 (lenny)
bash GNU bash 3.2.39
awk GNU Awk 3.1.5
tail (GNU coreutils) 6.10
split (GNU coreutils) 6.10
-----
Input data file my-input-test:
>seq1
( ... lines deleted )
-----
Sample: file xab expanded:
>seq3
agtcagtcagtcagtc
agtcagtcagtcagtc
agtcagtcagtcagtc
>seq4
agtcagtcagtcagtcagtcagtc
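Putting the pieces together, the generalization mentioned above (taking the group size as a second argument) might look like the following sketch. The demo input written to data1 and the variable names are illustrative, not part of the original s1:

```shell
#!/usr/bin/env bash
# Sketch: the gather/split/expand technique with the group size as a parameter.
# The demo input below stands in for the thread's data1.
printf '>seq1\naa\nbb\n>seq2\ncc\n>seq3\ndd\n' > data1

FILE=${1-data1}
N=${2-2}                        # sequences per output file

# Remove debris from earlier runs.
rm -f xa*

# Bunch each ">"-delimited record into one super line; NR > 1 skips the
# empty first record, replacing the tail --lines=+2 step of s1.
awk 'BEGIN { RS = ">" } NR > 1 { gsub(/\n/,"=") ; printf(">%s\n", $0) }' "$FILE" |
split --lines="$N"

# Expand each piece back into real lines and rewrite it.
for f in xa*
do
  awk '{ gsub(/=/,"\n") ; printf("%s",$0) }' "$f" > t1
  mv t1 "$f"
done
```

With the demo input and N=2, file xaa holds seq1 and seq2, and xab holds seq3.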