Deleting files that don't contain particular text strings / more than one instance of a string

kmkocot · November 4, 2009, 7:52pm

Hi all,

I have a directory containing many subdirectories each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.

For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.

I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!

Thanks!
Kevin

rdcwayx · November 4, 2009, 8:54pm

To copy the files, you can use the command below:

find KOG* -type f -name "KOG*_final.fasta" -exec cp {} /tmp \;

But I don't understand the filter. Could you paste a sample KOG*_final.fasta file and show us the output you expect?
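
A minimal sketch of the same copy wrapped in a function, assuming you would rather collect the files in a dedicated directory than in /tmp. The function name collect_fasta and both directory arguments are made up for illustration. The destination should sit outside the tree being searched, so that find does not pick up the copies it has just made.

```shell
# Hypothetical helper: copy every KOG*_final.fasta found under the
# directory $1 into the directory $2, creating $2 if it is missing.
collect_fasta() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # Same find/cp command as above, with the target made configurable.
    find "$src" -type f -name 'KOG*_final.fasta' -exec cp {} "$dest" \;
}

# Example call (paths are placeholders):
#   collect_fasta /path/to/KOG_dirs /path/to/all_final
```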

kmkocot · November 6, 2009, 5:29pm

Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence which is always on one line and doesn't wrap no matter how long it is.

>ACAL_12345
XESLGRQVPSELFEKLDYHK
>ACAL_19472
XESLGRQVPSEXFEKLDYHJ
>ACAL_19473
XESLEKDVPSELFEKLDYHJ
>CGIG_Contig2554
XESLGRQVPSQLFEKLDYHK
>CVIR_Contig1338
XESLGRQVPSELEEKLDYHK
>HROB_98421
XESLGRQVPSELFEKLDYEV
>IPAR_Contig854
QESLGRQVPSELFEKLDYHK
>LGIG_182182
PESLGRQVPSELFEKLDYHD
>MCAL_Contig3433
XESLGRQVPSELFEKLDYHG
>NVEC_166966
XESLGRQVPSELFEKLDYHK

I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?

Thanks!
Kevin

---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

I figured out the first filter:

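cutoff=6    # minimum number of sequences a file must contain
for FileName in *.fa
do
    # count the header lines, i.e. the number of sequences
    sequences=$(grep -c '^>' "$FileName")
    echo "$FileName $sequences"
    if [ "$sequences" -lt "$cutoff" ] ; then
        printf 'Too few sequences in file %s\n' "$FileName"
        mv "$FileName" ./rejected_few_seq/
    fi
done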

I'm having trouble figuring out the other part. Here's what I've got so far:

for FileName in *.fa
do
grep -c ACAL_ $FileName >> taxon_count.txt
grep -c HROB_ $FileName >> taxon_count.txt
(...repeated for all species abbreviations)
?
done

I am trying to figure out how to add up all the values written to the taxon_count.txt file and remove $FileName if that total is smaller than a desired value. I'd also like to set a maximum number of sequences per taxon and, if that is exceeded, remove $FileName. Any guidance would be greatly appreciated.

Thanks,
Kevin

rdcwayx · November 10, 2009, 12:11am

Is this what you need?

awk -F_ '/^>/ {print $1}' "$FileName" | sort | uniq -c | sort -n
      1 >CGIG
      1 >CVIR
      1 >HROB
      1 >IPAR
      1 >LGIG
      1 >MCAL
      1 >NVEC
      3 >ACAL
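
The per-species counts above can be turned into the second filter. What follows is only a sketch under a few assumptions: the files match *.fa as in Kevin's loops, the species abbreviation is everything before the first underscore of a header line, and the helper names (count_species, max_per_species, reject_overrepresented) and the rejected_many_seq directory are invented for illustration. It does the counting inside awk instead of with sort and uniq, which is a swapped-in variant of the pipeline above and not the same command.

```shell
# Hypothetical helper: print how many different species occur in file $1.
count_species() {
    awk -F_ '/^>/ && !seen[$1]++ { c++ } END { print c + 0 }' "$1"
}

# Hypothetical helper: print the largest number of sequences that any
# one species has in file $1 (0 if the file has no header lines).
max_per_species() {
    awk -F_ '/^>/ { n[$1]++ }
             END  { m = 0; for (s in n) if (n[s] > m) m = n[s]; print m }' "$1"
}

# Hypothetical helper: move every *.fa file in the current directory in
# which some species has more than $1 sequences into the directory $2.
reject_overrepresented() {
    max_allowed=$1
    reject_dir=$2
    mkdir -p "$reject_dir" || return 1
    for FileName in *.fa
    do
        [ -f "$FileName" ] || continue   # nothing matched the pattern
        if [ "$(max_per_species "$FileName")" -gt "$max_allowed" ] ; then
            printf 'Too many sequences for one species in %s\n' "$FileName"
            mv "$FileName" "$reject_dir"/
        fi
    done
}

# Example call: allow at most 1 sequence per species.
#   reject_overrepresented 1 ./rejected_many_seq
```

Raising the first argument loosens the filter if one sequence per species turns out to be too strict. count_species can be compared against a minimum in the same way as the sequence cutoff in the first filter, so no array or separate list of taxon names is needed.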