Deleting files that don't contain particular text strings / more than one instance of a string

Hi all,

I have a directory containing many subdirectories each named like KOG#### where # represents any digit 0-9. There are several files in each KOG#### folder but the one I care about is named like KOG####_final.fasta. I am trying to write a script to copy all of the KOG####_final.fasta files to the same directory and then apply some filters to them.

For the filters, I want to go through each of the KOG####_final.fasta files and remove any of them that don't contain at least 10 different text strings that are specified in a text file or somewhere in the script. I'd also like to have a filter that removes files that have more than one instance of any one string.

I know this is a lot but I'm really stumped as to where to start on this one. Any assistance in getting started with this would be much appreciated!

Thanks!
Kevin

For copy, you can use below command:

find KOG* -type f -name "KOG*_final.fasta" -exec cp {} /tmp \;

But not understand the filter, could you paste some sample KOG*_final.fasta, and give us the sample output.

Thanks for your help! That worked really well.

The KOG*_final.fasta files look like the example below. There is a one-line header that always begins with a greater-than sign, has a 3-4 letter species abbreviation, and a sequence identifier. The next line contains the corresponding amino-acid sequence which is always on one line and doesn't wrap no matter how long it is.

>ACAL_12345
XESLGRQVPSELFEKLDYHK
>ACAL_19472
XESLGRQVPSEXFEKLDYHJ
>ACAL_19473
XESLEKDVPSELFEKLDYHJ
>CGIG_Contig2554
XESLGRQVPSQLFEKLDYHK
>CVIR_Contig1338
XESLGRQVPSELEEKLDYHK
>HROB_98421
XESLGRQVPSELFEKLDYEV
>IPAR_Contig854
QESLGRQVPSELFEKLDYHK
>LGIG_182182
PESLGRQVPSELFEKLDYHD
>MCAL_Contig3433
XESLGRQVPSELFEKLDYHG
>NVEC_166966
XESLGRQVPSELFEKLDYHK

I'm trying to write a script that will go through each of these files and check them to see if they meet certain criteria. For example, I want to move all files containing fewer than 10 greater-than signs (fewer than 10 sequences) into a "trash" folder. I've played around using if and grep -c \> for this part but I haven't figured it out yet. Is there a better way to go about this?

I'd also like to trash any files that have more than 1 sequence for any one species (although I'd like to be able to vary this number if it turns out that is too strict). Would I have to use an array for this? Or another file that specifies all of the taxon names?

Thanks!
Kevin

---------- Post updated 11-06-09 at 04:29 PM ---------- Previous update was 11-05-09 at 06:38 PM ----------

I figured out the first filter:

for FileName in *.fa
do
sequences=`grep -c \> $FileName`
cutoff=6
echo $FileName $sequences
if [ "$sequences" -lt "$cutoff" ] ; then
printf "Too few sequences in file $FileName"
mv $FileName ./rejected_few_seq/
fi
done

I'm having trouble figuring out the other part. Here's what I've got so far:

for FileName in *.fa
do
grep -c ACAL_ $FileName >> taxon_count.txt
grep -c HROB_ $FileName >> taxon_count.txt
(...repeated for all species abbreviations)
?
done

I am trying to figure out how to add all the values put into the taxon_count.txt file and remove $FileName if that value is smaller than a desired value. I'd also like to set a max value for number of sequences per taxon and if that is exceeded, remove #FileName. Any guidance would be greatly appreciated.

Thanks,
Kevin

You need this?

 awk -F_ '/^>/ {print $1 }' $FileName|sort |uniq -c |sort -n
      1 >CGIG
      1 >CVIR
      1 >HROB
      1 >IPAR
      1 >LGIG
      1 >MCAL
      1 >NVEC
      3 >ACAL