How to extract parts of a file into separate output files

iammitra · May 8, 2009, 9:49am

Hi All,
I am very new to programming and need some help.
I have an input file like this:

Number of disabled taxa: 9
Loading mapping file: ncbi.map
Load mapping:
taxId2TaxLevel: 469951
--- Subsample reads (20%): 66680 of 334386
Processing: tree-from-summary
Running tree-from-summary algorithm
Taxonomy:
Gammaproteobacteria: 2767
Alphaproteobacteria: 4123
Deltaproteobacteria: 1343
Epsilonproteobacteria: 26
Not assigned: 1445
No hits: 220253
+++++++++++End of summary for file: B-Red-sum.txt
--- Subsample reads (20%): 67037 of 334386
Processing: tree-from-summary
Running tree-from-summary algorithm
Taxonomy:
Gammaproteobacteria: 2809
Alphaproteobacteria: 4001
Deltaproteobacteria: 1208
Epsilonproteobacteria: 15
Not assigned: 299
No hits: 461890
+++++++++++End of summary for file: B-Red-sum.txt

::::: and so on

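I want to create output files like this:
outfile1.txt should contain the lines from the one after "Taxonomy:" up to, but not including, the "+++++++++++End" line, with no spaces at the start of each line; outfile2.txt should contain the next such block, and so on.

So the desired output will be: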
outfile1.txt
Gammaproteobacteria: 2767
Alphaproteobacteria: 4123
Deltaproteobacteria: 1343
Epsilonproteobacteria: 26
Not assigned: 1445
No hits: 220253

outfile2.txt
Gammaproteobacteria: 2809
Alphaproteobacteria: 4001
Deltaproteobacteria: 1208
Epsilonproteobacteria: 15
Not assigned: 299
No hits: 461890

and so on.

Can anybody please help me with this?

I tried some code like this, but it didn't work.
--------------------------------------------------------------------------
#!/bin/tcsh
if $#argv != "1" then
echo "Usage: process-file-script 1st-output-file-as-inputfile"
exit 0
endif

FIL_NM=$1

str=""
cat $FIL_NM | while read LINE
do
if [ "`echo $LINE | awk '{print $1}'`" = "+++++++++++Begin" ] ; then
n=1
c=1
fi
if [ "`echo $LINE |grep Gamma`"] ; then
NEW_FIL_NM=$FIL_NM"_"$n.txt"
fi

fi
if [ "`echo $LINE | awk '{print $1}'`" = "+++++++++++End" ] ; then
n=0
fi
done
--------------------------------------------------------
Please help...
Many thanks in advance...
Best wishes,
Mitra

vgersh99 · May 8, 2009, 10:00am

nawk '
    /^Taxonomy/ {p=6;close(out);out="output" ++cnt ".txt";next}
    p &&p-- { print > out }' myInputFile

ghostdog74 · May 8, 2009, 10:10am

if you have Python, here's an alternative solution

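f=0;i=0
o=None
for line in open("file"):
    line=line.strip()                  # also removes the leading spaces
    if line.startswith("+++++++++++"):
        f=0
        if o: o.close()                # guard: no file is open before the first section
        continue
    if "Taxonomy:" in line:
        f=1;i=i+1
        o=open("out_"+str(i)+".txt","w")
        continue                       # do not copy the "Taxonomy:" line itself
    if f:
        print(line, file=o)            # Python 3 form of: print >>o, line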

iammitra · May 8, 2009, 10:14am

Hallo ghostdog74,
Thanks for your reply. But I am sorry to say that I forgot to mention that my input file does not always have only 6 lines per section; I just copied a few lines. The number of lines varies from 100 to 200, so the program needs to read up to the +++++++++++End line.

Thanks a lot,
Mitra.

durden_tyler · May 8, 2009, 10:14am

And here's a perl solution:

$
$
$ cat input.txt
Number of disabled taxa: 9
Loading mapping file: ncbi.map
Load mapping:
taxId2TaxLevel: 469951
--- Subsample reads (20%): 66680 of 334386
Processing: tree-from-summary
Running tree-from-summary algorithm
Taxonomy:
Gammaproteobacteria: 2767
Alphaproteobacteria: 4123
Deltaproteobacteria: 1343
Epsilonproteobacteria: 26
Not assigned: 1445
No hits: 220253
+++++++++++End of summary for file: B-Red-sum.txt
--- Subsample reads (20%): 67037 of 334386
Processing: tree-from-summary
Running tree-from-summary algorithm
Taxonomy:
Gammaproteobacteria: 2809
Alphaproteobacteria: 4001
Deltaproteobacteria: 1208
Epsilonproteobacteria: 15
Not assigned: 299
No hits: 461890
+++++++++++End of summary for file: B-Red-sum.txt
::::: and so on
$
$
$
$ perl -ne '{$/=""; $i=1;
>   while (/^Taxonomy:.(.*?)\+{11}/msgi) {
>     open(OUT,">outfile".$i++.".txt"); print OUT $1; close(OUT);
>   }}' input.txt
$
$
$ cat outfile1.txt
Gammaproteobacteria: 2767
Alphaproteobacteria: 4123
Deltaproteobacteria: 1343
Epsilonproteobacteria: 26
Not assigned: 1445
No hits: 220253
$
$
$ cat outfile2.txt
Gammaproteobacteria: 2809
Alphaproteobacteria: 4001
Deltaproteobacteria: 1208
Epsilonproteobacteria: 15
Not assigned: 299
No hits: 461890
$
$

tyler_durden

vgersh99 · May 8, 2009, 10:21am

nawk '
   /^Taxonomy/ {p++;close(out);out="output" ++cnt ".txt";next}
   /^[+]+End/ { p=0}
   p { print > out }' myInputFile

ghostdog74 · May 8, 2009, 10:26am

well, I am not sure I get you, but I see the other solutions include "End"; so if you think "+++++++++++" on its own is not unique enough, you can add "End":

....
if line.startswith("+++++++++++End"): 
....
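Spelled out in full, the loop with that change might look like the sketch below (Python 3 syntax). It is only an illustration: the name iter_sections and its start/end parameters are made up here, and it yields the lines of each section instead of writing files, so the caller decides what to do with them.

```python
def iter_sections(lines, start="Taxonomy:", end="+++++++++++End"):
    """Yield one list of stripped lines per section between start and end.

    iter_sections, start and end are illustrative names, not from the thread.
    A section that never reaches an end marker is silently dropped.
    """
    section = None                      # None means "not inside a section"
    for line in lines:
        line = line.strip()             # removes leading and trailing whitespace
        if line.startswith(end):
            if section is not None:
                yield section           # end marker closes the current section
            section = None
        elif line.startswith(start):
            section = []                # header seen: begin collecting
        elif section is not None:
            section.append(line)
```

Since it accepts any iterable of lines, an open file can be passed in directly, for example iter_sections(open("file")), and each yielded list can then be written to its own output file.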

iammitra · May 8, 2009, 10:39am

Hallo durden_tyler,
your Perl code works. Thanks a lot. But there is still one problem.
As I mentioned, in my input file there are varying numbers of spaces in front of the desired lines.
Is there any way to get rid of these spaces directly?
Now it is giving:

mitra:~ mitra$ cat outfile1.txt
          Gammaproteobacteria: 2767
       Alphaproteobacteria: 4123
         Deltaproteobacteria: 1343
                         Epsilonproteobacteria: 26
     Betaproteobacteria: 397
                        unclassified Proteobacteria: 48
                  Spirochaetes (class): 15
        Nitrospira (class): 1
        Bacilli: 25
  Not assigned: 1445
  No hits: 220253

Thank you very much for your help.
Best Wishes,
Mitra.

iammitra · May 8, 2009, 10:40am

Sorry, I don't know why all the spaces disappear here. But there are several spaces (not the same number for every line) in front of the desired lines.

iammitra · May 8, 2009, 10:42am

Hallo ghostdog74,
I will try this modification and see if it works.
Thank you very much.
Best,
Mitra.

durden_tyler · May 8, 2009, 11:13pm

Here's one way to do it:

perl -ne '{$/=""; $i=1;
  while (/^Taxonomy:.(.*?)\+{11}/msgi) {
    $x = $1; $x =~ s/(^|\n)\s+/\1/g;
    open(OUT,">outfile".$i++.".txt"); print OUT $x; close(OUT);
  }}' input.txt

Testing on sample data:

$ 
$ cat input.txt
Number of disabled taxa: 9
Loading mapping file: ncbi.map
Load mapping:                 
taxId2TaxLevel: 469951        
--- Subsample reads (20%): 66680 of 334386
Processing: tree-from-summary             
Running tree-from-summary algorithm       
Taxonomy:                                 
    Gammaproteobacteria: 2767             
Alphaproteobacteria: 4123                 
  Deltaproteobacteria: 1343               
     Epsilonproteobacteria: 26            
 Not assigned: 1445                       
    No hits: 220253                       
+++++++++++End of summary for file: B-Red-sum.txt
--- Subsample reads (20%): 67037 of 334386       
Processing: tree-from-summary                    
Running tree-from-summary algorithm
Taxonomy:
      Gammaproteobacteria: 2809
  Alphaproteobacteria: 4001
        Deltaproteobacteria: 1208
    Epsilonproteobacteria: 15
Not assigned: 299
    No hits: 461890
+++++++++++End of summary for file: B-Red-sum.txt
::::: and so on
$
$ perl -ne '{$/=""; $i=1;
  while (/^Taxonomy:.(.*?)\+{11}/msgi) {
    $x = $1; $x =~ s/(^|\n)\s+/\1/g;
    open(OUT,">outfile".$i++.".txt"); print OUT $x; close(OUT);
  }}' input.txt
$
$
$ cat outfile1.txt
Gammaproteobacteria: 2767
Alphaproteobacteria: 4123
Deltaproteobacteria: 1343
Epsilonproteobacteria: 26
Not assigned: 1445
No hits: 220253
$
$ cat outfile2.txt
Gammaproteobacteria: 2809
Alphaproteobacteria: 4001
Deltaproteobacteria: 1208
Epsilonproteobacteria: 15
Not assigned: 299
No hits: 461890
$
$
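For readers more at home in Python, the same regex idea might be sketched as below. This is only an illustrative translation, not part of the tested session above: SECTION and extract_sections are made-up names, and the function returns the sections as strings instead of writing outfileN.txt. The comments spell out what each piece of /^Taxonomy:.(.*?)\+{11}/msgi does.

```python
import re

# Python counterpart of the Perl pattern /^Taxonomy:.(.*?)\+{11}/msgi
#   ^ with re.M : "Taxonomy:" must be at the start of a line
#   .           : one character after the colon (usually the newline)
#   (.*?)       : non-greedy capture, stops at the FIRST run of eleven "+"
#   re.S        : "." also matches newlines, so a capture can span many lines
#   re.I        : ignore case; findall() plays the role of Perl's /g
SECTION = re.compile(r"^Taxonomy:.(.*?)\+{11}", re.M | re.S | re.I)


def extract_sections(text):
    """Return every captured section with leading whitespace removed per line.

    extract_sections is a hypothetical helper name used for illustration.
    """
    sections = []
    for body in SECTION.findall(text):
        # Same effect as the Perl substitution s/(^|\n)\s+/$1/g
        sections.append(re.sub(r"^\s+", "", body, flags=re.M))
    return sections
```

Without the "?" the capture would be greedy and run on to the last row of plus signs in the file, merging all sections into one.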

Hope that helps,
tyler_durden

____________________________________________________________
"This is your life and it's ending one minute at a time."

durden_tyler · May 8, 2009, 11:24pm

The spaces disappear here because you do not enclose your file data or code within the "code" tags. (Notice how the actual code posted by the forum members has a nice little box around it with the title "Code:" at the top.)

If you sandwich the desired text within "code" tags, without any space between "code", "]", "[" and "/" :

[ code ] <your_text_here> [ / code ]

then the leading spaces will be preserved.

Alternatively, if you are feeling too lazy to actually type the "code" tags, then you can do this -
(a) select the desired text, and
(b) click on the "#" icon in your Message Box right above the response area
The dynamic script associated with the web page will put the "code" tags for you.

HTH,
tyler_durden


iammitra · May 10, 2009, 4:08am

Hallo durden_tyler,
First of all, I want to thank you for your help. Thanks a lot... I am very new to scripting. Could you please explain the (.*?)\+{11}/msgi part of your code in this thread?

Actually I am trying to learn, so it would be really helpful. And one more question: how can I make this script executable?

My try was:
#!/usr/bin/perl -w

$#ARGV==1 or die "Usage: 2ndprocess-script 1st-output-file-as-inputfile\n";

$input=shift;

perl -ne '{$/=""; $i=1;
while (/^Taxonomy:.(.*?)\+{11}/msgi) {
$x = $1; $x =~ s/(^|\n)\s+/\1/g;
open(OUT,">outfile".$i++.".txt"); print OUT $x; close(OUT);
}}' $1;

-----------------------------
which didn't work.
Can you please help me to learn this?
Thank you very much once again.
Have a nice time.
Best wishes,
Mitra

iammitra · May 10, 2009, 4:19am

Hallo durden_tyler,
First of all, I want to thank you for your help. Thanks also for the help with writing posts; now I can use that. Thanks a lot... I am very new to scripting. Could you please explain the (.*?)\+{11}/msgi part of your code in this thread?

Actually I am trying to learn, so it would be really helpful. And one more question: how can I make this script executable?

My try was:

-----------------------------
which didn't work.
Can you please help me to learn this?
Thank you very much once again.
Have a nice time.
Best wishes,
Mitra

ghostdog74 · May 10, 2009, 7:47am

if you want to use Perl, here's another version that is more "understandable", as it relies less on regular expressions.

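$i=0;
while (<>){
 chomp;
 if (/\++End of summary for file/ ){   # end marker: stop copying
    $f=0;close(FH);next;
 }    
 if (/Taxonomy:/ ) {                    # header: start the next output file
     # ">" overwrites, so running the script twice does not duplicate lines
     open(FH,">","output_".$i++) or die "Cannot open for writing:$!\n";
     $f=1; next;
 }
 if ($f) { 
    s/^\s+//; #get rid of spaces in front
    print FH $_."\n";
  }
}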

to use the script,

# perl myscript.pl file
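One possible way to turn this into a script that is called as ./split_taxonomy.py input.txt is sketched below, in Python rather than Perl. The file name split_taxonomy.py, the function main and the outfileN.txt naming are assumptions made for illustration only.

```python
#!/usr/bin/env python3
"""Split a summary file into one output file per Taxonomy section."""
import sys


def main(argv):
    # argv[0] is the script name, so exactly one argument means len(argv) == 2
    if len(argv) != 2:
        sys.stderr.write("Usage: %s input-file\n" % argv[0])
        return 1
    try:
        src = open(argv[1])
    except OSError as err:
        sys.stderr.write("Cannot open %s: %s\n" % (argv[1], err))
        return 1
    count = 0
    out = None                          # currently open output file, if any
    with src:
        for line in src:
            line = line.strip()         # drop leading and trailing whitespace
            if line.startswith("Taxonomy:"):
                if out:
                    out.close()         # tolerate a missing end marker
                count += 1
                out = open("outfile%d.txt" % count, "w")
            elif line.startswith("+++++++++++End"):
                if out:
                    out.close()
                out = None
            elif out:
                out.write(line + "\n")
    if out:
        out.close()
    return 0


if __name__ == "__main__":
    # The return value is ignored here; wrap the call in sys.exit(...)
    # if the caller needs the exit status.
    main(sys.argv)
```

Save it, run chmod +x split_taxonomy.py once, and then call ./split_taxonomy.py input.txt. The same two steps (a #! first line plus chmod +x) apply to a Perl script. There, the file should hold the Perl statements themselves, not a perl -ne '...' shell command, and $#ARGV is the index of the last argument, so it is 0, not 1, when exactly one argument is given.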

iammitra · May 11, 2009, 4:17am

Dear ghostdog74,
My main problem is that I am very new to programming and am trying to learn, so I am not familiar with either Perl or Python; both are new to me. Can you please help me understand how to make these files executable, like a script? With the other replies too, the code works when I type it directly in the terminal, but in every case I am still unable to turn it into an executable script that takes an input file as $1.
Can you or anyone else please help me with this?
Thanks a lot for your help.
With best regards,
Mitra.

summer_cherry · May 11, 2009, 4:28am

The Perl code below should help you.

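open $fh,"<","a.txt" or die "Cannot open a.txt: $!\n";
my ($flag,$n)=(0,0);
while(<$fh>){
	if(/Taxonomy:/){
		$n++;
		$file=sprintf("outfile%s.txt",$n);
		open FH,">",$file or die "Cannot write $file: $!\n";
		$flag=1;
		next;
	}
	if(/^\++End/){		# end-of-section marker
		$flag=0;
		close FH;
		next;
	}
	if($flag==1){
		s/^\s+//;	# remove the leading spaces
		print FH $_;
	}
}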

iammitra · May 11, 2009, 4:39am

Dear All,
Thanks for your replies, code and advice.
My main problem is that I am very new to programming and am trying to learn, so I am not familiar with either Perl or Python; both are new to me. Can anybody please help me understand how to make these files executable, like a script that I can call afterwards? Suppose I call the script code.perl or code.anything-else.
Every time, I want to run ./code.perl input.txt
My 1st try was:

#!/usr/bin/perl -w

$#ARGV==1 or die "Usage: 2ndprocess-script 1st-output-file-as-inputfile\n";

$name=shift;

$inputfile="`pwd`/$name";

perl -ne '{$/=""; $i=1;
  while (/^Taxonomy:.(.*?)\+{11}/msgi) {
    $x = $1; $x =~ s/(^|\n)\s+/\1/g;
    open(OUT,">outfile".$i++.".txt"); print OUT $x; close(OUT);
  }}' inputfile;

and 2nd try was:

#!/usr/bin/perl -w

$#ARGV==1 or die "Usage: 2ndprocess-script 1st-output-file-as-inputfile\n";

$name=shift;

$inputfile="`pwd`/$name";

open $fh,"<", $inputfile;
my ($flag,$n)=(0,0);
while(<$fh>){
	if(/Taxonomy:/){
		$n++;
		$file=sprintf("outfile%s.txt",$n);
		open FH,"+>$file";
		$flag=1;
		next;
	}
	if(/\++/){
		$flag=0;
		next;
	}
	print FH $_ if $flag==1;
}

But neither of them worked as desired.
Can anybody please help me?
With best regards and many thanks,
Mitra.

iammitra · May 11, 2009, 5:53am

ghostdog74,
Thank you for your help. Your last script works, but it still produces files with spaces in front of the lines. How can I get rid of the spaces?
The output looks like
mitra:testNextPart mitra$ more output_0

  Gammaproteobacteria: 2767
        Alphaproteobacteria: 4123
          Deltaproteobacteria: 1343
          Epsilonproteobacteria: 26
        Betaproteobacteria: 397
        unclassified Proteobacteria: 48
          Elusimicrobium: 2
        candidate division WWE1: 9
          Flavobacteria: 2358
          Sphingobacteria: 136
          Bacteroidia: 162
          environmental samples: 21
          Chlorobia: 77
        Planctomycetacia: 40
        Spirochaetes (class): 15
        Nitrospira (class): 1
        Bacilli: 25
  Not assigned: 1445
  No hits: 220253

Sorry to disturb you again and again.
Thanks a lot.
With best regards,
Mitra.

iammitra · May 11, 2009, 7:46am

Dear All,
I was trying the code below to get rid of the spaces in front of the lines (see the previous post).

#!/usr/bin/perl -w

$#ARGV==0 or die "Usage: 2ndprocess-megan-script 1st-output-file-as-inputfile\n";

$i=1;
while (<>){
chomp;
   
 if (/Taxonomy:/ ) { 
     $x = $1; $x =~ s/^\s+|\s+$//g;
     open(OUT,">>","output_".$i++) or die "Cannot open for writing:$!\n";
     $f=1; next;
 }
 
 if (/\+*End of summary for file/ ){
    $f=0;close(OUT);next;
 }
 if ($f) { print OUT $_."\n";}
}

But it's not working.

Can anybody please help me to get the output in the form:
Gammaproteobacteria: 2767
Alphaproteobacteria: 4123
Deltaproteobacteria: 1343
Epsilonproteobacteria: 26
Betaproteobacteria: 397
unclassified Proteobacteria: 48
Elusimicrobium: 2
candidate division WWE1: 9
Flavobacteria: 2358
Sphingobacteria: 136
Bacteroidia: 162
environmental samples: 21
Chlorobia: 77
Planctomycetacia: 40
Spirochaetes (class): 15
Nitrospira (class): 1
Bacilli: 25
Not assigned: 1445
No hits: 220253

Thanks a lot.
Best,
Mitra.