Problem with my loop and awk script

vuvuzelo · August 18, 2014, 9:30am

Sorry if this is a super simple issue, but am extremely new to this and am trying to teach myself as I go along. But can someone please help me out?

I have a data file similar to this for many samples, for all chromosomes

Sample      Chr     bp           p       roh
Sample1	1	49598178	0	1
Sample1	1	49598207	0	1
Sample1	1	49598209	0	1
Sample1	2	49974371	0	1
Sample1	2	49974670	0	1
Sample1	2	49974931	0	1
Sample1	2	50003025	0.000001	1
 etc etc

I would like to determine the maximum and minimum bp for each chromosome for each sample, and then output the distance between the min and max bp for each chromosome for each sample.

As I am super new to all this, and I have noone to help me, I was wondering if someone here could help me out?

I have so far written this:

for sample in `listofsamples.txt`
do
awk 'chr=$2; for (chr=1; chr<=25; chr++) { 
     NR==1 { min=$3; max=$3; length=0; next } 
           { max < $3 {max=$3} min > $3 {min=$3} } 
     END { roh=(max-min)/1000000; print "sample", "chr", "min", "max", "length"; print $sample, chr, min, max, length }}' awktestfile.txt
done

And I keep getting the thi following syntax error message

'/file: line 2: syntax error near unexpected token `do
'/file: line 2: `do

I have no idea what it means - please help? Any advice would be greatly appreciated.

Best

V

blackrageous · August 18, 2014, 11:40am

This line seems wrong

for sample in `listofsamples.txt`

The back ticks imply that listofsamples.txt is a command. It seems you want
to process the contents of listofsamples.txt. This would make more sense

for sample in `cat listofsamples.txt`

but I am not sure that is what you really need. What is in the file awktestfile.txt?

vuvuzelo · August 18, 2014, 11:49am

Thanks blackrageous. The awktestfile.txt is the data file I am trying to analyse.

Don_Cragun · August 18, 2014, 1:01pm

For the sample input shown, what output are you hoping to produce?

Are all lines for a given Sample/Chr pair on adjacent lines in your input file? And, if not, how big are your input files?

vuvuzelo · August 19, 2014, 4:51am

Thanks Don, I have a single data file which simply lists all base pairs for all chromosomes for the samples one under the other, so the file is quite large (there are over 3000 samples). Ideally I would like to output for each sample, the max and min bp for each chromosome eg.

Sample1	1	49598178	49598209
Sample1	2	49974371	50003025

From this I think I should be able to continue my analysis.

Thanks

RudiC · August 19, 2014, 5:25am

Try

awk     'NR==1          {next}
                        {samchr=$1" "$2}
         samchr != osc  {print osc, MIN, MAX, (MAX-MIN)/10000; osc=samchr; MIN=1E24; MAX=-1E24}
         $3 > MAX       {MAX = $3}
         $3 < MIN       {MIN = $3}
         END            {print samchr, MIN, MAX, (MAX-MIN)/10000}
        ' file
Sample1 1 49598178 49598209 0.0031
Sample1 2 49974371 50003025 2.8654

vuvuzelo · August 19, 2014, 12:23pm

Thanks RudiC, that's amazing. I seem to always over complicate it.

penza88 · September 18, 2014, 1:52am

I should be able to continue my analysis.

vuvuzelo · October 20, 2014, 9:00am

Thanks to everyone who helped me out in the past with this problem, but I seem to have found another error - not in the script kindly provided but in hindsight I should've thought this through better. My apologies to all for re-opening this thread, please let me know if I need to open a nee thread?

I did not think that my regions of interest might not be in one block, but actually broken up, an example is shown below:

Sample	Chr	bp	p	value
id1	chr1	26073240	0.000000	1
id1	chr1	26073242	0.000000	1
id1	chr1	26075748	0.000000	1
id1	chr1	26075769	0.000000	1
id1	chr1	26075787	0.000000	1
id1	chr1	26079982	0.000000	1
id1	chr1	26080019	0.000000	1
id1	chr1	26080052	0.000000	1
id1	chr1	26085161	0.000000	1
id1	chr1	26085191	0.000000	1
id1	chr1	26090437	0.000000	1
id1	chr1	26098259	0.000000	1
id1	chr1	26204597	0.000000	1
id1	chr1	26204623	0.000000	1
id1	chr1	26204770	0.000000	1
id1	chr1	26207499	0.000000	1
id1	chr1	26207507	0.000000	1
id1	chr1	26207550	0.000000	1
id1	chr1	26209126	0.000000	1
id1	chr1	26209180	0.000000	1
id1	chr1	26210208	0.000000	1
id1	chr1	26210236	0.000000	1
id1	chr1	26210298	0.000000	1
id1	chr1	26210308	0.000000	1
id1	chr1	26210309	0.000000	1
id1	chr1	26210311	0.000000	1

The previous script gave me the following output:

id1	chr1	26073240 26210311 1.37071

However, on line 12 (excl headers) there is a massive gap, how can I change my script to reflect the following:

id1	chr1	26073240  26098259  0.25019
id1	chr1	26204597  26210311  0.05714

I have made numerous attempts, but dork here cannot figure it out. Any help would be greatly assisted.

Thanks a million

vuvuzelo

RudiC · October 20, 2014, 9:30am

How would you tell a gap acceptable from another inacceptable?

vuvuzelo · October 20, 2014, 9:34am

Doh! Any gap greater than 200,000bp would be considered unacceptable.

RudiC · October 20, 2014, 9:35am

Then you have just one block above! The gap is 106338 only...

vuvuzelo · October 20, 2014, 9:39am

Yes RudiC, the input is just an example to illustrate my problem.

Here is a modified file:

Sample	Chr	bp	p	value
id1	chr1	26073240	0.000000	1
id1	chr1	26073242	0.000000	1
id1	chr1	26075748	0.000000	1
id1	chr1	26075769	0.000000	1
id1	chr1	26075787	0.000000	1
id1	chr1	26079982	0.000000	1
id1	chr1	26080019	0.000000	1
id1	chr1	26080052	0.000000	1
id1	chr1	26085161	0.000000	1
id1	chr1	26085191	0.000000	1
id1	chr1	26090437	0.000000	1
id1	chr1	26098259	0.000000	1
id1	chr1	26304597	0.000000	1
id1	chr1	26304623	0.000000	1
id1	chr1	26304770	0.000000	1
id1	chr1	26307499	0.000000	1
id1	chr1	26307507	0.000000	1
id1	chr1	26307550	0.000000	1
id1	chr1	26309126	0.000000	1
id1	chr1	26309180	0.000000	1
id1	chr1	26310208	0.000000	1
id1	chr1	26310236	0.000000	1
id1	chr1	26310298	0.000000	1
id1	chr1	26310308	0.000000	1
id1	chr1	26310309	0.000000	1
id1	chr1	26310311	0.000000	1

RudiC · October 20, 2014, 9:42am

Try

awk     'NR==1          {next}
                        {samchr=$1" "$2}
         samchr != osc ||
         ($3-OV)>GAP    {print osc, MIN, MAX, (MAX-MIN)/10000; osc=samchr; MIN=1E24; MAX=-1E24}
         $3 > MAX       {MAX = $3}
         $3 < MIN       {MIN = $3}
                        {OV=$3}
         END            {print samchr, MIN, MAX, (MAX-MIN)/10000}
        ' GAP="2E5" file
id1 chr1 26073240 26098259 2.5019
id1 chr1 26304597 26310311 0.5714

vuvuzelo · October 20, 2014, 12:36pm

Once again that's fantastic. Much appreciated.