Run a program for each record, name the output files after its parameters, and replace each output file's contents with the max 4th column

jacobs.smith · January 23, 2013, 12:01pm

Hi Friends,

This is the only approach I can think of for my task, so any help is highly appreciated.

I have a file

cat input1.bed

chr1 100 200 abc
chr1 120 300 def
chr1 145 226 ghi
chr2 567 600 unix

Now, I have another file named input2.bed (a binary file, not readable in the terminal).

But there is a program in our field that takes input2.bed as input and runs like this:

program input_file -chrom -start -end output_file

Now, my task is this:

Read each record of input1.bed.
Feed it to the program as shown below, so that the program runs once for each record in input1.bed and generates an output file named after that record:

program input2.bed -chrom=chr1 -start=100 -end=200 chr1_100_200_op.bed
program input2.bed -chrom=chr1 -start=120 -end=300 chr1_120_300_op.bed
program input2.bed -chrom=chr1 -start=145 -end=226 chr1_145_226_op.bed
program input2.bed -chrom=chr2 -start=567 -end=600 chr2_567_600_op.bed

For example, consider the first output file, chr1_100_200_op.bed.

cat chr1_100_200_op.bed

chr1 110 120 45.67
chr1 177 189 98.50
chr1 195 200 111.11

Now, ignore the first three columns of the above output file and take the maximum value in the fourth column, which is 111.11. Then replace the entire contents of chr1_100_200_op.bed with just the file name (without the _op.bed suffix) followed by that maximum, like this:

cat chr1_100_200_op.bed

chr1_100_200 111.11

This is it. Please ask me as many questions as you have for a better solution. Thanks a ton for all your time.

Corona688 · January 23, 2013, 1:27pm

while read CHROM START END NAME
do
        # Create the bed file
        program input2.bed -chrom=$CHROM -start=$START -end=$END ${CHROM}_${START}_${END}_op.bed

        # Replace column 1 with filename,
        # column 2 with the last column,
        # reduce it to 2 columns,
        # and print all lines.
        awk '{$1=F ; $2=$NF; NF=2 } 1' F="${CHROM}_${START}_${END}" ${CHROM}_${START}_${END}_op.bed > /tmp/$$
        cat /tmp/$$ > ${CHROM}_${START}_${END}_op.bed
done < input1.bed
# Remove temporary file
rm -f /tmp/$$

For 3 and 4, you start with 3 lines and end with 1 line. Is this intended? I've assumed it's not, that you want 3 lines out for 3 lines in.

jacobs.smith · January 23, 2013, 1:43pm

Hi Corona,

Thanks for your time.

For 3 and 4, the output file usually has thousands of records. But I only want the maximum value of the fourth column, printed next to the filename.

So the three records are dropped and only one record remains, as in the example.

Corona688 · January 23, 2013, 1:49pm

while read CHROM START END NAME
do
        # Create the bed file
        program input2.bed -chrom=$CHROM -start=$START -end=$END ${CHROM}_${START}_${END}_op.bed

        # Track the maximum of the last column in M,
        # then print a single line at the end:
        # the filename prefix F followed by that maximum.
        awk '(!M)||(M<$NF){ M=$NF } END { print F, M }' F="${CHROM}_${START}_${END}" ${CHROM}_${START}_${END}_op.bed > /tmp/$$
        # Overwrite the bed file with that single line
        cat /tmp/$$ > ${CHROM}_${START}_${END}_op.bed
done < input1.bed
# Remove temporary file
rm -f /tmp/$$
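
To check the max-of-last-column step without the real program, it can be tried on the sample rows from the first post. This is only a sketch: max_last_col is a made-up helper name, the sample file is a temp file, and the condition is written as NR == 1 instead of !M (a small variation, so that a maximum of 0 or all-negative values would also work).

```shell
# Hypothetical helper (not part of any tool):
# prints "<label> <maximum of the last column>" for one file.
max_last_col() {
    # $1 = label to print, $2 = file to scan
    awk -v F="$1" '
        # The first line seeds M; later lines replace it when numerically larger.
        NR == 1 || $NF + 0 > M + 0 { M = $NF }
        END { print F, M }
    ' "$2"
}

# Sample rows copied from chr1_100_200_op.bed in the first post
sample=$(mktemp)
printf '%s\n' 'chr1 110 120 45.67' 'chr1 177 189 98.50' 'chr1 195 200 111.11' > "$sample"

result=$(max_last_col chr1_100_200 "$sample")
echo "$result"
rm -f "$sample"
```

M keeps the original field text, so the value is printed exactly as it appears in the file.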

jacobs.smith · January 24, 2013, 9:46am

Thanks corona for your quick solution. It took me a while to make my input files and cross check the output files.

The only problem I am getting is that for some combinations of start and end there is no data in my input2.bed.

So the output file has an empty fourth column for those records, for example like this

cat output.bed
chr1 100 200 45.09999
chr1 120 130 
chr1 145 178 78.999

How do I replace that empty space on column 4 with "ND"?

My output would be

cat output.bed
chr1 100 200 45.09999
chr1 120 130 ND
chr1 145 178 78.999

Corona688 · January 24, 2013, 10:29am

Try sed -i 's/ $/ ND/' output.bed
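
Alternatively, the empty case could be handled when each file is written, instead of patching output.bed afterwards. A rough sketch, assuming the program leaves an empty output file when there is no data for a region; max_or_nd is a made-up helper name, and the NR == 1 test is a small variation on the !M test used earlier.

```shell
# Hypothetical helper: like the max step in the loop,
# but prints "ND" when the file has no records at all.
max_or_nd() {
    # $1 = label to print, $2 = file to scan
    awk -v F="$1" '
        NR == 1 || $NF + 0 > M + 0 { M = $NF }
        # NR is 0 in END when no lines were read
        END { print F, (NR ? M : "ND") }
    ' "$2"
}

# Simulate a region with no data: an empty output file (assumption)
empty=$(mktemp)
: > "$empty"

nd_result=$(max_or_nd chr1_120_130 "$empty")
echo "$nd_result"
rm -f "$empty"
```

If the program does not create the file at all for such regions, awk will fail on the missing file, so that case would need a separate check.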

jacobs.smith · January 24, 2013, 12:17pm

It's not generating any output.

Corona688 · January 24, 2013, 1:08pm

That's expected: -i tells GNU sed to edit the original file in place, so nothing is printed.

If you want a new file, leave off the -i and redirect the output to a new file.
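
For example, something like this keeps output.bed untouched and writes the result to a second file. The sample rows are the ones posted above, written to a temp directory so it can be tried safely; output_nd.bed is just a name picked for illustration.

```shell
# Work in a scratch directory so no real files are touched
workdir=$(mktemp -d)

# Sample rows from the post; the second row ends with a trailing space
printf '%s\n' 'chr1 100 200 45.09999' 'chr1 120 130 ' 'chr1 145 178 78.999' > "$workdir/output.bed"

# No -i: sed writes to stdout, which is redirected to a new file
sed 's/ $/ ND/' "$workdir/output.bed" > "$workdir/output_nd.bed"

filled=$(cat "$workdir/output_nd.bed")
echo "$filled"
rm -rf "$workdir"
```

This only matches lines that end in a single trailing space, which is what the awk print F, M produces when M is empty.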