Modifying col values based on another col

newbie83 · January 25, 2013, 11:36am

Hi,

Please help with this.
I have several excel files (with and .xlsx format) with 10-15 columns each.
They all have the same type of data but the columns are not ordered in the same way.

Here is a 3 column example. What I want to do add the alphabet
from column 2 to column 3, provided they both have the keyword 'BAM'
in them. Also the prefix 12 needs to be removed (if it is present before BAM).
Also the second column may have a space between BAM and 2A, that space needs to be ignored. So BAM2A is the same as BAM 2A, similarly if space is present in 3rd col, that has to be removed in the output.
So 12BAM2 -1 becomes BAM2A-1.
The last row remains unchanged because both column dont have BAM.

Input sample

20	12BAM 2A		12BAM2 -1
20	12BAM 2A		12BAM2 -2
50	12BAM2A   		12BAM2 -3
20	12BAM2B  		12BAM2 -4
20	12BAM 2B		12BAM2 -5
520	12BAM 2A		XA3023

Expected output

20	12BAM 2A		BAM2A-1
20	12BAM 2A		BAM2A-2
50	12BAM2A  	   	BAM2A-3
20	12BAM2B  		BAM2B-4
20	12BAM 2B		BAM2B-5
520	12BAM 2A		XA3023

Yoda · January 25, 2013, 12:07pm

awk -F'\t' '{
 i=$2;j=$2;
 sub(/[0-9]+/,"",i);
 sub(/[0-9].*/,"",i);
 sub(" ","",i);
 sub(i," ",j);
 sub(/^[0-9]+/,"",j);
 if($4~i) {
  sub(/^[0-9]+/,"",$4);
  sub(/[0-9]+/,j,$4);
  gsub(" ","",$4);
 }
}1' OFS='\t' file

newbie83 · January 25, 2013, 12:12pm

thanks, will this work on the excel files?

Corona688 · January 25, 2013, 12:17pm

excel can so easily export and import tab-separated data that there's little point installing 19,000 perl modules and writing 3,000 lines of code just to write excel files.

Excel files are more complicated in several ways, too. You have to start worrying about multiple worksheets for instance, formulas, merged columns, and all of that. And the .xlsx files are particularly obnoxious.

newbie83 · January 25, 2013, 12:28pm

I will try to run the code now with exported text, but naively looking at the code, it doesn't consider the keyword 'BAM' to be matched for both columns, is there another way that is being matched?

Yoda · January 25, 2013, 12:30pm

Actually it is considering the keyword BAM

I wrote it in a generic way so that other keywords will also be matched. Try running for a set of data and verify if you are getting the desired output. Just make sure you preserve the tab spaces since the code is using it as field separator.

newbie83 · January 25, 2013, 12:45pm

the code runs perfect with the sample data,

One of my actual input files has col 2 as col 8, and col 3 as col 7.
I can replace $2 by $8, what should i replace $4 with ? there is no mention of col3 in the code.
Sorry for the naive question :-/

Yoda · January 25, 2013, 12:55pm

In this code we are modifying just $4 . In the sample data that you provided $3 is actually blank because there are 2 tab spaces after $2

So basically you have to identify new positions of $2 & $4 and replace it accordingly in this code.

You can use od to view the tab spaces or use awk itself to find the field positions:

od -c filename

awk -F'\t' '{ print $1, $2, $3, $4, $5, $6, $7, $8; }' OFS=# filename

I hope this helps.

newbie83 · January 25, 2013, 1:06pm

I understand now, that helped a lot, thank you

Got desired op, thanks a ton

Scrutinizer · January 25, 2013, 1:24pm

Alternatively:

awk '$NF~/BAM/{sub(/.* /,$2,$NF); sub(/../,x,$NF); gsub(" ",x,$NF)}1' FS='\t' OFS='\t'  file