awk to parse field and include the text of 1 pipe in field 4

I am trying to parse the input below with awk so that the |gc= part stays in $4, but I have not been able to. The command below is close.
awk so far:

awk '{sub(/\|[^[:blank:]]+[[:blank:]]+[0-9]+/, ""); print }' input.txt

Input

chr1    955543  955763  AGRN-6|pr=2|gc=75   0   + 
chr1    957571  957852  AGRN-7|pr=3|gc=61.2 0   + 
chr1    970621  970740  AGRN-8|pr=1|gc=57.1 0   + 

Current Output

chr1    955543  955763  AGRN-6  + 
chr1    957571  957852  AGRN-7  + 
chr1    970621  970740  AGRN-8  + 

Desired Output (each field separated by a tab)

chr1    955543  955763  AGRN-6|gc=75    + 
chr1    957571  957852  AGRN-7|gc=61.2  + 
chr1    970621  970740  AGRN-8|gc=57.1  + 

awk '{
    printf("%s\t%s\t%s\t%s\t%s\n", $1, $2, $3, $4, $6)
}' oldfile > newfile

Just do not print column #5, assuming your input examples are correct. You can also set the awk OFS (output field separator) variable to get tab separation.
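As a rough sketch of the OFS idea (not necessarily what the poster had in mind), setting OFS in a BEGIN block makes every comma in `print` emit a tab, so the `printf` format string is not needed. The sample line is copied from the question; the variable names `line` and `out` are only for illustration. Note that this still leaves |pr=2 in field 4.

```shell
# One sample line from the question (fields separated by runs of blanks).
line='chr1    955543  955763  AGRN-6|pr=2|gc=75   0   + '

# OFS is set before any input is read; each comma in print becomes a tab.
# Field 5 is dropped simply by leaving $5 out of the print list.
out=$(printf '%s\n' "$line" | awk 'BEGIN { OFS = "\t" } { print $1, $2, $3, $4, $6 }')

printf '%s\n' "$out"
```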


That `awk` produces:

chr1    955543    955763    AGRN-6|pr=2|gc=75    +    
 
chr1    957571    957852    AGRN-7|pr=3|gc=61.2    +  
   
chr1    970621    970740    AGRN-8|pr=1|gc=57.1    +

The |pr=2, |pr=3, and |pr=1 parts are not needed, and there appears to be a blank line after each row, which may be problematic for later analysis.

Thank you :).

awk '{n=split($4, a, "|"); print $1, $2, $3, a[1]"|"a[n], $6}' cmccabe.file

or:

awk '{n=split($4, a, "|"); print $1,$2,$3,a[1]"|"a[n],$6}' OFS="\t" cmccabe.file

I had something similar @Aia

awk '{split($4,a,"|"); print $1,$2,$3,a[1],"|",a[3],$6}' input
chr1 955543 955763 AGRN-6 | gc=75 + 
chr1 957571 957852 AGRN-7 | gc=61.2 + 
chr1 970621 970740 AGRN-8 | gc=57.1 +

but that outputs everything on one line. Your awk is much better, thank you :).


Yes, the commas in your print statement get translated into OFS.
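To make the difference concrete, here is a small sketch comparing the two forms on just the fourth field from the question. With commas, `print` puts OFS (a single space by default) between the pieces; with plain juxtaposition, awk concatenates them with nothing in between. The variable names are only for illustration.

```shell
field='AGRN-6|pr=2|gc=75'

# Commas: each piece is a separate print argument, joined by OFS (default: one space).
with_commas=$(printf '%s\n' "$field" | awk '{ split($1, a, "|"); print a[1], "|", a[3] }')

# No commas: the pieces are concatenated into a single string before printing.
concatenated=$(printf '%s\n' "$field" | awk '{ split($1, a, "|"); print a[1] "|" a[3] }')

printf '%s\n%s\n' "$with_commas" "$concatenated"
```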

Here's a Perl alternative:

 perl -pe 's/(\|\w+=[\w\.]+){1,2}\s+\d+/$1/' cmccabe.file 

How about

awk '{sub(/\|.*\|/, "|")}1' file
chr1    955543  955763  AGRN-6|gc=75   0   + 
chr1    957571  957852  AGRN-7|gc=61.2 0   + 
chr1    970621  970740  AGRN-8|gc=57.1 0   +

?

Part of the requirement is to remove the 5th field.

awk '{sub (/\|.*\|/, "|", $4)}{print $1,$2,$3,$4,$6}' file
chr1 955543 955763 AGRN-6|gc=75 +
chr1 957571 957852 AGRN-7|gc=61.2 +
chr1 970621 970740 AGRN-8|gc=57.1 +
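
Since the desired output in the question is tab-separated, one possible variation (a sketch combining the suggestions in this thread, not a tested drop-in for the real file) is to set OFS to a tab as well. The two sample lines are copied from the question; `input` and `out` are illustrative names.

```shell
# Two sample lines from the question, fed through a pipe instead of a file.
input='chr1    955543  955763  AGRN-6|pr=2|gc=75   0   + 
chr1    957571  957852  AGRN-7|pr=3|gc=61.2 0   + '

# sub() collapses everything from the first pipe to the last pipe in $4 into a
# single pipe; OFS = "\t" makes the commas in print emit tabs; $5 is left out.
out=$(printf '%s\n' "$input" |
      awk 'BEGIN { OFS = "\t" } { sub(/\|.*\|/, "|", $4); print $1, $2, $3, $4, $6 }')

printf '%s\n' "$out"
```

On a real file the same awk program would take the file name as its last argument in place of the pipe.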