awk script to (un)/concatenate fields in file

anthony.cros · March 22, 2010, 11:24am

Hi everyone,

I'm trying to use the "join" function for more than 1 field. Since it's not possible as it is, I want to take my input files and concatenate the joining fields as 1 field (separated by "|"). I wrote 2 awk script to do and undo it (see below). However I'm new to awk and I'm certain I could do it in a much more efficient way.

I found various topics around the question but often the syntax proposed is a bit of a mystery to me. For instance someone posted this:

BEGIN{FS=OFS="\t"}NR==FNR{a[$1$2]=$4;b[$1$2]=$5;c[$1$2]=$6;next}{$4=$4-a[$1$2];$5=$5-b[$1$2];$6=$6-c[$1$2]}1

what does the trailing '1' mean? what are there 2 separated {} and what distinguish them? finally, where can I find doc on that kind of questions (googling "awk trailing digit" didn't help me much!!)

Here are my scripts, I don't care much about syntax shortcuts, I only care about speed of execution!

any help would be greatly appreciated

to concatenate:

#!/bin/sh
#
# usage:
#     nawk -F$'\t' -v JF=3,5 -f concatene.awk ~/tmp/tmp15
#     nawk -F$'\t' -v JF=15,16,17,18 -f concatene.awk split/snp_j > concat
#
# JF stands for "join fields"
BEGIN { FS="\t";OFS="\t" }
{ 
    if (NR==1) {    # to do it only once (NR starts at 1)
        N=split(JF,JFS,",");
        for (i=1;i<=N;i++) {    # reverse it
            RJFS[JFS] = i;
        }
    }

    LINE="";
    for (FIELD_INDEX=1 ; FIELD_INDEX<=N ; FIELD_INDEX++ ) {
        LINE=(FIELD_INDEX==1 ? "" : LINE"|")$JFS[FIELD_INDEX];
    }
    for (FIELD_INDEX=1 ; FIELD_INDEX<=NF ; FIELD_INDEX++ ) {
        if (!RJFS[FIELD_INDEX]) {
            LINE=LINE"\t"$FIELD_INDEX;
        }
    }
    print LINE;
}

example:
input: a b c d e f
output: c|e a b d f

to "un"concatenate:

#!/bin/sh
# nawk -F$'\t' -v JF=3,5 -f unconcatene.awk test
BEGIN { FS="\t";OFS="\t" }
{ 
    if (NR==1) {    # to do it only once (NR starts at 1)
        N=split(JF,JFS,",");
        for (i=1;i<=N;i++) {    # reverse it
            RJFS[JFS] = i;
        }
    }

    N2=split($1,JFS2,"|");    # N=N2
    for (i=1;i<=N;i++) {    # reverse it
        RJFS[JFS] = JFS2;
    }

    SIZE=NF-1+N;
    FIELD_INDEX=2;
    LINE="";
    for (NEW_FIELD_INDEX=1 ; NEW_FIELD_INDEX<=SIZE ; NEW_FIELD_INDEX++ ) {
        LINE=LINE(NEW_FIELD_INDEX==1 ? "" : "\t");
        if (RJFS[NEW_FIELD_INDEX]) {
            LINE=(LINE)RJFS[NEW_FIELD_INDEX];
        } else {
            LINE=(LINE)$FIELD_INDEX;        
            FIELD_INDEX++;
        }
    }
    print LINE;
}

Thanks!!

example:
input: c|e a b d f
output: a b c d e f

Anthony

malcomex999 · March 23, 2010, 2:05am

Can you explain a little bit more on how you want to get this..

input: a b c d e f
output: c|e a b d f

anthony.cros · March 23, 2010, 9:55am

Hi,

First thanks for responding!
I'm not sure what you mean by "how" I want to get this but I'll give you a more thorough example:

I have this file for instance (TSV):

a    b    c    d    e    f
g    h    i    j    k    l
m    n    o    p    q    r

And say I want to join fields 2, 3 and 6 with 3 columns of another file. Because join uses only 1 field, I want to put the fields 2, 3 and 6 together separated by only pipe (as opposed to my other fields separated with tabs). So the result of the concatene.awk script will give me the following:

b|c|f    a    d    e
h|i|l    g    j    k
n|o|r    m    p    q

to do so in the current script, I pass "2,3,6" as a parameter and for each line create two arrays like:
(example for the first line only)
JFS[0] = b, JFS[1] = c, JFS[2] = f
RJFS[2] = b, RJFS[3] = c, RJFS[6] = f
from there I rebuild my line by first going through JFS with a pipe separation, then adding the other fields with a tab separation by going through the NF fields and ignoring the ones for which RJFS[field] exist.

Hope this makes more sense! I bet there is a way to do it in a much more optimized way though..!

Franklin52 · March 23, 2010, 10:08am

Can you post some lines of your input files and the desired output?

vgersh99 · March 23, 2010, 10:18am

nawk -f anthony.awk myFile
OR
nawk -v jf='2,3,6' -f anthony.awk myFile

anthony.awk:

BEGIN {
  FS=OFS="\t"

  SEP_jf="|"
  if (!jf) jf="3,5"

  n=split(jf, jfA, ",")
  for(i=1;i<=n;i++)
    jfO[jfA]
}
{
  line=jfS=""
  for(i=1;i<=NF;i++)
    if (i in jfO)
       jfS=(jfS)?jfS SEP_jf $i: $i
    else
      line=(line)?line OFS $i:$i
  print jfS, line
}

anthony.cros · March 23, 2010, 2:47pm

Thanks a lot for your response,

i see that using BEGIN is cleaner than my "if (NR==1)" and that "if (i in JFO)" exists is good to know!!

Franklin, my post from 9:55 describes it pretty well, what info are you missing?