Hi everyone,
I'm trying to use the "join" command on more than one field. Since join only accepts a single join field, I want to preprocess my input files so that the joining fields are concatenated into one field (separated by "|"). I wrote two awk scripts to do and undo this (see below). However, I'm new to awk and I'm sure it could be done much more efficiently.
I found various topics around the question but often the syntax proposed is a bit of a mystery to me. For instance someone posted this:
BEGIN{FS=OFS="\t"}NR==FNR{a[$1$2]=$4;b[$1$2]=$5;c[$1$2]=$6;next}{$4=$4-a[$1$2];$5=$5-b[$1$2];$6=$6-c[$1$2]}1
What does the trailing '1' mean? Why are there two separate {} blocks, and what distinguishes them? Finally, where can I find documentation on this kind of question (googling "awk trailing digit" didn't help me much!!)?
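As a reference point for the questions above, here is the same one-liner spread over several lines with comments. It is only a sketch: the shell wrapper name subtract_first_file is made up for illustration, and "awk" stands for whichever awk (nawk, gawk, ...) is installed.

```sh
# Hypothetical wrapper: subtract_first_file FILE1 FILE2
subtract_first_file() {
  awk '
    # An awk program is a list of rules of the form:  pattern { action }
    # BEGIN is a special pattern: its action runs once, before any input.
    BEGIN { FS = OFS = "\t" }

    # Rule with a pattern: NR (records read so far) equals FNR (records
    # read in the current file) only while the first file is being read.
    # "next" skips the remaining rules for that record.
    NR == FNR { a[$1$2] = $4; b[$1$2] = $5; c[$1$2] = $6; next }

    # Rule with no pattern: runs for every record that gets this far,
    # i.e. every record of the second file.
    { $4 = $4 - a[$1$2]; $5 = $5 - b[$1$2]; $6 = $6 - c[$1$2] }

    # Rule with no action: the pattern 1 is always true, and a missing
    # action defaults to { print }, so the modified record is printed.
    1
  ' "$1" "$2"
}
```

This structure is described in the awk man page and in the POSIX awk specification under pattern-action statements.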
Here are my scripts. I don't care much about syntax shortcuts; I only care about speed of execution!
Any help would be greatly appreciated.
to concatenate:
#!/bin/sh
#
# usage:
# nawk -F$'\t' -v JF=3,5 -f concatene.awk ~/tmp/tmp15
# nawk -F$'\t' -v JF=15,16,17,18 -f concatene.awk split/snp_j > concat
#
# JF stands for "join fields"
BEGIN { FS="\t";OFS="\t" }
{
if (NR==1) { # to do it only once (NR starts at 1)
N=split(JF,JFS,",");
for (i=1;i<=N;i++) { # reverse it
RJFS[JFS[i]] = i;
}
}
LINE="";
for (FIELD_INDEX=1 ; FIELD_INDEX<=N ; FIELD_INDEX++ ) {
LINE=(FIELD_INDEX==1 ? "" : LINE"|")$JFS[FIELD_INDEX];
}
for (FIELD_INDEX=1 ; FIELD_INDEX<=NF ; FIELD_INDEX++ ) {
if (!RJFS[FIELD_INDEX]) {
LINE=LINE"\t"$FIELD_INDEX;
}
}
print LINE;
}
example:
input: a b c d e f
output: c|e a b d f
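The example can be reproduced with a more compact variant of the same idea. This is a sketch, not a benchmarked replacement: the function name concat_fields and the array names are made up for illustration, and it assumes the join-field list is passed as a comma-separated argument without spaces. The list is parsed once in BEGIN (variables set with -v are already available there) instead of being tested with NR==1 on every record.

```sh
# Hypothetical wrapper: concat_fields 3,5 < input > output
concat_fields() {
  awk -v JF="$1" '
    BEGIN {
      FS = OFS = "\t"
      n = split(JF, jfs, ",")                 # jfs[1..n] = join field numbers
      for (i = 1; i <= n; i++) is_key[jfs[i]] = 1
    }
    {
      line = $(jfs[1])                        # first join field starts the key
      for (i = 2; i <= n; i++) line = line "|" $(jfs[i])
      for (i = 1; i <= NF; i++)               # then every non-join field, in order
        if (!(i in is_key)) line = line OFS $i
      print line
    }'
}

printf 'a\tb\tc\td\te\tf\n' | concat_fields 3,5
```

Using "i in is_key" tests for membership without creating new array entries, which is not the case when an element is read with is_key[i].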
to "un"concatenate:
#!/bin/sh
# nawk -F$'\t' -v JF=3,5 -f unconcatene.awk test
BEGIN { FS="\t";OFS="\t" }
{
if (NR==1) { # to do it only once (NR starts at 1)
N=split(JF,JFS,",");
for (i=1;i<=N;i++) { # reverse it
RJFS[JFS[i]] = i;
}
}
N2=split($1,JFS2,"|"); # N2 should equal N
for (i=1;i<=N;i++) { # map each original column number to its key value
RJFS[JFS[i]] = JFS2[i];
}
SIZE=NF-1+N;
FIELD_INDEX=2;
LINE="";
for (NEW_FIELD_INDEX=1 ; NEW_FIELD_INDEX<=SIZE ; NEW_FIELD_INDEX++ ) {
LINE=LINE(NEW_FIELD_INDEX==1 ? "" : "\t");
if (NEW_FIELD_INDEX in RJFS) { # membership test, so empty or "0" key values are not skipped
LINE=(LINE)RJFS[NEW_FIELD_INDEX];
} else {
LINE=(LINE)$FIELD_INDEX;
FIELD_INDEX++;
}
}
print LINE;
}
Thanks!!
example:
input: c|e a b d f
output: a b c d e f
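The reverse example can be reproduced the same way. Again a sketch with made-up names (unconcat_fields, pos, key), assuming the same comma-separated field list that was used for concatenation and that the key values themselves contain no "|".

```sh
# Hypothetical wrapper: unconcat_fields 3,5 < input > output
unconcat_fields() {
  awk -v JF="$1" '
    BEGIN {
      FS = OFS = "\t"
      n = split(JF, jfs, ",")
      for (i = 1; i <= n; i++) pos[jfs[i]] = i   # original column -> index in key
    }
    {
      split($1, key, "[|]")                      # key[1..n] = the joined values
      total = NF - 1 + n                         # number of columns to rebuild
      src = 2                                    # next non-key field to copy
      line = ""
      for (col = 1; col <= total; col++) {
        val = (col in pos) ? key[pos[col]] : $(src++)
        line = (col == 1) ? val : line OFS val
      }
      print line
    }'
}

printf 'c|e\ta\tb\td\tf\n' | unconcat_fields 3,5
```

The separator is written as "[|]" so that it is read as a literal "|" whether or not the awk in use treats the third argument of split as a regular expression.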
Anthony