Dear experts,
I am a relative novice in Unix and came across a very useful piece of code that I regularly use in my research without really understanding it. Could one of the more experienced members briefly explain what the code actually does?
Many thanks in advance.
The script is:
awk '!(a[$1]) {a[$1]=$0; next} a[$1] {w=$1; $1=""; a[w]=a[w] $0} END {for (i in a) print a[i]}' FS="\t" OFS="\t" A.txt
The original title is: Find duplicates in column 1 and merge their lines (awk?)
awk '
!(a[$1]) {a[$1] = $0            # if the array element indexed by $1 is unset, empty or 0,
                                # set it to the line (i.e. collect first occurrences of $1)
          next                  # continue with the next input line
         }
a[$1]    {w = $1                # if set, save $1 in a temporary variable
          $1 = ""               # and empty it ($0 is rebuilt with OFS, so the separator stays)
          a[w] = a[w] $0        # then append the rest of the line to the respective array element
         }
END      {for (i in a) print a[i]   # print all elements containing collected lines
                                # be aware that the order of elements is unspecified
         }
' FS="\t" OFS="\t" file
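To see the effect on actual data, here is a small sketch that runs the script on made-up input. The sample lines, the temporary file and the trailing sort are my own assumptions; the sort is there only because the order of "for (i in a)" is unspecified. The END block prints a[i], since the index is needed to print a single element.

```shell
# Hypothetical sample data: tab-separated, key in column 1.
sample=$(mktemp)
printf 'id1\tx\nid2\ty\nid1\tz\nid3\tq\nid2\tr\n' > "$sample"

merged=$(awk '
!(a[$1]) { a[$1] = $0; next }                 # first occurrence: keep the whole line
a[$1]    { w = $1; $1 = ""; a[w] = a[w] $0 }  # repeat: append the remaining fields
END      { for (i in a) print a[i] }
' FS="\t" OFS="\t" "$sample" | sort)          # sort only to get a stable order

rm -f "$sample"
printf '%s\n' "$merged"
```

With this input the result is three tab-separated lines: "id1 x z", "id2 y r" and "id3 q". Each key appears once, followed by the remaining fields of all lines that shared it.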
Please note how consistent structuring (e.g. indentation) of the code helps in reading / understanding / seeing patterns in it.
A comment:
The existence test a[$1] can give different results on different awk versions, and as a side effect it creates an empty array element if there was none.
A better test is ($1 in a).
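The side effect is easy to check in isolation. The following is only a sketch with made-up array names (a, b) and a made-up key ("k"); it counts the elements with a loop rather than length(array), which not every awk supports.

```shell
counts=$(awk 'BEGIN {
    if (a["k"]) print "unexpected"      # value test: referencing a["k"] creates it
    n = 0; for (k in a) n++             # count the elements of a
    if ("k" in b) print "unexpected"    # membership test: b stays empty
    m = 0; for (k in b) m++             # count the elements of b
    print n, m
}')
printf '%s\n' "$counts"
```

This prints "1 0": the plain reference left one empty element behind in a, while the "in" test left b untouched.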
I think one should recode the whole thing:
awk '{i=$1; $1=""; a[i]=(a[i] $0)} END {for (i in a) print (i a[i])}' FS="\t" OFS="\t" A.txt
This version stores $1 (field #1) only as an index, not as part of the value. Therefore, in the END block the index is printed before the value.
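A quick run of this shorter version on made-up data, as a sketch: the input lines and the temporary file are assumptions, and the output is piped through sort only because the loop order is unspecified.

```shell
tmp=$(mktemp)
printf 'k1\ta\tb\nk2\tc\nk1\td\n' > "$tmp"    # hypothetical input, key in column 1

rewritten=$(awk '
    { i = $1; $1 = ""; a[i] = (a[i] $0) }   # value starts with the separator, key is only the index
END { for (i in a) print (i a[i]) }         # so the index is put back in front when printing
' FS="\t" OFS="\t" "$tmp" | sort)

rm -f "$tmp"
printf '%s\n' "$rewritten"
```

The result is two tab-separated lines, "k1 a b d" and "k2 c". Because every stored value begins with a tab, gluing the index in front needs no extra separator.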