awk to print unique text in field

cmccabe · October 8, 2015, 1:17pm

I am trying to use awk to print the unique labels (the text before each colon) in $2.

In the example below there are 3 lines, but 2 of them carry the same label in $2, so that label should appear only once in the output.

File.txt

chr17:29667512-29667673 NF1:exon.1;NF1:exon.2;NF1:exon.38;NF1:exon.4;NF1:exon.46;NF1:exon.47 703.807
chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284
chr16:89838075-89838232 FANCA:exon.23;FANCA:exon.4 583.497

Desired output

NF1
FANCA

awk '!seen[$2]++ {lines[i++]=$2} END {for (i in lines) if (seen[lines[i]]==1) print lines[i]}' file.txt > output.txt

is close, but it prints the whole of $2 rather than just the unique label

Current output

NF1:exon.14;NF1:exon.16;NF1:exon.8
NF1:exon.13;NF1:exon.22;NF1:exon.28;NF1:exon.30;NF1:exon.4
CTC1:exon.1;CTC1:exon.16;CTC1:exon.2;CTC1:exon.3;CTC1:exon.5
CTC1:exon.1;CTC1:exon.10;CTC1:exon.2;CTC1:exon.3

Desired output

NF1
CTC1

Thank you

Scrutinizer · October 8, 2015, 2:01pm

You could try something like this:

awk '{n=split($2,F,/[:;]/); for(i=1; i<=n; i+=2) A[F[i]]} END{for(i in A) print i}' file

The idea is to use the array only for the parts that you want to find. Those parts are a subset of $2, and split() is used here to break $2 down further.
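As a rough illustration of what split() produces, here is a minimal sketch. It assumes a shortened sample value taken from the first line of File.txt, and show_split is only a helper name made up for this example. The odd-numbered elements hold the labels, which is why the loop above steps by 2.

```shell
# show_split is a hypothetical helper, not part of the original solution.
# It splits its argument on ':' or ';' and prints each index with its element.
show_split() {
  printf '%s\n' "$1" |
  awk '{n=split($1,F,/[:;]/); for(i=1; i<=n; i++) print i, F[i]}'
}

show_split 'NF1:exon.1;NF1:exon.2'
# prints:
# 1 NF1
# 2 exon.1
# 3 NF1
# 4 exon.2
```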

-- Edit --
It could be shortened so that the end section can be left out:

awk '{n=split($2,F,/[:;]/); for(i=1; i<=n; i+=2) if(!A[F[i]]++) print F[i]}' file
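The shortened form relies on the common !A[x]++ idiom: the expression is true only the first time a given x is seen, because the counter is still 0 at that point. A minimal sketch of the idiom on its own (first_seen is a made-up helper name, and the input values are just labels from this thread):

```shell
# first_seen is a hypothetical helper: it passes through only the first
# occurrence of each input line, keeping the original input order.
first_seen() {
  awk '!seen[$0]++'
}

printf '%s\n' NF1 FANCA NF1 FANCA | first_seen
# prints:
# NF1
# FANCA
```

One side effect is that the labels come out in the order they first appear, whereas the order of a "for (i in A)" loop in the END section is unspecified.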

--
Without the split() function, using the field separator to split the fields:

awk -F'[:; ]' '{for(i=3; i<=NF-2; i+=2) if(!A[$i]++) print $i}' file
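To see why that loop starts at 3 and stops at NF-2, it may help to list how a line is numbered with this field separator. A small sketch, using the second line of File.txt as input (show_fields is a helper name invented for the example):

```shell
# show_fields is a hypothetical helper: it splits one line on ':', ';' or
# a space and prints every field number next to its content.
show_fields() {
  printf '%s\n' "$1" |
  awk -F'[:; ]' '{for(i=1; i<=NF; i++) print i, $i}'
}

show_fields 'chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284'
# prints:
# 1 chr16
# 2 89877104-89877220
# 3 FANCA
# 4 exon.4
# 5 FANCA
# 6 exon.5
# 7 159.284
```

Fields 1 and 2 come from the coordinate, the last field is the number, and the one before it is the final exon, so the labels sit at 3, 5, ... up to NF-2. This assumes the columns are separated by a single space, as in the sample.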

drl · October 8, 2015, 2:08pm

Hi.

Also with the usual commands:

cut -f2 -d" " file |
tee f1 |
cut -f1 -d":" |
tee f2 |
sort -u

producing:

FANCA
NF1

on a system like:

OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
cut (GNU coreutils) 6.10
sort (GNU coreutils) 6.10

See files f1,f2 for intermediate output.

Best wishes ... cheers, drl

Scrutinizer · October 8, 2015, 2:24pm

@drl, that works if there is always only one kind of label per line...
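A small sketch of the case I mean, assuming a hypothetical line whose second field mixes two labels (GENEA and GENEB are made-up names, as are the two helper functions):

```shell
# Hypothetical input line with two different labels in field 2.
line='chr1:100-200 GENEA:exon.1;GENEB:exon.2 1.0'

# The cut-only pipeline keeps just the text before the first ':' of field 2.
first_label_only() {
  printf '%s\n' "$1" | cut -f2 -d" " | cut -f1 -d":" | sort -u
}

# The split() approach looks at every label in field 2.
every_label() {
  printf '%s\n' "$1" |
  awk '{n=split($2,F,/[:;]/); for(i=1; i<=n; i+=2) if(!A[F[i]]++) print F[i]}'
}

first_label_only "$line"   # prints GENEA only
every_label "$line"        # prints GENEA, then GENEB
```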

drl · October 8, 2015, 4:20pm

Hi, Scrutinizer.

Yes, I see what you mean, thanks; however, more kinds of labels were not presented in the sample data. If I have time, I'll think about a solution for that ... cheers, drl

---------- Post updated at 15:20 ---------- Previous update was at 15:10 ----------

Hi.

To correct the possible flaw noted by Scrutinizer:

#!/usr/bin/env bash

# @(#) s2	Demonstrate extract unique string from specific field.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cut tr sort

FILE=${1-data2}

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
cut -f2 -d" " $FILE |
tee f1 |
tr ';' '\n' |
tee f2 |
cut -f1 -d":" |
tee f3 |
sort -u

exit 0

producing:

$ ./s2

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
cut (GNU coreutils) 6.10
tr (GNU coreutils) 6.10
sort (GNU coreutils) 6.10

-----
 Input data file data2:
chr17:29667512-29667673 NF1:exon.1;NF1:exon.2;NF1:exon.38;NF1:exon.4;NF1:exon.46;NF1:exon.47 703.807
chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284
chr16:89838075-89838232 FANCA:exon.23;FANCA:exon.4 583.497
chr18:89838075-89838232 DARK:exon.23;FANCA:exon.4 583.497
chr19:89838075-89838232 PARK:exon.23;DARK:exon.4 583.497

-----
 Results:
DARK
FANCA
NF1
PARK

Best wishes ... cheers, drl

cmccabe · October 9, 2015, 7:54am

Thank you both