I am trying to use awk to print the unique labels (the text before each ":") found in $2.
In the example below there are 3 lines, but 2 of them carry the same label in $2,
so that label should appear only once in the output.
File.txt
chr17:29667512-29667673 NF1:exon.1;NF1:exon.2;NF1:exon.38;NF1:exon.4;NF1:exon.46;NF1:exon.47 703.807
chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284
chr16:89838075-89838232 FANCA:exon.23;FANCA:exon.4 583.497
Desired output
NF1
FANCA
awk '!seen[$2]++ {lines[i++]=$2} END {for (i in lines) if (seen[lines[i]]==1) print lines[i]}' file.txt > output.txt
This is close, but it prints the entire field rather than just the unique label.
Current output
NF1:exon.14;NF1:exon.16;NF1:exon.8
NF1:exon.13;NF1:exon.22;NF1:exon.28;NF1:exon.30;NF1:exon.4
CTC1:exon.1;CTC1:exon.16;CTC1:exon.2;CTC1:exon.3;CTC1:exon.5
CTC1:exon.1;CTC1:exon.10;CTC1:exon.2;CTC1:exon.3
Desired output
NF1
CTC1
Thank you
You could try something like this:
awk '{n=split($2,F,/[:;]/); for(i=1; i<=n; i+=2) A[F[i]]} END{for(i in A) print i}' file
The idea is that you only use the array for the parts that you want to find. Those parts are a subset of $2, so split() is used to break $2 down further on ":" and ";".
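To make the idea concrete, here is a spelled-out sketch of the same approach. The function name unique_labels and the single sample line are only illustrative, not part of the original command. After split(), the labels sit at the odd indices of F, and merely referencing A[...] is enough to create the key.

```shell
# unique_labels is a hypothetical helper name used only for this illustration.
unique_labels() {
  awk '{
    # Break $2 on ":" and ";" so that F[1]=label, F[2]=exon, F[3]=label, ...
    n = split($2, F, /[:;]/)
    # Odd indices hold the labels; referencing A[...] creates the array key.
    for (i = 1; i <= n; i += 2) A[F[i]]
  }
  # The order of "for (label in A)" is unspecified in awk.
  END { for (label in A) print label }' "$@"
}

printf '%s\n' 'chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284' | unique_labels
```

For that one line it prints FANCA once. With several labels the output order is not guaranteed, so pipe through sort if the order matters.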
-- Edit --
It could be shortened so that the end section can be left out:
awk '{n=split($2,F,/[:;]/); for(i=1; i<=n; i+=2) if(!A[F[i]]++) print F[i]}' file
--
Without the split() function, using the field separator to split the fields:
awk -F'[:; ]' '{for(i=3; i<=NF-2; i+=2) if(!A[$i]++) print $i}' file
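To see why the loop starts at field 3 and stops at NF-2, it can help to print the numbered fields for one sample line. This is only an illustrative sketch; show_fields is a made-up name.

```shell
# show_fields is a hypothetical helper name used only for this illustration.
show_fields() {
  # Same field separator as above: split on ":", ";" or a space.
  awk -F'[:; ]' '{ for (i = 1; i <= NF; i++) printf "%d=%s\n", i, $i }' "$@"
}

printf '%s\n' 'chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284' | show_fields
```

Here fields 1 and 2 come from the first column, the labels are at the odd positions 3 and 5, and field 7 (NF) is the trailing number, so NF-2 is the last label. Note that this relies on the first column containing exactly one ":" and the last column containing none, as in the sample data.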
drl
October 8, 2015, 2:08pm
3
Hi.
Also with the usual commands:
cut -f2 -d" " file |
tee f1 |
cut -f1 -d":" |
tee f2 |
sort -u
producing:
FANCA
NF1
on a system like:
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian 5.0.8 (lenny, workstation)
cut (GNU coreutils) 6.10
sort (GNU coreutils) 6.10
See files f1,f2 for intermediate output.
Best wishes ... cheers, drl
@drl, that works only if there is always just one kind of label per line...
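A small sketch of the case in question, assuming a made-up line whose second column mixes two labels (first_label_only is just an illustrative name for the cut | cut | sort -u pipeline without the tee files):

```shell
# first_label_only is a hypothetical helper name used only for this illustration.
first_label_only() {
  # Take column 2, then keep only the text before the first ":".
  cut -f2 -d" " "$@" | cut -f1 -d":" | sort -u
}

printf '%s\n' 'chr18:89838075-89838232 DARK:exon.23;FANCA:exon.4 583.497' | first_label_only
```

This prints only DARK; the FANCA label after the ";" is dropped because the second cut keeps just the text before the first ":".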
drl
October 8, 2015, 4:20pm
5
Hi, Scrutinizer.
Yes, I see what you mean, thanks; however, more kinds of labels were not presented in the sample data. If I have time, I'll think about a solution for that ... cheers, drl
---------- Post updated at 15:20 ---------- Previous update was at 15:10 ----------
Hi.
To correct the possible flaw noted by Scrutinizer:
#!/usr/bin/env bash
# @(#) s2 Demonstrate extract unique string from specific field.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cut tr sort
FILE=${1-data2}
pl " Input data file $FILE:"
cat "$FILE"
pl " Results:"
cut -f2 -d" " "$FILE" |
tee f1 |
tr ';' '\n' |
tee f2 |
cut -f1 -d":" |
tee f3 |
sort -u
exit 0
producing:
$ ./s2
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution : Debian 5.0.8 (lenny, workstation)
bash GNU bash 3.2.39
cut (GNU coreutils) 6.10
tr (GNU coreutils) 6.10
sort (GNU coreutils) 6.10
-----
Input data file data2:
chr17:29667512-29667673 NF1:exon.1;NF1:exon.2;NF1:exon.38;NF1:exon.4;NF1:exon.46;NF1:exon.47 703.807
chr16:89877104-89877220 FANCA:exon.4;FANCA:exon.5 159.284
chr16:89838075-89838232 FANCA:exon.23;FANCA:exon.4 583.497
chr18:89838075-89838232 DARK:exon.23;FANCA:exon.4 583.497
chr19:89838075-89838232 PARK:exon.23;DARK:exon.4 583.497
-----
Results:
DARK
FANCA
NF1
PARK
Best wishes ... cheers, drl