Hello,
I have a script that is generating a tab delimited output file.
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
0 compound_00 -3.5054 -1.1207 -2.4372
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
I need to trim this down so that there are no duplicates in the first column. Actually, the entire row would be a duplicate, but I don't see any reason to look at anything other than the index value. There is no particular rationale to the order and there could be any number of duplicates of a given row.
The final results should look like this,
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
I need one, and only one, instance of each index value ("num" column value) in the file, not just the lines with num values that appear only once. There always seems to be some confusion about that with discussions of "unique" lines.
The only thing I could think of was to sort the rows on the num column value and then loop through checking if the num value was equal to the previous line. If it is not equal, copy it to a new array, etc.
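The sort-then-compare idea described above can be sketched roughly as follows. This is only an illustration under some assumptions: the input is tab-delimited with the header on line 1, and the function name dedup_sorted and the shortened three-column sample data are made up for the example.

```shell
# Hypothetical sample file: tab-delimited, header on line 1 (columns shortened).
input=$(mktemp)
printf 'num\tName\tPCA_A1\n0\tcompound_00\t-3.5054\n1\tcompound_01\t-2.2641\n0\tcompound_00\t-3.5054\n' > "$input"

dedup_sorted() {
    # Print the header unchanged, then sort the remaining rows numerically on
    # column 1 and keep a row only when its num differs from the previous row.
    head -n 1 "$1"
    tail -n +2 "$1" | sort -t "$(printf '\t')" -k1,1n |
        awk -F '\t' 'NR == 1 || $1 != prev { print } { prev = $1 }'
}

result=$(dedup_sorted "$input")
printf '%s\n' "$result"
rm -f "$input"
```

Note that this reorders the rows by num, which is fine here only because the order carries no meaning.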
Any suggestions? There always seems to be some simple one line solution that I don't know about.
LMHmedchem
Hello LMHmedchem,
Could you please try the following and let me know if it helps.
awk 'NR==1{print;next} {A[$1]=$0;C=C<$1?$1:C} END{for(i=0;i<=C;i++){if(i in A){print A[i]}}}' Input_file
Output will be as follows.
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
Thanks,
R. Singh
RudiC
July 23, 2016, 3:15am
3
It is always worthwhile to comb through these forums for similar problems and their solutions; five examples are usually offered at the bottom of this page (of which at least three solve your problem), and more may be available, helping you to help yourself.
Anyway, try
awk '!T[$1]++' file
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
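For anyone puzzled by the one-liner: T[$1]++ evaluates to the number of times the first field has been seen before the current line, so !T[$1]++ is true only on the first occurrence, and awk's default action prints the line. A longhand sketch of the same logic is below; the function name dedup_first_seen and the tiny sample input are invented for illustration.

```shell
# Longhand equivalent of: awk '!T[$1]++' file
dedup_first_seen() {
    awk '{
        if (!($1 in seen)) {   # first time this column-1 value appears
            print              # emit the whole line (the header passes too)
        }
        seen[$1] = 1           # remember the key for later lines
    }' "$@"
}

# Reads stdin when no file name is given.
sample=$(printf 'num Name\n0 a\n1 b\n0 a\n3 c\n1 b\n' | dedup_first_seen)
printf '%s\n' "$sample"
```

No sorting is needed and the input order is preserved; the header survives because "num" is simply the first key seen.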
drl
July 23, 2016, 9:03am
4
Hi.
Seems like sort with the unique option works for me:
#!/usr/bin/env bash
# @(#) s1 Demonstrate remove all identical lines, sort.
# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C sort pass-fail
FILE=${1-data1}
pl " Input data file $FILE:"
head $FILE
pl " Expected output:"
head expected-output.txt
pl " Results:"
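# -u with -k1,1 keeps one line per distinct first field. Keys compare as
# strings here; use -k1,1n if num can exceed 9 and numeric order matters.
sort -u -k1,1 "$FILE" |
tee f1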
pass-fail f1 expected-output.txt
exit 0
producing
$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution : Debian 8.4 (jessie)
bash GNU bash 4.3.30
sort (GNU coreutils) 8.23
pass-fail - ( local: RepRev 1.6, ~/bin/pass-fail, 2016-07-23 )
-----
Input data file data1:
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
0 compound_00 -3.5054 -1.1207 -2.4372
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
-----
Expected output:
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
-----
Results:
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 -1.0224
4 compound_04 -1.1845 -0.3377 -2.9453
7 compound_07 -0.2988 1.3539 -1.6114
8 compound_08 2.6872 -1.3726 -5.9732
9 compound_09 -1.4546 -0.8284 -3.5016
-----
Comparison of 7 created lines with 7 lines of desired results:
Succeeded -- files (computed) f1 and (standard) expected-output.txt have same content.
The pass-fail code is basically just a wrapper around cmp for some extra checking and reporting.
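For readers without that local utility, a minimal stand-in could look something like the sketch below. It is not the actual pass-fail script; the function name check_same, the messages, and the temporary test files are assumptions made for the example.

```shell
check_same() {
    # cmp -s is silent and returns 0 only when the two files are byte-identical.
    if cmp -s "$1" "$2"; then
        printf 'Succeeded -- %s and %s have same content.\n' "$1" "$2"
    else
        printf 'Failed -- %s and %s differ.\n' "$1" "$2"
        return 1
    fi
}

# Hypothetical test files to exercise both outcomes.
tmpdir=$(mktemp -d)
printf '0 a\n1 b\n' > "$tmpdir/f1"
printf '0 a\n1 b\n' > "$tmpdir/expected"
printf '0 a\n2 b\n' > "$tmpdir/other"

same_rc=0; check_same "$tmpdir/f1" "$tmpdir/expected" || same_rc=$?
diff_rc=0; check_same "$tmpdir/f1" "$tmpdir/other" || diff_rc=$?
rm -rf "$tmpdir"
```

In the script above it would be called as check_same f1 expected-output.txt after the sort step.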
Best wishes ... cheers, drl