Complex data sorting in excel files or text files

AAWT · June 6, 2012, 8:50am

Dear all,
I have a complex data file shown below,,,,,

A_ABCD_13208   0   0   4.16735   141044   902449   1293900   168919   
C_ABCD_13208   0   0   4.16735   141044   902449   1293900   168919 
A_ABCDEF715   52410.9   18598.2   10611   10754.7   122535   252426   36631.4 
C_DBCDI_1353   0   26.512   0   93.9469   114151   94382.8   19043.1      
A_DBCDI_1353   0   26.512   0   93.9469   114151   94382.8   19043.1       
C_EFGH_24808   0   0   11.1129   5281.16   108786   146594   49778.1       
A_EFGH_9099   0   0   11.1129   5281.16   108786   146594   49778.1       
C_QRST_9938   0   0   0   2992.88   77887.8   60751.7   5253.41       
A_QRST_9938   0   0   0   2992.88   77887.8   60751.7   5253.41       
A_XVYZ_24808   0   0   0   33505.5   69088.4   167365   90621.9      
C_GHIH_9099   0   0   0   33505.5   69088.4   167365   90621.9      
C_TRST_7849   0   0   22.2259   2107.09   33073.1   42576.2   39891.2       
A_TRST_7849   0   0   22.2259   2107.09   33073.1   42576.2   39891.2       
A_ABCDI_15931   28.998   30.9306   11.1129   17966.2   32947.8   17405.4   3993.58       
A_ABCDI_15930   28.998   30.9306   11.1129   17966.2   32947.8   17405.4   3993.58      
C_GHJK_30564   0   0   0   214.736   30435.4   68135.6   69661.6       
A_GHJK_30564   0   0   0   214.736   30435.4   68135.6   69661.6       
C_STDT_2657   0   0   5.55647   1503.15   27929   101912   63628.2       
A_STDT_2657   0   0   5.55647   1503.15   27929   101912   63628.2

I want to sort to get following information

1: how many are pairs of A and C in the table? (A_ABCD_13208 and C_ABCD_13208)
2: How many only A and only C?
3: As I have many tables , how can I compare two tables for the unique ids in first column and for common ids??
4: Is there any software to create a Venn diagram for seven data sets???

Thanks a lot

joeyg · June 6, 2012, 9:31am

Here are some starts to your first couple questions:

$ cut -d" " -f1 <sample13.txt | cut -c2- | uniq -d
_ABCD_13208
_DBCDI_1353
_QRST_9938
_TRST_7849
_GHJK_30564
_STDT_2657

$ cut -d" " -f1 <sample13.txt | cut -c2- | uniq -u
_ABCDEF715
_EFGH_24808
_EFGH_9099
_XVYZ_24808
_GHIH_9099
_ABCDI_15931
_ABCDI_15930

The first example above shows the duplicated entries. The 2nd example shows unique lines. A 'wc' command could be appended to get a count.

AAWT · June 6, 2012, 9:42am

Hi Joeyg,

its not possible to get new file with A_ and C_ to be sure about pairs and unique.
this command give without A & C.

thanks

joeyg · June 6, 2012, 9:45am

$ cut -d" " -f1 <sample13.txt | sed 's/^[AC]/~/' | uniq -d
~_ABCD_13208
~_DBCDI_1353
~_QRST_9938
~_TRST_7849
~_GHJK_30564
~_STDT_2657

In this example, which would work for #1 and #2 on your list, I change first characters of A or C to ~.

AAWT · June 8, 2012, 3:09am

I am sorry but it is not giving me required result. now it is giving this sign
~ instead of A and C

ygemici · June 8, 2012, 9:35am

## A and C pair count ("A_" line and if next line "C_")
# awk '!/^ *$/{if(w==1){x1=xy;x2=$1}else{if($1)x1=$1;getline p;if(p)x2=substr(p,0,13);else next};if(x1~/^A_/){if(x2~/^C_/){c++;w=0}else{w=1;xy=x2}}}
END{print c}' file

## A record counts
# awk '/^A/{a[x++]=substr(p,0,1)}END{print x}' file

## C record counts
# awk '/^C/{a[x++]=substr(p,0,1)}END{print x}' file

## first column and counts with uniq
# awk '!/^ *$/{!f[$1]++}END{for(i in f)print i,f}' file

regards
ygemici

AAWT · June 8, 2012, 12:44pm

Thanks a lot,,
but its only giving me counts like, 5432, 0r 2345

for A and C
is it possible to get new file with all A or C or A C pairs

Regards

ygemici · June 8, 2012, 3:11pm

like this ?

# awk '!/^ *$/{if(w==1){x1=xy;x2=$1}else{if($1)x1=$1;getline;if($0)x2=$1; else next};if(x1~/^A_/){if(x2~/^C_/){w=0;print x1 FS x2}else{w=1;xy=x2}}
}' file >A_C_pairs
# awk '/^A/{a[x++]=$1}END{for(i=0;i<x;i++)print a}' file >A
# awk '/^C/{a[x++]=$1}END{for(i=0;i<x;i++)print a}' file >C

# cat A
A_ABCD_13208
A_ABCDEF715
A_DBCDI_1353
A_EFGH_9099
A_QRST_9938
A_XVYZ_24808
A_TRST_7849
A_ABCDI_15931
A_ABCDI_15930
A_GHJK_30564
A_STDT_2657

# cat C
C_ABCD_13208
C_DBCDI_1353
C_EFGH_24808
C_QRST_9938
C_GHIH_9099
C_TRST_7849
C_GHJK_30564
C_STDT_2657

# cat A_C_pairs
A_ABCD_13208 C_ABCD_13208
A_ABCDEF715 C_DBCDI_1353
A_DBCDI_1353 C_EFGH_24808
A_EFGH_9099 C_QRST_9938
A_XVYZ_24808 C_GHIH_9099
A_ABCDI_15930 C_GHJK_30564
A_GHJK_30564 C_STDT_2657

AAWT · June 11, 2012, 2:43am

Hi Ygemici,

I got still some problem, in A_C_pairs, I am not getting the pairs of ids, I need to have same id only with A or C at start, but I am getting different As and Cs in file. ?????
As and Cs ids which are in pair should not present in A or C file.
Thanks

ygemici · June 11, 2012, 3:09am

hi
what is your desired output ?

AAWT · June 11, 2012, 3:25am

aawt:

Dear all,
I have a complex data file shown below,,,,,

A_ABCD_13208   0   0   4.16735   141044   902449   1293900   168919   
C_ABCD_13208   0   0   4.16735   141044   902449   1293900   168919 
A_ABCDEF715   52410.9   18598.2   10611   10754.7   122535   252426   36631.4 
C_DBCDI_1353   0   26.512   0   93.9469   114151   94382.8   19043.1      
A_DBCDI_1353   0   26.512   0   93.9469   114151   94382.8   19043.1       
C_EFGH_24808   0   0   11.1129   5281.16   108786   146594   49778.1       
A_EFGH_9099   0   0   11.1129   5281.16   108786   146594   49778.1       
C_QRST_9938   0   0   0   2992.88   77887.8   60751.7   5253.41       
A_QRST_9938   0   0   0   2992.88   77887.8   60751.7   5253.41       
A_XVYZ_24808   0   0   0   33505.5   69088.4   167365   90621.9      
C_GHIH_9099   0   0   0   33505.5   69088.4   167365   90621.9      
C_TRST_7849   0   0   22.2259   2107.09   33073.1   42576.2   39891.2       
A_TRST_7849   0   0   22.2259   2107.09   33073.1   42576.2   39891.2       
A_ABCDI_15931   28.998   30.9306   11.1129   17966.2   32947.8   17405.4   3993.58       
A_ABCDI_15930   28.998   30.9306   11.1129   17966.2   32947.8   17405.4   3993.58      
C_GHJK_30564   0   0   0   214.736   30435.4   68135.6   69661.6       
A_GHJK_30564   0   0   0   214.736   30435.4   68135.6   69661.6       
C_STDT_2657   0   0   5.55647   1503.15   27929   101912   63628.2       
A_STDT_2657   0   0   5.55647   1503.15   27929   101912   63628.2

I want to sort to get following information

1: how many are pairs of A and C in the table? (A_ABCD_13208 and C_ABCD_13208)
2: How many only A and only C?
3: As I have many tables , how can I compare two tables for the unique ids in first column and for common ids??
4: Is there any software to create a Venn diagram for seven data sets???

Thanks a lot

Actually I need three outputs 1:which have all ids in pairs

 A_ABCD_13208  C_ABCD_13208 
A_QRST_9938     C_QRST_9938

2: with all unique C

C_EFGH_24808
C_ABC_1593

with all unique As

 A_ABCDEF715 
A_ABCDI_15930

Thanks

ygemici · June 11, 2012, 5:41am

aawt:

Actually I need three outputs 1:which have all ids in pairs
 A_ABCD_13208  C_ABCD_13208 
A_QRST_9938     C_QRST_9938
2: with all unique C
C_EFGH_24808
C_ABC_1593
with all unique As
 A_ABCDEF715 
A_ABCDI_15930
Thanks

1-) A and C ids in the pairs

# awk '/^A_/{a[$1]++;}/^C_/{c[$1]++}END{for(i in a){split(i,s,"_");for(j in c){split(j,ss,"_");
if(s[3]==ss[3])print i,j}}}' file
A_GHJK_30564 C_GHJK_30564
A_TRST_7849 C_TRST_7849
A_QRST_9938 C_QRST_9938
A_DBCDI_1353 C_DBCDI_1353
A_XVYZ_24808 C_EFGH_24808
A_EFGH_9099 C_GHIH_9099
A_STDT_2657 C_STDT_2657
A_ABCD_13208 C_ABCD_13208

2-) C uniqs (result is nothing because of every C records (ids) have for A ids)

# awk '{x=$1;split($1,s,"_");sub("[^0-9]*","",$1);a[x]=$1 FS s[1];}END{for(i in a){split(a,ss);j=ss[1] FS v2;
if(ss[2]==v1){for(jj in a)if(a[jj]!=j)x++}if(x==NR)print i;x=0}}' file v1="C" v2="A"

3-) A uniqs

# awk '{x=$1;split($1,s,"_");sub("[^0-9]*","",$1);a[x]=$1 FS s[1];}END{for(i in a){split(a,ss);j=ss[1] FS v2;
if(ss[2]==v1){for(jj in a)if(a[jj]!=j)x++}if(x==NR)print i;x=0}}' file v1="A" v2="C"
A_ABCDEF715
A_ABCDI_15930
A_ABCDI_15931

regards
ygemici

AAWT · June 11, 2012, 7:16am

Hi Ygemici,

It is still not giving me desired output,
i don't know where is problem

# awk '/^A_/{a[$1]++;}/^C_/{c[$1]++}END{for(i in a){split(i,s,"_");for(j in c){split(j,ss,"_"); if(s[3]==ss[3])print i,j}}}' file
 A_GHJK_30564 C_GHJK_30564 
A_TRST_7849 C_TRST_7849 
A_QRST_9938 C_QRST_9938 
A_DBCDI_1353 C_DBCDI_1353
 A_XVYZ_24808 C_EFGH_24808 
A_EFGH_9099 C_GHIH_9099 
A_STDT_2657 C_STDT_2657 
A_ABCD_13208 C_ABCD_13208

this command gave me output like

       
A_DBCDI_1353C_GHIH_9099     
 A_DBCDI_1353C_EFGH_24808      
 A_DBCDI_1353C_STDT_2657       
A_DBCDI_1353C_ABCD_13208,,,,,,,,,,,,,,,,

and it is adding same A id for 1000 times and also for C ids ,
its giving me a huge data file with a lot of repetition of records,

looking for help,,,,,,,,

Regards

ygemici · June 11, 2012, 8:13am

aawt:

Hi Ygemici,

It is still not giving me desired output,
i don't know where is problem
# awk '/^A_/{a[$1]++;}/^C_/{c[$1]++}END{for(i in a){split(i,s,"_");for(j in c){split(j,ss,"_"); if(s[3]==ss[3])print i,j}}}' file
 A_GHJK_30564 C_GHJK_30564 
A_TRST_7849 C_TRST_7849 
A_QRST_9938 C_QRST_9938 
A_DBCDI_1353 C_DBCDI_1353
 A_XVYZ_24808 C_EFGH_24808 
A_EFGH_9099 C_GHIH_9099 
A_STDT_2657 C_STDT_2657 
A_ABCD_13208 C_ABCD_13208
this command gave me output like
   
A_DBCDI_1353C_GHIH_9099     
 A_DBCDI_1353C_EFGH_24808      
 A_DBCDI_1353C_STDT_2657       
A_DBCDI_1353C_ABCD_13208,,,,,,,,,,,,,,,,   
and it is adding same A id for 1000 times and also for C ids ,
its giving me a huge data file with a lot of repetition of records,

looking for help,,,,,,,,

Regards

what is your O.S ?
your full input file ? (can you attach it ?)

AAWT · June 11, 2012, 8:17am

I am sorry,,,O.S abbreviate for??
I'll try to send you the input file.

Regards

ygemici · June 11, 2012, 10:50am

Hi Abdul, I looked at your file, try this

# awk 'NR!=1{y=$1;gsub("[0-9]*","",y);split($1,s,"_");gsub("[^0-9]*","",$1);a[c++]=y FS $1}END{for(j=0;j<c;j++){
y=split(a[j],s,"[_ ]*");for(i=0;i<c;i++){x=split(a,ss,"[_ ]*");if(s[y]==ss[x]&&s[1]==v2&&ss[1]==v1){
sub(" ","",a[j]);sub(" ","",a);print a[j],a;}}}}' yourfile v1="C" v2="A"
A_JCVI_13208 C_JCVI_13208
A_JCVI_1353 C_JCVI_1353
A_JCVI_9099 C_JCVI_9099
A_JCVI_9938 C_JCVI_9938
A_JCVI_24808 C_JCVI_24808
A_DN191394 C_DN191394
A_JCVI_7849 C_JCVI_7849
A_JCVI_30564 C_JCVI_30564
A_JCVI_2657 C_JCVI_2657
A_AM062153 C_AM062153
A_JCVI_3665 C_JCVI_3665
A_JCVI_36467 C_JCVI_36467
............................
............................

regards
ygemici

AAWT · June 12, 2012, 5:37am

Dear Ygemici,
thanks a lot for your effort,,,,,but there is still some problem
check line 4 and 5, that are not pairs ,,,these are still not giving the required output.

A_DY000990 C_JCVI_990
A_EX091309 C_EX091309
A_AM388645 C_AM388645
A_AM386941 C_RC_AM386941
A_AB012650 C_JCVI_12650
A_DN191394 C_DN191394

Is it possible to get in this form in a new text file,,,,if possible with all columns of values also.

A_DY000990 
C_JCVI_990
A_EX091309 
C_EX091309
A_AM388645 
C_AM388645

sorry but i am really not good to make changes by myself in command

thanks a lot

Regards

ygemici · June 13, 2012, 4:01am

aawt:

Dear Ygemici,
thanks a lot for your effort,,,,,but there is still some problem
check line 4 and 5, that are not pairs ,,,these are still not giving the required output.
A_DY000990 C_JCVI_990
A_EX091309 C_EX091309
A_AM388645 C_AM388645
A_AM386941 C_RC_AM386941
A_AB012650 C_JCVI_12650
A_DN191394 C_DN191394 
Is it possible to get in this form in a new text file,,,,if possible with all columns of values also.
A_DY000990 
C_JCVI_990
A_EX091309 
C_EX091309
A_AM388645 
C_AM388645
sorry but i am really not good to make changes by myself in command

thanks a lot

Regards

Hi abdul, your output is different for me..
Because of your input file does not containing some outpus ( in your file that you send to my mail )
look

# grep C_RC_AM386941 abdulfile
# grep C_JCVI_12650 abdulfile
# grep C_JCVI_990 abdulfile

# grep 191394 abdulfile
A_DN191394      0       0       16.6694 23272   38590.3 18954.3 315.157
C_DN191394      0       0       11.1129 18776   34445.7 27008.5 13766.6

if you sent the full data(pls rar/zip the full text and send) to my mail i re-try to this

AAWT · June 13, 2012, 4:18am

Dear Ygemici,

I sent you the whole set.
Please find in your email.
Regards

ygemici · June 13, 2012, 4:17pm

ok i got it
i added a conditon to code..lets try this..(code still working )

# awk 'NR!=1{y=$1;gsub("[0-9]*","",y);split($1,s,"_");gsub("[^0-9]*","",$1);a[c++]=y FS $1}END{for(j=0;j<c;j++){
y=split(a[j],s,"[_ ]*");for(i=2;i<y;i++)yy=yy?yy FS s:s;for(i=0;i<c;i++){x=split(a,ss,"[_ ]*");
for(k=2;k<x;k++)xx=xx?xx FS ss[k]:ss[k];;if(s[y]==ss[x]&&s[1]==v2&&ss[1]==v1&&xx==yy){
sub(" ","",a[j]);sub(" ","",a);print a[j];print a;}xx=""}yy=""}}' abdulfull_file v1="C" v2="A"
A_JCVI_13208
C_JCVI_13208
A_JCVI_1353
C_JCVI_1353
....................
....................
....................

I m try to code at the now (for A_ and C_ uniqs)
When the test finishes (your file is 89110 lines so it takes a while), then i write the code..

---------- Post updated at 11:17 PM ---------- Previous update was at 02:12 PM ----------

and the others (uniqs)

2-) only A_ specific records

# awk 'NR!=1{y=$1;gsub("[0-9]*","",y);split($1,s,"_");gsub("[^0-9]*","",$1);a[c++]=y FS $1}END{for(j=0;j<c;j++){
y=split(a[j],s,"[_ ]*");if(s[1]==v2){ll=a[j];for(i=2;i<y;i++)yy=yy?yy FS s:s;for(i=0;i<c;i++){x=split(a,ss,"[_ ]*");
for(k=2;k<x;k++)xx=xx?xx FS ss[k]:ss[k];;if(s[y]==ss[x]&&xx==yy){if(s[1]==v2&&ss[1]==v1){ac_uniq++;};
if(s[1]==v2&&ss[1]==v2){aa_uniq++}}xx="";};if(ac_uniq==0){gsub(FS,"",ll);b[ll];};yy="";;aa_uniq=ac_uniq=0}};for(uniqa in b)print uniqa}' 
abdulfile v1="C" v2="A" >uniq_A

3-) only C_ specific records

# awk 'NR!=1{y=$1;gsub("[0-9]*","",y);split($1,s,"_");gsub("[^0-9]*","",$1);a[c++]=y FS $1}END{for(j=0;j<c;j++){
y=split(a[j],s,"[_ ]*");if(s[1]==v1){ll=a[j];for(i=2;i<y;i++)yy=yy?yy FS s:s;for(i=0;i<c;i++){x=split(a,ss,"[_ ]*");
for(k=2;k<x;k++)xx=xx?xx FS ss[k]:ss[k];;if(s[y]==ss[x]&&xx==yy){if(s[1]==v1&&ss[1]==v2){ac_uniq++;};
if(s[1]==v1&&ss[1]==v1){cc_uniq++}}xx="";};if(ac_uniq==0){gsub(FS,"",ll);b[ll];};yy="";ac_uniq=cc_uniq=0}};for(uniqc in b)print uniqc}' 
abdulfile v1="C" v2="A" >uniq_C

regards
ygemici