Joining files in a complex way

stateperl · March 5, 2010, 7:11pm

If the labels in the 1st row of input1 (S1, S2, S3 or any others; the original text file has many more) match the values in the 1st column of input2, i.e. "ID", merge the two files based on those input1 labels.
Take S1 as an example.

input1

"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"

input2

"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

output

"ID"    "Label"    "StYPE"    "Ntype"    "Stype_No"    "log"
"S1"    "xxx"    "A/A"    1    6    2.8
"S1"    "xxx"    "A/G"    2    2    3
"S1"    "xxx"    "G/G"    3    1    4
"S2"    "yyy"    "A/A"    1    1    6.8
"S2"    "yyy"    "A/G"    2    2    7
"S2"    "yyy"    "G/G"    3    6    7.4
"S2"    "yyy"    "NULL"    "null"    "null"    8
"S3"    "zzz"    "A/A"    1    3    12
"S3"    "zzz"    "A/G"    2    3    14
"S3"    "zzz"    "G/G"    3    3    16
"S3"    "zzz"    "NULL"    "null"    "null"    18
"S3"    "zzz"    "NULL"    "null"    "null"    20

The 4th column just prints 1 for A/A, 2 for A/G and 3 for G/G.
The 5th column in output.csv is the number of times that letter pair occurs under the given label in input1.csv (S1 with A/A = 6 times, S1 with A/G = 2 times, and so on).
The 6th column is just the corresponding S1/S2/S3 log values from input2 (S1 has 2.8, 3 and 4).

Note: the null values are caused by excess log values, i.e. there are log values but no Stype or Ntype left to pair them with (S2 and S3 have excess log values; in the output they appear as null entries next to the log value).
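
To make the 5th-column rule concrete, here is a minimal Perl sketch of the counting step only. It assumes the fields are separated by whitespace and wrapped in double quotes, as in the samples above; the sub name `count_pairs` is made up for illustration and the three data rows are a cut-down copy of input1.

```perl
use strict;
use warnings;

# Hypothetical helper: tally how often each letter pair occurs under each
# column label. Takes the lines of an input1-style file and returns a hash
# keyed by "label,pair", e.g. "S1,A/A".
sub count_pairs {
    my @lines = @_;
    my (@labels, %count);
    for my $line (@lines) {
        (my $clean = $line) =~ s/"//g;          # drop the double quotes
        my @f = split ' ', $clean;              # split on any run of whitespace
        next unless @f;                         # skip blank lines
        if (!@labels) { @labels = @f; next; }   # first row holds the labels
        $count{"$labels[$_],$f[$_]"}++ for 1 .. $#f;
    }
    return %count;
}

my %count = count_pairs(
    '"aphab"    "S1"    "S2"',
    '"a"    "A/A"    "A/A"',
    '"b"    "A/G"    "A/G"',
    '"c"    "A/A"    "G/G"',
);
print "$_ => $count{$_}\n" for sort keys %count;
```

The "label,pair" key is just one convenient way to look a count up later by ID and letter pair; a hash of hashes would work equally well.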

Thanx in advance
Pearl

durden_tyler · March 5, 2010, 9:16pm

stateperl:

2 input files. input1.csv and input2.csv....

The 1st column of input2.csv is merged with the 2nd, 3rd and 4th columns of input1.csv based on S1/S2/S3, taking the corresponding letter pairs (A/A, A/G or G/G) from input1.csv.

That gives ID (S1/S2/S3), Label (xxx/yyy/zzz) and the letter pair (A/A, A/G or G/G) as the 1st, 2nd and 3rd columns of output.csv.

The 4th column just prints 1 for A/A, 2 for A/G and 3 for G/G.
The 5th column in output.csv is the number of times the letter pair is repeated in input1.csv.
The 6th column is just the corresponding log values from input2.

Thanx in advance
Pearl
oem@mintibm ~/Desktop/Temp_SNP $ cat input1.csv 
"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"    "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"    "A/A"    "G/G"    "G/G"
oem@mintibm ~/Desktop/Temp_SNP $ cat input2.csv 
"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20
oem@mintibm ~/Desktop/Temp_SNP $ cat output_result.csv 
"ID"    "Label"    "StYPE"    "Ntype"    "Stype_No"    "log"
"S1"    "xxx"    "A/A"    1    6    2.8
"S1"    "xxx"    "A/G"    2    2    3
"S1"    "xxx"    "G/G"    3    1    4
"S2"    "yyy"    "A/A"    1    1    6.8
"S2"    "yyy"    "A/G"    2    2    7
"S2"    "yyy"    "G/G"    3    6    7.4
"S2"    "yyy"    "NULL"    "null"    "null"    8
"S3"    "zzz"    "A/A"    1    3    12
"S3"    "zzz"    "A/G"    2    3    14
"S3"    "zzz"    "G/G"    3    3    16
"S3"    "zzz"    "NULL"    "null"    "null"    18
"S3"    "zzz"    "NULL"    "null"    "null"    20

I don't think it's clear enough.

1) Are you doing a line-by-line comparison? If yes, then what do you compare lines 10, 11 and 12 of input2.csv with?

2) The following is not clear -

Let me take the first line of data in input2.csv -

"S1"    "xxx"    2.8

This is the output line you want -

"S1"    "xxx"    "A/A"    1    6    2.8

2a) If that "A/A" is due to "S1", then why is it not "A/A" in line 2 of the output?

2b) How did you get 6 for "Stype_No"? If it is the number of times "A" is repeated in line 1 of input1.csv, then what do you do when you pick up "A/G"?

Please take the first few lines of input2.csv and explain how you got each of those fields in the output csv, or at least the fields "Stype" and "Stype_No".

tyler_durden

binlib · March 5, 2010, 9:34pm

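awk '
BEGIN {
  # a[n] maps the Ntype number n to its quoted letter pair
  a[1] = "\"A/A\""
  a[2] = "\"A/G\""
  a[3] = "\"G/G\""
  OFS = "\t"
}
# First file (input1.csv): count each letter pair per sample column
NR == FNR {
  if (NR == 1) {
    for (i = 2; i <= NF; ++i)
      s[i] = $i          # remember the label of column i (S1, S2, ...)
    next
  }
  for (i = 2; i <= NF; ++i)
    ++b[s[i], $i]        # b[label, pair] = number of occurrences
  next
}
# Second file (input2.csv): move log to column 6 and fill columns 3 to 5
{
  $6 = $3
  if (FNR == 1) {
    $3 = "\"StYPE\""
    $4 = "\"Ntype\""
    $5 = "\"Stype_No\""
  } else {
    # i counts the rows seen so far for the current ID
    if ($1 == last) ++i
    else { last = $1; i = 1 }
    if (i in a) {
      $3 = a[i]
      $4 = i
      $5 = b[$1, a[i]]
    } else {
      $3 = "\"NULL\""
      $5 = $4 = "\"null\""
    }
  }
}
1
' input1.csv input2.csv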

stateperl · March 5, 2010, 9:58pm

@Tyler: Thank you for pointing it out. I edited the post above to explain the question better. Please take a look.
@binlib: Amazing and clear code. Thanks a lot! But there is a small bug in it: it prints unnecessary values at the end (see the last line of the output below).
One more thing: suppose I have multiple input2 files (with the same IDs but different Stypes and log values) and a single input1. Does this code work the same way? If not, could you please suggest a fix?
####
I tested with multiple input2 files. It works, but it gives 2 separate outputs (output1 and output2). What I need is a single output (a common output that adds all the Ntypes together).

thanx
Pearl.

output

oem@mintibm ~/Desktop/Temp_SNP $ cat output.txt 
"ID"	"Label"	"StYPE"	"Ntype"	"Stype_No"	"log"
"S1"	"xxx"	"A/A"	1	6	2.8
"S1"	"xxx"	"A/G"	2	2	3
"S1"	"xxx"	"G/G"	3	1	4
"S2"	"yyy"	"A/A"	1	1	6.8
"S2"	"yyy"	"A/G"	2	2	7
"S2"	"yyy"	"G/G"	3	6	7.4
"S2"	"yyy"	"NULL"	"null"	"null"	8
"S3"	"zzz"	"A/A"	1	3	12
"S3"	"zzz"	"A/G"	2	3	14
"S3"	"zzz"	"G/G"	3	3	16
"S3"	"zzz"	"NULL"	"null"	"null"	18
"S3"	"zzz"	"NULL"	"null"	"null"	20
		"A/A"	1

durden_tyler · March 6, 2010, 1:47am

Here's a Perl solution for this problem -

$ 
$ 
$ cat input1.csv
"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"
$ 
$ cat input2.csv
"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20
$ 
$ cat combine.pl
#!/usr/bin/perl -w

my $infile1 = "input1.csv";
my @infile2 = qw(input2.csv);
my $outfile = "output.csv";

# define hashes - %chartonum, %numtochar and %mainhash
my %chartonum = qw(A/A 1 A/G 2 G/G 3);
my %numtochar = qw(1 A/A 2 A/G 3 G/G);
my %mainhash;

# first process input1.csv
open(INFILE, $infile1) or die "Can't open $infile1: $!";
while (<INFILE>) {
  chomp;
  s/"//g;
  s/[ ]+/ /g;
  if ($. == 1) {
    @x = split/ /;
  } else {
    @y = split/ /;
    foreach $i (1..$#y) {
      $mainhash{$x[$i].",".$chartonum{$y[$i]}}++;
    }
  }
}
close(INFILE) or die "Can't close $infile1: $!";

# print the header
printf("%-12s%-12s%-12s%-12s%-12s%-s\n","\"ID\"","\"Label\"","\"StYPE\"","\"Ntype\"","\"Stype_No\"","\"log\"");
# now start processing the set of input2.csv files
foreach $file2 (@infile2) {
  # open $file2
  open(INFILE, $file2) or die "Can't open $file2: $!";
  while (<INFILE>) {
    if ($. > 1) {
      chomp;
      s/"//g;
      s/[ ]+/ /g;
      # print $_,"\n";
      @z = split/ /;
      if (!defined $prev or $z[0] ne $prev) {$num = 1} else {$num++};
      $prev = $z[0];
      printf("%-12s%-12s%-12s%-12s%-12s%-s\n",
             "\"$z[0]\"",
             "\"$z[1]\"",
             defined $numtochar{$num} ? "\"$numtochar{$num}\"" : "\"NULL\"",
             exists $numtochar{$num} ? $num : "\"null\"",
             defined $mainhash{$z[0].",".$num} ? $mainhash{$z[0].",".$num} : "\"null\"", 
             $z[2]
            );
    }
  }
  close(INFILE) or die "Can't close $file2: $!";
}

$ 
$ perl combine.pl
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "A/A"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20
$ 
$

Show an example of how the output would be affected in case of multiple input2.csv files.

While ID and Label may be the same, the values of "log" could be different in different input2.csv files.
The values of Stype, Ntype, Stype_No would remain the same since they depend on the input1.csv file. What do you want to do with multiple "log" values then?
If we just append the records of the next input2.csv, then I'd think the values of Stype, Ntype and Stype_No would be NULL, since they show up only for the first 3 rows.

tyler_durden

stateperl · March 6, 2010, 3:37am

@Tyler: Thanx for the Perl script. It looks a bit scary to me, but it is as smart as the awk one.
**********************************************************************************************

It's multiple input1 files and a single input2 file. In this case we need to sum up all the Stype_No values from the input1a, b and c files (see the Stype_No column in the output below), so it should work like this.
Note: to make it easy for you I just used identical copies of the input1 file (a, b and c), but in real cases the number of A/A or others may vary, and the number of input1 files may be greater than just 3 (it could be input1a, b, c, d or more).

perl script.pl input1a input1b input1c input2 >>output

input1a

"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"

input1b

"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"

input1c

"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"

input2

"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

output

"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "A/A"       1           18          2.8
"S1"        "xxx"       "A/G"       2           6           3
"S1"        "xxx"       "G/G"       3           3           4
"S2"        "yyy"       "A/A"       1           3           6.8
"S2"        "yyy"       "A/G"       2           6           7
"S2"        "yyy"       "G/G"       3           18          7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           9           12
"S3"        "zzz"       "A/G"       2           9           14
"S3"        "zzz"       "G/G"       3           9           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20

durden_tyler · March 6, 2010, 10:08am

Thanks for the explanation and example.
The base code remains the same; I've added the capability to accept arguments from command line, fill up an array for "input1" files and process each file in it i.e. each array element.

$ 
$ cat input1a
"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"
$ 
$ cat input1b
"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"
$ 
$ cat input1c
"aphab"    "S1"    "S2"    "S3"
"a"    "A/A"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "A/A"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "A/A"    "G/G"    "A/G"
"g"    "A/A"    "G/G"    "G/G"
"h"    "A/A"    "G/G"    "G/G"
"I"     "A/A"    "G/G"    "G/G"
$ 
$ cat combine.pl
#!/usr/bin/perl -w

# check that at least 2 arguments are passed to this program
# exit with error code 1 otherwise
if ($#ARGV < 1) {
  print "Usage:   perl combine.pl <list of input files separated by space> input2\n";
  print "Example: perl combine.pl input1a input1b input1c input2\n";
  exit 1;
}

# now assign the list of "input1" file names to array @infile1
foreach (0..$#ARGV-1) {
  push @infile1, $ARGV[$_];
}
# set the variable $infile2 to the last argument i.e. the "input2" file
$infile2 = $ARGV[$#ARGV];

# define hashes - %chartonum, %numtochar and %mainhash
my %chartonum = qw(A/A 1 A/G 2 G/G 3);
my %numtochar = qw(1 A/A 2 A/G 3 G/G);
my %mainhash;

# first process all "input1" files i.e. all elements of the array @infile1
foreach $file1 (@infile1) {
  open(INFILE, $file1) or die "Can't open $file1: $!";
  while (<INFILE>) {
    chomp;
    s/"//g;
    s/[ ]+/ /g;
    if ($. == 1) {
      @x = split/ /;
    } else {
      @y = split/ /;
      foreach $i (1..$#y) {
        $mainhash{$x[$i].",".$chartonum{$y[$i]}}++;
      }
    }
  }
  close(INFILE) or die "Can't close $file1: $!";
}

# print the header
printf("%-12s%-12s%-12s%-12s%-12s%-s\n","\"ID\"","\"Label\"","\"StYPE\"","\"Ntype\"","\"Stype_No\"","\"log\"");
# now start processing the "input2" file
open(INFILE, $infile2) or die "Can't open $infile2: $!";
while (<INFILE>) {
  if ($. > 1) {
    chomp;
    s/"//g;
    s/[ ]+/ /g;
    # print $_,"\n";
    @z = split/ /;
    if (!defined $prev or $z[0] ne $prev) {$num = 1} else {$num++};
    $prev = $z[0];
    printf("%-12s%-12s%-12s%-12s%-12s%-s\n",
           "\"$z[0]\"",
           "\"$z[1]\"",
           defined $numtochar{$num} ? "\"$numtochar{$num}\"" : "\"NULL\"",
           exists $numtochar{$num} ? $num : "\"null\"",
           defined $mainhash{$z[0].",".$num} ? $mainhash{$z[0].",".$num} : "\"null\"", 
           $z[2]
          );
  }
}
close(INFILE) or die "Can't close $infile2: $!";

$ 
$ # Error checking - incorrect number of arguments
$ 
$ perl combine.pl
Usage:   perl combine.pl <list of input files separated by space> input2
Example: perl combine.pl input1a input1b input1c input2
$ 
$ perl combine.pl input1a
Usage:   perl combine.pl <list of input files separated by space> input2
Example: perl combine.pl input1a input1b input1c input2
$ 
$ perl combine.pl input2
Usage:   perl combine.pl <list of input files separated by space> input2
Example: perl combine.pl input1a input1b input1c input2
$ 
$ echo $?
1
$ 
$ # Successful run
$ 
$ perl combine.pl input1a input1b input1c input2
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "A/A"       1           18          2.8
"S1"        "xxx"       "A/G"       2           6           3
"S1"        "xxx"       "G/G"       3           3           4
"S2"        "yyy"       "A/A"       1           3           6.8
"S2"        "yyy"       "A/G"       2           6           7
"S2"        "yyy"       "G/G"       3           18          7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           9           12
"S3"        "zzz"       "A/G"       2           9           14
"S3"        "zzz"       "G/G"       3           9           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20
$ 
$ echo $?
0
$ 
$

HTH,
tyler_durden

stateperl · March 7, 2010, 5:59am

Thanx Tyler it's working great

ruby_sgp · March 12, 2010, 9:55am

Nice ones

stateperl · March 12, 2010, 10:49am

Hey, a small alteration in how the hashes are defined:

# define hashes - %chartonum, %numtochar and %mainhash
my %chartonum = qw(A/A 1 A/G 2 G/G 3);
my %numtochar = qw(1 A/A 2 A/G 3 G/G);

Change it like this:
1st set = AA or CC or GG or TT
2nd set = AC, AG, AT, CG, CT, GT etc.

The 2nd set MUST be 2.
The 1st set MUST be 1 OR 3: if an ID has both T/T and G/G, take one as 1 and the other as 3.
But the output should show which letter pair (A/A or others) the 1, 2 or 3 came from.
For example:

# define hashes - %chartonum, %numtochar and %mainhash
my %chartonum = qw(A/A 1 T/T 3 G/G 1 C/C 3 A/T 2 A/G 2 A/C 2 T/A 2 T/G 2 T/C 2 G/A 2 G/C 2 C/A 2 C/T 2 C/G 2);

This one is working, but it does not show where the values came from (A/A or G/G or others), like this.
The second bold entry shouldn't be T/T.
input1

"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"

input2

"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

$ perl combine.pl
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20

durden_tyler · March 13, 2010, 8:35am

stateperl:

...
The second bold entry shouldn't be T/T.
input1

"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"

input2

"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

...

What should be the output for these files then ?

tyler_durden

stateperl · March 13, 2010, 11:22am

Conditions:

1. If the two letters are the same, it has to be 1 or 3 [A/A or T/T or G/G or C/C].
2. If the same ID has 2 different same-letter pairs, the first one has to be 1 and the other has to be 3 [see ID S1, which has T/T as 1 and G/G as 3].
3. If the letters are different, it has to be 2 [A/G or T/A or T/C or others].
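
Read literally, the three conditions can be sketched in Perl as below. This is only one interpretation: it assumes the pairs for one ID are examined in file order, that "first" means first seen, and that a third distinct same-letter pair (or a second distinct mixed pair) under the same ID simply gets no code. The name `assign_ntype` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: given the letter pairs of ONE ID in the order they
# appear, return a hash mapping each distinct pair to its Ntype code.
sub assign_ntype {
    my @pairs = @_;
    my (%code, %used);
    for my $pair (@pairs) {
        next if exists $code{$pair};                # pair already has a code
        my ($x, $y) = split m{/}, $pair;
        my @candidates = $x eq $y ? (1, 3) : (2);   # same letters: try 1, then 3
        for my $c (@candidates) {
            next if $used{$c};                      # code taken by another pair
            $code{$pair} = $c;
            $used{$c}    = 1;
            last;
        }
    }
    return %code;
}

# The S1 column of modified-input1: T/T is seen first, G/G later.
my %code = assign_ntype(qw(T/T A/G T/T G/G A/G T/T T/T T/T T/T));
print "$_ => $code{$_}\n" for sort { $code{$a} <=> $code{$b} } keys %code;
```

Calling it once per ID keeps the codes independent between S1, S2 and S3, which is what condition 2 seems to require.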

modified-input1

"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"

same old-input2

"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20

newoutput

"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20

durden_tyler · March 15, 2010, 11:12pm

Well, in this case, you'll have to generate the key-value pairs in the three hashes - %chartonum, %numtochar and %mainhash as you iterate through "input1", based on the 3 conditions mentioned.

$ 
$ 
$ cat input1
"aphab"    "S1"    "S2"    "S3"
"a"    "T/T"    "A/A"    "A/A"
"b"    "A/G"    "A/G"    "A/A"
"c"    "T/T"    "G/G"    "A/A"
"d"    "G/G"    "A/G"    "A/G"
"e"    "A/G"    "G/G"    "A/G"
"f"     "T/T"    "G/G"    "A/G"
"g"    "T/T"    "G/G"    "G/G"
"h"    "T/T"    "G/G"    "G/G"
"I"     "T/T"    "G/G"    "G/G"
$ 
$ cat input2
"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S2"    "yyy"    6.8
"S2"    "yyy"    7
"S2"    "yyy"    7.4
"S2"    "yyy"    8
"S3"    "zzz"    12
"S3"    "zzz"    14
"S3"    "zzz"    16
"S3"    "zzz"    18
"S3"    "zzz"    20
$ 
$ cat -n combine_2.pl
     1  #!/usr/bin/perl -w
     2
     3  my %chartonum;
     4  my %numtochar;
     5  my %mainhash;
     6
     7  # first process all "input1" files i.e. all elements of the array @infile1
     8  $file1 = "input1";
     9
    10  open(INFILE, $file1) or die "Can't open $file1: $!";
    11  while (<INFILE>) {
    12    chomp;
    13    s/"//g;
    14    s/[ ]+/ /g;
    15    if ($. == 1) {
    16      @x = split/ /;
    17    } else {
    18      @y = split/ /;
    19      foreach $i (1..$#y) {
    20        @t = split (/\//, $y[$i]);
    21        if ($t[0] eq $t[1]) {
    22          if (not defined $chartonum{$x[$i].",".$y[$i]} and
    23              not defined $numtochar{$x[$i].",1"}       and
    24              not defined $numtochar{$x[$i].",3"}) {
    25            $numtochar{$x[$i].",1"} = $y[$i];
    26            $chartonum{$x[$i].",".$y[$i]} = 1;
    27          }
    28          elsif (not defined $chartonum{$x[$i].",".$y[$i]} and
    29                 not defined $numtochar{$x[$i].",3"}) {
    30            $numtochar{$x[$i].",3"} = $y[$i];
    31            $chartonum{$x[$i].",".$y[$i]} = 3;
    32          }
    33        } else {
    34          if (not defined $chartonum{$x[$i].",".$y[$i]} and
    35              not defined $numtochar{$x[$i].",2"}) {
    36            $numtochar{$x[$i].",2"} = $y[$i];
    37            $chartonum{$x[$i].",".$y[$i]} = 2;
    38          } # end of if not defined
    39        } # end of else i.e. t[0] ne t[1]
    40        $mainhash{$x[$i].",".$chartonum{$x[$i].",".$y[$i]}}++;
    41      } # end of foreach
    42    } # end of $. > 1
    43  }
    44  close(INFILE) or die "Can't close $file1: $!";
    45
    46  # print the header
    47  printf("%-12s%-12s%-12s%-12s%-12s%-s\n","\"ID\"","\"Label\"","\"StYPE\"","\"Ntype\"","\"Stype_No\"","\"log\"");
    48  # now start processing the "input2" file
    49  $infile2 = "input2";
    50  open(INFILE, $infile2) or die "Can't open $infile2: $!";
    51  while (<INFILE>) {
    52    if ($. > 1) {
    53      chomp;
    54      s/"//g;
    55      s/[ ]+/ /g;
    56      # print $_,"\n";
    57      @z = split/ /;
    58      if (!defined $prev or $z[0] ne $prev) {$num = 1} else {$num++};
    59      $prev = $z[0];
    60      printf("%-12s%-12s%-12s%-12s%-12s%-s\n",
    61             "\"$z[0]\"",
    62             "\"$z[1]\"",
    63             defined $numtochar{$z[0].",".$num} ? "\"".$numtochar{$z[0].",".$num}."\"" : "\"NULL\"",
    64             exists $numtochar{$z[0].",".$num} ? $num : "\"null\"",
    65             defined $mainhash{$z[0].",".$num} ? $mainhash{$z[0].",".$num} : "\"null\"", 
    66             $z[2]
    67            );
    68    }
    69  }
    70  close(INFILE) or die "Can't close $infile2: $!";
    71
$ 
$ perl combine_2.pl
"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           6           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4
"S2"        "yyy"       "A/A"       1           1           6.8
"S2"        "yyy"       "A/G"       2           2           7
"S2"        "yyy"       "G/G"       3           6           7.4
"S2"        "yyy"       "NULL"      "null"      "null"      8
"S3"        "zzz"       "A/A"       1           3           12
"S3"        "zzz"       "A/G"       2           3           14
"S3"        "zzz"       "G/G"       3           3           16
"S3"        "zzz"       "NULL"      "null"      "null"      18
"S3"        "zzz"       "NULL"      "null"      "null"      20
$ 
$

Use the Data::Dumper module to check the values of the 3 hashes right after "input1" is done processing at line 45.
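
For example, a dump along these lines could be dropped in at that point. The hash contents below are hand-copied stand-ins for what the S1 column should leave behind, not output captured from a real run; `$Data::Dumper::Sortkeys` is a standard setting of the module that makes the key order stable.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;    # print keys in sorted order

# Stand-in data (assumed, for illustration): the entries the S1 column of
# "input1" is expected to leave in two of the hashes.
my %numtochar = ('S1,1' => 'T/T', 'S1,2' => 'A/G', 'S1,3' => 'G/G');
my %mainhash  = ('S1,1' => 6,     'S1,2' => 2,     'S1,3' => 1);

# Pass references, so each hash is dumped as one structure; the second
# list gives the names used in the output instead of $VAR1, $VAR2.
print Data::Dumper->Dump([\%numtochar, \%mainhash], [qw(numtochar mainhash)]);
```

In the real script the two `my` declarations would be left out, since the hashes already exist there.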

But frankly, given the complexity of calculations involved, I'd rather look for some Perl Bioinformatics modules that have subroutines to do this.
Or check BioPerl, or books like "Beginning/Mastering Perl for Bioinformatics" at amazon.com.

HTH,
tyler_durden

PS - I'm assuming those A, C, G, T are the nucleotide bases of a DNA strand, and these files are related to Bioinformatics.

stateperl · March 16, 2010, 2:41am

Thank you very much for follow up and suggestions.
I think I should practise with more Perl examples. But thank you for your valuable time.

durden_tyler · March 16, 2010, 7:41am

You don't have the 5th string in your printf statement near the end of the program.
Line 65 in the Perl program of my earlier post is the one that is missing.
The printf function at line 60 has the format masks for 6 strings, and those strings are typed, one per line, from line 61 through 66.
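
A cut-down illustration of that point, as a sketch: six format masks need six values. The row values are copied from the first data row of the output above; the variable names are made up.

```perl
use strict;
use warnings;

# Six masks: five left-justified 12-character columns plus a final %-s.
my $format = "%-12s%-12s%-12s%-12s%-12s%-s\n";

# Six values, one per mask. Leaving out the 5th (Stype_No) would shift
# "log" into the Stype_No column and leave the last mask without a value,
# which perl -w reports as a missing argument.
my @row = ('"S1"', '"xxx"', '"T/T"', 1, 6, 2.8);

print sprintf($format, @row);
```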

tyler_durden

stateperl · March 17, 2010, 12:24pm

Yep, that worked well. Thank you very much.

I think there is a small bug. Whenever the script finds the same ID in input1 (e.g. S1) and input2 (e.g. S1), the output is fine. But if an ID in input1 (e.g. S2) has no matching ID in input2 (e.g. S100), the output shouldn't include anything for it from input1 or input2. [So the bug is that the output prints all the unmatched IDs.]

Hope you don't feel bad about dragging this post a bit.

input1

"aphab"    "S1"  "S2"
"a"    "T/T"    "A/A"
"b"    "A/G"   "A/B"
"c"    "T/T"    "A/F"
"d"    "G/G"   "D/D"
"e"    "A/G"   "W/W"

input2

"ID"    "Label"    "log"
"S1"    "xxx"    2.8
"S1"    "xxx"    3
"S1"    "xxx"    4
"S100"    "yyy"    6.8
"S100"    "yyy"    7
"S100"    "yyy"    7.4
"S100"    "yyy"    8

The output should contain only S1 [no S2 or S100 values at all].

"ID"        "Label"     "StYPE"     "Ntype"     "Stype_No"  "log"
"S1"        "xxx"       "T/T"       1           2           2.8
"S1"        "xxx"       "A/G"       2           2           3
"S1"        "xxx"       "G/G"       3           1           4

thanx

durden_tyler · March 17, 2010, 3:08pm

This can be accomplished by a minor change in the 2nd part of the script. If you go through the script carefully and understand it thoroughly, then you should be able to fix it.
I'll leave it for you as an exercise.

And I think this is a new requirement.
The course of action in case of mismatched IDs was never mentioned in any of your previous posts.
If the requirements are clear, and despite that, the program doesn't print correct output, then it may be classified as a bug.
But if the requirements are not clear, then the program is bound to take unpredictable/default courses of action that you may not agree with.
This can be easily remedied by being very clear as to what you want, from square one.
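
For readers following along, one possible shape of such a change is sketched below. It is not the fix from this thread: it assumes the labels from the first row of input1 have been collected into a lookup hash (called `%known_id` here, a made-up name) and simply drops any input2 row whose ID is not in it.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only the input2-style rows whose first field
# (the ID) appears among the column labels taken from input1.
sub rows_with_known_id {
    my ($labels, @rows) = @_;
    my %known_id = map { $_ => 1 } @$labels;    # labels from the input1 header
    return grep {
        (my $clean = $_) =~ s/"//g;             # drop the double quotes
        my ($id) = split ' ', $clean;           # first field is the ID
        defined $id && $known_id{$id};
    } @rows;
}

my @kept = rows_with_known_id(
    [qw(S1 S2)],
    '"S1"    "xxx"    2.8',
    '"S100"    "yyy"    6.8',
);
print "$_\n" for @kept;    # only the S1 row is printed
```

Since the output is driven by the rows of input2, an input1 label with no rows in input2 (S2 in the example) never produces output in the first place; only the input2 side needs the filter.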

tyler_durden

stateperl · March 23, 2010, 11:34pm

Hey Tyler, could you please explain the last code you wrote, so that I can make my own modifications instead of asking you? Sorry for bothering you.