I don't quite understand your implementation in Python, but I do understand your problem. A suggested algorithm is as follows:
(1) Read a line, split it on tabs, and check whether the first token already exists as a hash key.
(2) If it does not, create the key and assign it a "template" array that has "N,N" in every value slot.
The template array is this: ("data1", "N,N", "data2", "N,N", "data3", "N,N").
(3) From the 2nd token, determine the index of the template array that you want to update.
So, for example, if the 2nd token is "data1", you extract 1 from it and you know that the "N,N" after "data1" is to be updated.
(4) Update the template array at the relevant index.
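For comparison with your Python code, here is a rough sketch of steps (1) to (4) in Python. The names TEMPLATE, build_table and sample are made up for illustration, and the sample lines simply mirror the test data used further down; treat it as an outline under those assumptions rather than a drop-in replacement.

```python
import re

# Fixed three-slot template with "N,N" defaults (same layout as described above).
TEMPLATE = ["data1", "N,N", "data2", "N,N", "data3", "N,N"]

def build_table(lines):
    """Collect tab-separated (key, dataN, value) lines into one row per key."""
    table = {}
    for line in lines:
        tokens = re.split(r"\t+", line.rstrip("\n"))   # step (1): split on tabs
        if len(tokens) < 3:
            continue                                   # assumption: skip blank/short lines
        key, label, value = tokens[:3]
        if key not in table:
            table[key] = list(TEMPLATE)                # step (2): fresh copy of the template
        n = int(re.sub(r"\D+", "", label))             # step (3): "data3" -> 3
        table[key][2 * n - 1] = value                  # step (4): overwrite the "N,N" slot
    return table

# Hypothetical in-memory stand-in for the input file.
sample = [
    "SN1\tdata1\tA,A\n",
    "SN1\tdata2\tA,B\n",
    "SN1\tdata3\tA,C\n",
    "AC2\tdata1\tA,B\n",
    "AC2\tdata2\tA,C\n",
    "TP3\tdata3\tC,C\n",
    "TP3\tdata1\tC,A\n",
]

table = build_table(sample)
for key, row in table.items():
    print(key, " ".join(row))
```

Copying the template with list(TEMPLATE) matters here: assigning TEMPLATE itself would make every key share, and overwrite, the same list.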
I've implemented this algorithm in Perl, and it is posted below.
A few notes:
(1) A hash value must be a scalar in Perl, so you store a "reference" to the array as the hash value. A "reference" in Perl is similar to a "pointer" in C. In Python no extra step is needed, because a dict value can be a list directly.
(2) Perl has 0-based arrays, i.e. the first index of an array is 0. (Python lists are 0-based as well, so the same arithmetic applies there.) So, once you read the second token, say, "data3", and extract 3 from it, you have to update index 5 (= 2*3 - 1) of the template array.
(3) I've adopted this "hard-coded" template array approach because I see this in your code:
which makes me believe that a particular token ("SN1", "AC2" or "TP3") can have at most three records. If that's not the case, then the problem becomes more interesting!
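To make the index arithmetic in note (2) concrete, here is a tiny Python check. The helper name value_slot is hypothetical; the template is the one given above.

```python
template = ["data1", "N,N", "data2", "N,N", "data3", "N,N"]

def value_slot(n):
    """0-based index of the value that follows the label "dataN"."""
    return 2 * n - 1

for n in (1, 2, 3):
    label_index = value_slot(n) - 1          # the label sits just before its value
    assert template[label_index] == f"data{n}"
    print(f"data{n}: label at index {label_index}, value at index {value_slot(n)}")
```

With 1-based indexes the value slot would be 2*N instead, which is why the base of the array matters when porting the script.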
With this approach, your data structure is complete as soon as you finish reading the file, and you can simply print out the results. I hope the script comments are sufficient.
$
$ # check the data file
$ cat test.txt
SN1 data1 A,A
SN1 data2 A,B
SN1 data3 A,C
AC2 data1 A,B
AC2 data2 A,C
TP3 data3 C,C
TP3 data1 C,A
$
$ # check the program file
$ cat process_test.pl
#!/usr/bin/perl
use strict;
use warnings;

die "Specify filename\n" if not defined $ARGV[0];          # Ask for filename
my %compare;                                               # Declare the hash to store all information
my $file = $ARGV[0];                                       # Assign the filename to a variable
open (my $fh, "<", $file) or die "Can't open $file: $!";   # Open the file handle; balk on error
while (<$fh>) {                                            # Loop through the file, line by line
  chomp;                                                   # Remove the End-of-Line character
  my @tokens = split /\t+/;                                # Split line on Tab and assign to array "tokens"
  if (not defined $compare{$tokens[0]}) {                  # If 1st element of "tokens" is not a key, then
    $compare{$tokens[0]} = [ "data1", "N,N",               # Create the key in the "compare" hash and
                             "data2", "N,N",               # assign a template value with default "N,N"s
                             "data3", "N,N"                # The [] returns a reference to the array, since
                           ];                              # the hash value must be a scalar in Perl.
  }
  (my $index = $tokens[1]) =~ s/\D+//;                     # Determine the array index to be updated
  $compare{$tokens[0]}->[2*$index-1] = $tokens[2];         # And then update the array
}                                                          # Done reading the file
close ($fh) or die "Can't close $file: $!";                # So close it; balk on error
while (my ($k, $v) = each %compare) {                      # Loop through the hash (order is not guaranteed)
  printf ("%s %s\n", $k, join (" ", @$v));                 # and print out the keys and values
}
$
$ # A dry run
$ perl process_test.pl
Specify filename
$
$ # A successful run
$ perl process_test.pl test.txt
AC2 data1 A,B data2 A,C data3 N,N
TP3 data1 C,A data2 N,N data3 C,C
SN1 data1 A,A data2 A,B data3 A,C
$
$
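If a token can have more than three records, the fixed template no longer fits. One possible generalization, which is not what the Perl script does, is to store the values per slot number and fill in the "N,N" defaults only when printing. The sketch below is in Python; the name build_rows, the "data" label prefix and the sample lines are assumptions made for illustration.

```python
import re

def build_rows(lines, default="N,N"):
    """Like the fixed-template approach, but sized by the largest dataN seen."""
    values = {}      # key -> {slot number -> value}
    max_slot = 0
    for line in lines:
        tokens = re.split(r"\t+", line.rstrip("\n"))
        if len(tokens) < 3:
            continue                                   # assumption: skip blank/short lines
        key, label, value = tokens[:3]
        n = int(re.sub(r"\D+", "", label))             # "data4" -> 4
        values.setdefault(key, {})[n] = value
        max_slot = max(max_slot, n)

    rows = {}
    for key, slots in values.items():
        row = []
        for n in range(1, max_slot + 1):
            row += [f"data{n}", slots.get(n, default)] # default where nothing was read
        rows[key] = row
    return rows

# Hypothetical input with a fourth record for one key.
rows = build_rows([
    "SN1\tdata1\tA,A\n",
    "SN1\tdata4\tB,B\n",
    "AC2\tdata2\tA,C\n",
])
for key, row in rows.items():
    print(key, " ".join(row))
```

The trade-off is a second pass over the collected data, since the row width is only known after the whole file has been read.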