How to create a matrix from coordinates

AnaGuerrero · October 20, 2022, 12:34am

Hi all,

I am just learning bash and Python and have a basic question that you may know how to solve. I have a file with coordinates like:

1,1
1,3
2,2
3,3
4,4
4,5
5,5

The challenge is to create a matrix from these coordinates, so I will have an output like:

   1 2 3 4 5
1  1 0 1 0 0
2  0 1 0 0 0
3  1 0 1 0 0
4  0 0 0 1 1
5  0 0 0 1 1

The way I thought about it was just to print 1 if the coordinates are present (in this example 1,1; 1,3; 2,2; 3,3; 4,4; 4,5; 5,5) and then add 0's for the rest. This is a toy set, but in reality I have bigger 2-column files that go up to 100,000, so I will have 100,000 rows by 100,000 columns.
Is there anything you could think of in order to create the matrix?
Like I said, I am pretty new at this, but I found something in MATLAB that could help: meshgrid (2-D and 3-D grids). However, since I am learning bash/Python, I am looking for something in those languages.
Any advice?
Thanks in advance.
Ana

EyeOfSauron · October 20, 2022, 1:53am

Welcome @AnaGuerrero

Are the results to be a "truth table" with only ones and zeros, or are you going to count the pair occurrences?

munkeHoller · October 20, 2022, 6:01am

hi,
there are a number of existing python packages that may have what you are looking for - the link below gives a synopsis of some with links to the specific packages.

the numpy package has a meshgrid function: numpy.meshgrid (NumPy v1.24 Manual)
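
For a small input, plain NumPy index assignment may be closer to what is asked than meshgrid, which builds coordinate grids. A minimal sketch, assuming 1-based integer pairs and the symmetric output shown in the question; the function name `pairs_to_matrix` is made up for illustration. A dense 100,000 x 100,000 array would need about 10 GB even at one byte per cell, so this is only meant for toy sizes.

```python
import numpy as np

def pairs_to_matrix(pairs):
    """Build a dense symmetric 0/1 matrix from 1-based (row, col) pairs."""
    idx = np.asarray(pairs, dtype=np.int64) - 1   # shift to 0-based indices
    d = int(idx.max()) + 1                        # matrix dimension
    m = np.zeros((d, d), dtype=np.uint8)
    m[idx[:, 0], idx[:, 1]] = 1                   # mark (row, col)
    m[idx[:, 1], idx[:, 0]] = 1                   # mark the mirrored (col, row)
    return m

pairs = [(1, 1), (1, 3), (2, 2), (3, 3), (4, 4), (4, 5), (5, 5)]
print(pairs_to_matrix(pairs))
```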

bendingrodriguez · October 20, 2022, 2:15pm

Hi @AnaGuerrero,

well, printing a 100k x 100k matrix on the screen doesn't make much sense. And if you want to save it to a file, you shouldn't use plain text but a binary, i.e. 'packed', format, especially for a (large) boolean matrix.
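
To illustrate what 'packed' could mean here: one bit per cell, eight cells per byte. This is only a sketch using the standard library, and the helper names `pack_row` and `unpack_row` are hypothetical. At one bit per cell, a full 100k x 100k matrix is 10^10 bits, i.e. about 1.25 GB.

```python
def pack_row(cols, d):
    """Pack the 1-based column numbers holding a 1 into ceil(d/8) bytes."""
    buf = bytearray((d + 7) // 8)
    for c in cols:
        i = c - 1                       # 0-based bit position
        buf[i // 8] |= 0x80 >> (i % 8)  # most significant bit first
    return bytes(buf)

def unpack_row(data, d):
    """Expand packed bytes back into a list of d zeros and ones."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(d)]

row4 = pack_row([4, 5], 5)              # row 4 of the example matrix
print(row4.hex(), unpack_row(row4, 5))
```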

Below is a very simple approach in py3 that uses a tuple (row, col) as the dict key and assumes that row and col are integers. Since the matrix is symmetric (as shown by your input and output), the transposed element (col, row) does not have to be stored separately. (row, col) elements that do not appear in the input data are neither created nor stored.
The dimension d is determined dynamically, but could also simply be passed as an argument to the script or be set statically.

You could also use a 2D array or nested list, but then the size should be known beforehand so that memory does not have to be allocated dynamically while reading in the data, which in general costs some more time (for large data). It also has the disadvantage that it always has to contain d*d elements, unless you only save the part above the diagonal including the diagonal itself, i.e. d+(d-1)+..+1 = d*(d+1)/2 elements, a bit more than half of the former. As far as access speed is concerned, list access m[r][c] normally is a bit faster than dict access m[(r, c)]. But, as already mentioned by @munkeHoller, for working with arrays/matrices, numpy is a good choice, especially when processing large amounts of data.
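
The d*(d+1)/2 layout could be sketched as one flat list plus an index formula. This assumes 1-based indices and row-by-row storage of the upper triangle; the helper name `tri_index` is made up for illustration.

```python
def tri_index(r, c, d):
    """Map 1-based (r, c) to a slot in a flat list of d*(d+1)//2 cells."""
    if r > c:
        r, c = c, r                     # symmetric: (c, r) shares the slot of (r, c)
    r0, c0 = r - 1, c - 1               # 0-based
    # rows before r0 hold d + (d-1) + ... cells; then step to the column
    return r0 * d - r0 * (r0 - 1) // 2 + (c0 - r0)

d = 5
cells = [0] * (d * (d + 1) // 2)        # 15 cells instead of 25
for r, c in [(1, 1), (1, 3), (2, 2), (3, 3), (4, 4), (4, 5), (5, 5)]:
    cells[tri_index(r, c, d)] = 1

print(cells[tri_index(3, 1, d)], cells[tri_index(5, 4, d)], cells[tri_index(2, 5, d)])
```

The mirrored lookups (3, 1) and (5, 4) land on the cells stored for (1, 3) and (4, 5), so nothing below the diagonal is kept.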

There are also optimized storage methods for sparse and/or boolean matrices, see e.g. Sparse matrix - Wikipedia or https://stackoverflow.com/questions/9243004/storing-large-amount-of-boolean-data-in-python.

d = 0   # dimension of the matrix (largest index seen so far)
m = {}  # the matrix: only cells containing 1 are stored, keyed by (row, col)
with open("infile") as fin:
    for ln in fin:  # read line by line instead of building a list first
        if not ln.strip():
            continue  # skip blank lines
        a, b = ln.strip().split(",")
        r, c = int(a), int(b)
        m[(r, c)] = 1
        d = max(r, c, d)
# printing aligned row & col no. omitted
for r in range(1, d+1):
    for c in range(1, d+1):
        # dict.get(unknown_key) returns None, so missing cells fall through to 0
        print(m.get((r, c)) or m.get((c, r)) or 0, end=" ")
        # or, by converting bool to int
        #print(int((r, c) in m or (c, r) in m), end=" ")
    print()
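
The script leaves out the row and column labels from the requested output. One possible way to add them is sketched here with a set of pairs in place of the dict, and with a list of lines in place of the file so that it runs on its own. The name `format_matrix` is hypothetical, and at least one input pair is assumed.

```python
def format_matrix(lines):
    """Return the labelled symmetric 0/1 matrix as a list of text rows."""
    seen = set()
    for ln in lines:
        if ln.strip():                            # skip blank lines
            r, c = (int(x) for x in ln.split(","))
            seen.add((min(r, c), max(r, c)))      # one entry covers (r, c) and (c, r)
    d = max(c for _, c in seen)                   # largest index = dimension
    w = len(str(d))                               # column width for alignment
    out = [" " * w + " " + " ".join(f"{c:>{w}}" for c in range(1, d + 1))]
    for r in range(1, d + 1):
        cells = (int((min(r, c), max(r, c)) in seen) for c in range(1, d + 1))
        out.append(f"{r:>{w}} " + " ".join(f"{v:>{w}}" for v in cells))
    return out

print("\n".join(format_matrix(["1,1", "1,3", "2,2", "3,3", "4,4", "4,5", "5,5"])))
```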

vgersh99 · October 20, 2022, 4:51pm

@AnaGuerrero,
the same could be done with gawk where you save ONLY the cells containing 1 - the "missing" cells could be treated as containing 0.

If you want to "save/dump" the populated matrix and "read" it back in later on, you could take advantage of the writea and reada from the rwarray extension of gawk: Dumping and Restoring an Array.
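
For the Python route, a roughly comparable dump-and-restore could be sketched with the standard pickle module. This is not the gawk rwarray mechanism, just an analogue; the file name `matrix.pkl` and the temporary directory are assumptions for the example.

```python
import os
import pickle
import tempfile

# only the populated cells are kept, as in the dict approach
cells = {(1, 1), (1, 3), (2, 2), (3, 3), (4, 4), (4, 5), (5, 5)}

path = os.path.join(tempfile.mkdtemp(), "matrix.pkl")  # hypothetical file name
with open(path, "wb") as fout:
    pickle.dump(cells, fout)           # dump the populated cells

with open(path, "rb") as fin:
    restored = pickle.load(fin)        # read them back in, e.g. in a later run

print(restored == cells)
```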

AnaGuerrero · October 20, 2022, 5:19pm

Hi @EyeOfSauron ,
Yeah, I'm only looking for a table of 1's and 0's.

AnaGuerrero · October 20, 2022, 5:51pm

@bendingrodriguez,
Thank you! this is working perfectly!!

AnaGuerrero · October 20, 2022, 5:51pm

Thank you for the help

AnaGuerrero · October 20, 2022, 5:52pm

Thank you for the links!