How to sort common arrays from two different data files

pkdas · June 29, 2021, 4:03pm

I have two files with two columns each; together, the two columns in a row make up a pair that I need. Certain pairs are common to both files. I need to create three files: one with the common pairs, and two with the pairs exclusive to each source file.
Note that a common pair may appear in any row of the other file.
I am using bash on Ubuntu 18.04. I have tried Fortran, but it crashes on large files.
Please forgive me if I have not expressed my problem clearly. Thank you in advance.

File 1

File 2

bendingrodriguez · June 29, 2021, 5:44pm

Hi @pkdas,

Using Fortran nowadays is quite unusual. Here is an approach with python3:

#!/usr/bin/python3

# convert to int in order to make tuples sortable by number, not by string
def read_pairs(path):
    # skip blank lines, which would otherwise yield empty tuples
    with open(path) as fh:
        return set(tuple(map(int, ln.split())) for ln in fh if ln.strip())

def write_pairs(path, pairs):
    # sets by definition haven't any order, so sort them
    with open(path, 'w') as fh:
        for t in sorted(pairs):
            fh.write('{} {}\n'.format(*t))

f1 = read_pairs('f1')
f2 = read_pairs('f2')
write_pairs('f.both', f1 & f2)
write_pairs('f.only-1', f1 - f2)
write_pairs('f.only-2', f2 - f1)
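
As a quick illustration of what the three set operations produce, here is a minimal sketch on in-memory sample data. The sample pairs and the helper name `parse_pairs` are made up for this example, and blank lines are skipped:

```python
def parse_pairs(lines):
    """Parse whitespace-separated integer pairs, ignoring blank lines."""
    return {tuple(map(int, ln.split())) for ln in lines if ln.strip()}

# Hypothetical sample data; note the differing spacing and row order.
f1 = parse_pairs(["1 2", "3   4", "", "10 20"])
f2 = parse_pairs(["10  20", "5 6", "1 2"])

both = sorted(f1 & f2)      # pairs present in both files
only_1 = sorted(f1 - f2)    # pairs exclusive to the first file
only_2 = sorted(f2 - f1)    # pairs exclusive to the second file

print(both)    # [(1, 2), (10, 20)]
print(only_1)  # [(3, 4)]
print(only_2)  # [(5, 6)]
```

Because the pairs are converted to integers before comparison, neither the spacing nor the row order of the inputs affects the result.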

And for verification in bash:

#!/bin/bash

for f in f.both f.only*; do
    echo "${f#*.}:"
    # read pairs from outfiles and search for them in both infiles
    while read -r x y; do
        # \b matches at the edge of a word/number
        # that's only because of the different spaces in the infiles,
        #   otherwise "^$x $y$" would be enough, and the sed wouldn't be needed
        # then sort by 1st number & strip multiple spaces
        grep -E "\b$x\b.*\b$y\b" f1 f2 | sort -g -k2 | sed -r 's,[[:space:]]+, ,g'
    done < "$f"
done

There are of course other and faster methods, especially with awk, or (maybe) with tools like cmp, join, paste etc. I only suggested this because it's a typical example for sets in python.

sbuckman1 · June 29, 2021, 5:49pm

# compare the sorted files - comm prints three tab-separated columns:
#   1 = lines only in file1, 2 = lines only in file2, 3 = lines in both
# use awk to build your three output files
# (assumes the data lines themselves contain no tab characters)
comm --output-delimiter=$'\t' <(sort file1) <(sort file2) |
 awk -F"\t" '$1 != "" {print $1 > "left.txt"}
             $2 != "" {print $2 > "right.txt"}
             $3 != "" {print $3 > "same.txt"}'
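
For readers unfamiliar with comm: it walks two sorted inputs in lockstep and reports, for each line, whether it occurs only in the first input, only in the second, or in both. A rough Python sketch of that merge follows. The function name `comm_split` and the sample lines are hypothetical, and this illustrates the idea rather than how comm is actually implemented:

```python
def comm_split(left, right):
    """Three-way split of two sorted iterables, in the spirit of comm.

    Returns (only_left, only_right, common). Both inputs must be sorted
    in the same order, which is why the shell version sorts first.
    """
    only_left, only_right, common = [], [], []
    left_it, right_it = iter(left), iter(right)
    a, b = next(left_it, None), next(right_it, None)
    while a is not None and b is not None:
        if a == b:
            common.append(a)
            a, b = next(left_it, None), next(right_it, None)
        elif a < b:
            only_left.append(a)
            a = next(left_it, None)
        else:
            only_right.append(b)
            b = next(right_it, None)
    # One side is exhausted; whatever remains is exclusive to the other side.
    while a is not None:
        only_left.append(a)
        a = next(left_it, None)
    while b is not None:
        only_right.append(b)
        b = next(right_it, None)
    return only_left, only_right, common

# Hypothetical sample lines, sorted as plain strings.
file1 = sorted(["1 2", "3 4", "10 20"])
file2 = sorted(["10 20", "5 6", "1 2"])
print(comm_split(file1, file2))
```

Only the current line of each input is needed at any time, so the merge itself does not have to hold either file in memory; the lists here are just for the demonstration.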

MadeInGermany · June 29, 2021, 6:37pm

The same comm solution, step by step (using less memory).
Are the input files sorted?
If not, sort them (note that this overwrites the files in place):

sort -o file1 file1
sort -o file2 file2

Is the spacing consistent?
Assuming yes, create the output files:

comm -23 file1 file2 >only_file1
comm -13 file1 file2 >only_file2
comm -12 file1 file2 >common_file1_2
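
If the spacing is not consistent, comm will treat `1 2` and `1   2` as different lines, because it compares lines as text. Below is a minimal sketch of a normalization pass that could be run before sorting, assuming whitespace-separated columns. The function names `normalize_line` and `normalize_file` are made up for this example:

```python
def normalize_line(line):
    """Collapse runs of whitespace to single spaces and trim both ends."""
    return " ".join(line.split())

def normalize_file(src, dst):
    """Write a normalized copy of src to dst, dropping blank lines."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            cleaned = normalize_line(line)
            if cleaned:
                fout.write(cleaned + "\n")

print(normalize_line("  12 \t 34\n"))  # 12 34
```

Writing to a separate output file (for example a hypothetical `file1.norm`) keeps the original data untouched; the sort and comm steps would then be run on the normalized copies.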
