How to compare two columns and retrieve data

I am a newbie to Unix and slowly learning it. I have a large data set with 8 different columns. I want to compare two columns and retrieve data if the two columns have similar number.

I have attached the example. There are two columns (S-Contig and N-Contig). I want to retrieve the data from rows where these two columns have a same number. For example, both these columns have 3205 and I want to write the data from all the columns for 3205 into a new file.

S-Contig    S_D    S_P    S_PD        N-Contig    N_D    N_P    N_PD
3205          1        1      1.00000        3196         2        1            0.50000
3254          51      42     0.82353        3205        1        2           2.00000
3259          28      12     0.42857        3216        4        5           1.25000

I have been trying to figure out this question for couple of hours now using awk and I am confused. I really appreciate if some one can help me with this.

Thank you

Please post, wrapped by CodeTags, a sample of the text file.
Many people do not want to download external files here.

1 Like

Your file doesn't have unix newlines. Change them and try this:

awk '{ a[$1] = $1 } $5 == a[$5]' problem.txt 

Thank you for your reply. I converted the file to unix newlines and tried awk command. I am having trouble in getting the output.
$ file problem.txt
problem.txt: ASCII text, with CR line terminators
$ sed -e 's/\r$//' problem.txt > problem2.txt
$ file problem2.txt
problem2.txt: ASCII text, with CR, LF line terminators
$awk '{ a[$1] = $1 } $5 == a[$5]' problem2.txt > ans.txt

# This gives me an empty ans.txt file. I would like to write all the data for these matching columns as an output.

Thank you

% cat >problem.txt
S-Contig    S_D    S_P    S_PD        N-Contig    N_D    N_P    N_PD
3205          1        1      1.00000        3196         2        1            0.50000
3254          51      42     0.82353        3205        1        2           2.00000
3259          28      12     0.42857        3216        4        5           1.25000
jazu@uf3 ~/tmp 
% awk '{ a[$1] = $1 } $5 == a[$5]' problem.txt
3254          51      42     0.82353        3205        1        2           2.00000

You need the file with only LF terminators. Try (GNU sed):

sed 's/\r/\n/g' problem.txt 

But I'm afraid my solution was wrong. Try this one:

awk '{a[$1]=$1; b[$1]=$0} $5==a[$5] {print b[$5]; print}' 

Thank you, Yazu for your help. I tried your code and it still doesn't give me a the correct output.
Here is the code I tried.

$ file problem.txt
problem.txt: ASCII text, with CR, LF line terminators
$ sed 's/\r/\n/g' problem.txt
3401ntig54      46_D    0.8000003368D    1345ntig170     1.00000 0.64286 NBH_PD
$ sed -e 's/\r$//' problem.txt > problem2.txt
$ file problem2.txt
problem2.txt: ASCII text, with CR, LF line terminators
$ perl -pe 's/\r\n|\n|\r/\n/g' problem.txt > problem2.txt
$ file problem2.txtproblem2.txt: ASCII text
$ awk '{a[$1]=$1; b[$1]=$0} $5==a[$5] {print b[$5]; print}' problem2.txt

3205    1    1    1.00000    3196    2    1    0.50000
3254    51    42    0.82353    3205    1    2    2.00000
3254    51    42    0.82353    3205    1    2    2.00000
3301    7    7    1.00000    3254    82    130    1.58537
3259    28    12    0.42857    3216    4    5    1.25000
3305    28    47    1.67857    3259    14    33    2.35714
3277    2    8    4.00000    3225    1    1    1.00000
3311    44    46    1.04545    3277    5    3    0.60000
3301    7    7    1.00000    3254    82    130    1.58537
3323    29    31    1.06897    3301    3    14    4.66667
3305    28    47    1.67857    3259    14    33    2.35714
3335    5    6    1.20000    3305    19    34    1.78947
3311    44    46    1.04545    3277    5    3    0.60000
3340    6    5    0.83333    3311    56    50    0.89286
3322    32    36    1.12500    3300    3    4    1.33333
3342    7    24    3.42857    3322    18    29    1.61111
3323    29    31    1.06897    3301    3    14    4.66667
3345    17    18    1.05882    3323    57    58    1.01754
3330    10    17    1.70000    3303    2    1    0.50000
3346    23    17    0.73913    3330    4    7    1.75000
3335    5    6    1.20000    3305    19    34    1.78947
3363    33    43    1.30303    3335    5    4    0.80000
3338    10    10    1.00000    3309    10    19    1.90000
3378    1    4    4.00000    3338    8    5    0.62500
3340    6    5    0.83333    3311    56    50    0.89286
3379    3    12    4.00000    3340    2    4    2.00000
3342    7    24    3.42857    3322    18    29    1.61111
3384    3    1    0.33333    3342    9    33    3.66667
3345    17    18    1.05882    3323    57    58    1.01754
3390    2    26    13.00000    3345    14    9    0.64286
3346    23    17    0.73913    3330    4    7    1.75000
3391    34    26    0.76471    3346    4    6    1.50000
3363    33    43    1.30303    3335    5    4    0.80000
3399    5    8    1.60000    3363    23    67    2.91304

I am not sure why some of rows are repeating. I am looking for an output like this:

3205    1    1    1.00000     3205    1    2    2.00000
3254    51    42    0.82353 3254    82    130    1.58537

Thank you for your help! I really appreciate it.

???

awk '$1 == $5' problem.txt

But you haven't such strings in your input file. What exactly kind of processing do you need?

Hi Yazu,

Thank you for being so helpful. I tried this code and it didn't work.
awk '$1 == $5' problem.txt
I am not sure if I understand your question. All I am trying to do is to write a new file with data from matching columns 1 and 5.
My input file has two sets of data. One set is from column 1 to 4 and another set is from column 5 to 8. Column 1 and 5 have similar numbers. I want to match those similarities and get all the other data from these files.

Hope my explanation is clear. Please let me know if you understand my question.

Thank you again.