How to remove matched rows from my file?

nengcheng · April 3, 2017, 9:05pm

Hello,

I am a beginner and just got help from this forum. The command line is:

awk  '($1, $2) in x {
    print x[$1, $2]
    print
    delete x[$1, $2]
    next
}
{    x[$2, $1] = $0
}' results>myfile

I got an output file "myfile" from the original file 'results'. The question is how to get all the rows that do not appear in the output file, i.e. I want to do a negative selection. I think there must be a way to do that, but I have spent hours on it and still have no idea. My previous question was here:

Glyma.10G051100 Glyma.02G036000 89.91 228 23 0 1 228 1 228 1e-78 294
Glyma.10G051100 Glyma.09G023700 87.28 228 29 0 1 228 1 228 1e-68 261
Glyma.10G285200 Glyma.20G103800 96.33 1663 55 4 1 1657 1 1663 0.0 2728
Glyma.10G285200 Glyma.05G093700 95.02 321 16 0 406 726 1 321 8e-142 505
Glyma.10G212900 Glyma.17G186600 90.36 1338 129 0 1 1338 1 1338 0.0 1757
Glyma.10G212900 Glyma.05G089000 90.21 1338 131 0 1 1338 1 1338 0.0 1746
Glyma.10G212900 Glyma.16G068000 88.67 1341 146 5 1 1338 1 1338 0.0 1629
Glyma.10G212900 Glyma.19G052400 88.83 1325 148 0 1 1325 1 1325 0.0 1628
Glyma.10G212900 Glyma.05G114900 88.25 1328 156 0 1 1328 1 1328 0.0 1589
Glyma.10G212900 Glyma.19G078900 89.31 262 27 1 1074 1335 202 462 2e-88 327
Glyma.10G212900 Glyma.19G078900 89.71 204 21 0 790 993 1 204 2e-68 261
Glyma.10G296300 Glyma.20G246900 95.11 470 23 0 1 470 1 470 0.0 741
Glyma.10G296300 Glyma.20G246900 92.26 168 7 2 744 911 834 995 2e-60 233
Glyma.10G001700 Glyma.10G179600 83.45 701 113 1 44 741 50 750 0.0 649
Glyma.10G179600 Glyma.10G001700 83.45 701 113 1 50 750 44 741 0.0 649
Glyma.10G056500 Glyma.10G056300 89.27 261 24 2 41 300 61 318 4e-88 324
Glyma.10G056300 Glyma.10G056500 89.27 261 24 2 61 318 41 300 5e-88 324
Glyma.10G088600 Glyma.10G085100 97.13 522 15 0 1 522 1 522 0.0 881
Glyma.10G085100 Glyma.10G088600 97.13 522 15 0 1 522 1 522 0.0 881

Don_Cragun · April 3, 2017, 10:33pm

Try this slight modification to your previous script. It produces two output files. The matched pairs of input lines will be written to the file named matched and the remaining input lines will be written to the file named unmatched:

awk '
($1, $2) in x {
	print x[$1, $2] > "matched"
	print > "matched"
	delete x[$1, $2]
	next
}
{	x[$2, $1] = $0
}
END {	for(key in x)
		print x[key] > "unmatched"
}' results

As with the code before, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

nengcheng · April 4, 2017, 9:58am

Thank you for the information. The problem is that the numbers of matched and unmatched lines do not add up to the total number of lines, and I don't know what is wrong. I think the cause is that many A-B and B-A pairs recur several times with different values in the other columns (3rd, 4th, etc.), so the number of unmatched lines is significantly lower than the total minus the matched lines.
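One way to confirm such a discrepancy is to compare the input line count with the combined line count of the two output files. The following is only a sketch on a made-up six-row input (the rows `A B 1` and so on are hypothetical, not taken from the real data); rows 1 and 2 deliberately share the same first two columns, since recurring pairs are the suspected cause. The script is the one posted above, unchanged.

```shell
# Made-up six-row input (hypothetical); rows 1 and 2 repeat the same pair.
tmp=$(mktemp -d) && cd "$tmp"
printf '%s\n' 'A B 1' 'A B 2' 'B A 3' 'C D 4' 'C D 5' 'E F 6' > results

# The script from the post above, unchanged.
awk '
($1, $2) in x {
    print x[$1, $2] > "matched"
    print > "matched"
    delete x[$1, $2]
    next
}
{   x[$2, $1] = $0
}
END {   for(key in x)
        print x[key] > "unmatched"
}' results

# Count input lines and the lines that reached either output file.
total=$(awk 'END { print NR }' results)
written=$(cat matched unmatched | awk 'END { print NR }')
echo "input: $total matched+unmatched: $written"
```

On this input the last line prints `input: 6 matched+unmatched: 4`. Each stored row is held under a single array element per pair, so a second row with the same first two columns replaces the first one before it is ever printed.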

RudiC · April 4, 2017, 10:37am

We don't either, unless you post sample input and output data and the code used (unless it's the same as posted above).

nengcheng · April 4, 2017, 10:47am

How can I upload my sample? It's a large dataset, more than 10 MB.

RudiC · April 4, 2017, 10:52am

So, that's a catch-22, isn't it? How about posting the smallest possible set of test data that shows the problem?

nengcheng · April 4, 2017, 11:41am

I uploaded a very small sample, 19 lines in total. The command gives 11 and 6 lines, respectively. I don't know what's wrong. Maybe something is wrong with my format?

Oh, I realized that I didn't remove the A-B, A-B pattern, i.e. lines that have the same values in the first two columns.

rbatte1 · April 5, 2017, 7:17am

Please paste the 19 lines in CODE tags into a post so we can see what you are starting with. If you can craft the output you would expect in a separate part (also wrapped in CODE tags) then that will help us to help you.

Kind regards,
Robin