Find all lines in file such that each word on that line appears in at least n lines of the file

uncleMonty · June 14, 2017, 9:49pm

I have a file where every line includes four expressions with a caret in the middle (plus some other "words" or fields, always separated by spaces). I would like to extract from this file, all those lines such that each of the four expressions containing a caret appears in at least four different lines of the whole file. Could anyone help me?

Here is a section of my file:

5^4 + 32^1 = 6^3 + 21^2    (625, 32, 216, 441)
5^4 + 34^2 = 12^3 + 53^1    (625, 1156, 1728, 53)
5^4 + 40^2 = 13^3 + 28^1    (625, 1600, 2197, 28)
5^4 + 42^1 = 7^3 + 18^2    (625, 42, 343, 324)
5^4 + 53^2 = 15^3 + 59^1    (625, 2809, 3375, 59)
5^4 + 56^1 = 8^3 + 13^2    (625, 56, 512, 169)
5^4 + 66^2 = 17^3 + 68^1    (625, 4356, 4913, 68)
5^4 + 75^1 = 6^3 + 22^2    (625, 75, 216, 484)
5^5 + 6^4 = 65^1 + 66^2    (3125, 1296, 65, 4356)
5^5 + 7^1 = 6^3 + 54^2    (3125, 7, 216, 2916)
5^5 + 7^4 = 50^1 + 74^2    (3125, 2401, 50, 5476)
5^5 + 8^3 = 37^1 + 60^2    (3125, 512, 37, 3600)
5^5 + 9^3 = 10^1 + 62^2    (3125, 729, 10, 3844)
5^5 + 10^3 = 8^4 + 29^1    (3125, 1000, 4096, 29)
5^5 + 16^2 = 6^1 + 15^3    (3125, 256, 6, 3375)
5^5 + 17^2 = 15^3 + 39^1    (3125, 289, 3375, 39)
5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^5 + 19^1 = 14^3 + 20^2    (3125, 19, 2744, 400)
5^5 + 20^1 = 6^4 + 43^2    (3125, 20, 1296, 1849)
5^5 + 27^1 = 7^3 + 53^2    (3125, 27, 343, 2809)
5^5 + 32^2 = 8^4 + 53^1    (3125, 1024, 4096, 53)
5^5 + 32^2 = 16^3 + 53^1    (3125, 1024, 4096, 53)
5^5 + 33^1 = 13^3 + 31^2    (3125, 33, 2197, 961)
5^5 + 43^2 = 17^3 + 61^1    (3125, 1849, 4913, 61)
5^5 + 47^1 = 12^3 + 38^2    (3125, 47, 1728, 1444)
5^5 + 55^1 = 11^3 + 43^2    (3125, 55, 1331, 1849)
5^5 + 59^2 = 9^4 + 45^1    (3125, 3481, 6561, 45)
5^5 + 60^1 = 7^4 + 28^2    (3125, 60, 2401, 784)
5^5 + 60^1 = 14^3 + 21^2    (3125, 60, 2744, 441)
5^6 + 8^4 = 27^3 + 38^1    (15625, 4096, 19683, 38)
5^6 + 16^1 = 10^3 + 11^4    (15625, 16, 1000, 14641)
5^6 + 20^4 = 9^1 + 56^3    (15625, 160000, 9, 175616)
5^6 + 35^2 = 7^5 + 43^1    (15625, 1225, 16807, 43)
5^6 + 45^2 = 26^3 + 74^1    (15625, 2025, 17576, 74)

So in what I would like to extract from the file, the last line would only be included if each of "5^6", "45^2", "26^3" and "74^1" appears on at least four different lines of the entire file. Thanks for any help!

Don_Cragun · June 14, 2017, 10:15pm

Is this a homework assignment? Homework and coursework questions can only be posted in the Homework & Coursework Questions forum under special homework rules.

Please review the rules, which you agreed to when you registered, if you have not already done so.

If this post is not homework, please explain the company you work for and the nature of the problem you are working on. And, tell us what operating system and shell you're using, and show us what you have tried to do to solve this problem on your own.

If you did post homework in the main forums, please review the guidelines for posting homework and repost.

uncleMonty · June 15, 2017, 8:15am

Thanks for the friendly welcome Don. I haven't had any homework assignments for over 25 years. I'm a hobbyist working on a maths problem. I wrote a little C program to generate this data, and want to sort through it with shell tools as an intermediate step to solving the problem empirically (as a hint to myself, before I try to solve it mathematically). I am using Bash by default, since it is the default shell on my laptop running OS 10.6, but other shells are available. What I have done so far: stared at it and realised I don't know how to do this kind of multi-line search with the handful of shell commands I have taught myself over the last 30 years (and only used very infrequently, when such problems come up). I suppose I could also have tried to do this weeding out within my C program, but I can't see how to do it without having to hold everything in memory all at once (again, I write such programs very infrequently). So, it seems better to write it to a file then use some other tool in the shell to search that file. Hence my posting here. I'm sure there is a better way, but I break out my C and shell scripts about once every 6 months and at my age it's often easier to ask.

Is there anyone less suspicious who might be able to point me in a useful direction?

RudiC · June 15, 2017, 8:26am

No reason to become ironic. This forum has a strong reputation for NOT helping students and / or candidates cheat their way through classwork or exams, so questions of that kind are appropriate and accepted.

Still: welcome to the forum.

For your problem, try

awk '{CNT[$1]++; CNT[$3]++;CNT[$5]++; CNT[$7]++} END {for (c in CNT) if (CNT[c] > 3) print c, "occurs", CNT[c], "times."}' file
15^3 occurs 4 times.
5^4 occurs 8 times.
5^5 occurs 21 times.
5^6 occurs 5 times.

It doesn't check if terms occur twice in one line, but the chances of that happening are quite low, I believe.
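If repeats within a line could occur, a variation along these lines might count each term at most once per line. This is only a sketch: the three-line sample, the function name count_lines, the variable min and the array SEEN are made up for illustration and are not part of the data posted above.

```shell
# Made-up sample: the first line repeats 5^4 in fields 1 and 5.
sample=$(mktemp)
printf '%s\n' \
  '5^4 + 32^1 = 5^4 + 21^2' \
  '5^4 + 34^2 = 12^3 + 53^1' \
  '6^3 + 34^2 = 12^3 + 21^2' > "$sample"

# Report every term found on at least $1 different lines of file $2.
count_lines() {
  awk -v min="$1" '
  {
      split("", SEEN)                  # portable way to empty an array
      for (i = 1; i <= 7; i += 2)      # fields 1, 3, 5 and 7 hold the terms
          if (!($i in SEEN)) {         # first time this term is met on this line
              SEEN[$i] = 1
              CNT[$i]++
          }
  }
  END {
      for (c in CNT)
          if (CNT[c] >= min) print c, "occurs on", CNT[c], "lines."
  }' "$2" | LC_ALL=C sort              # for (c in CNT) has no fixed order
}

count_lines 2 "$sample"
```

With this sample, 5^4 is reported as occurring on 2 lines, where the plain counter would report 3 occurrences.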

uncleMonty · June 15, 2017, 9:43am

Thank you Rudi. I should learn awk, shouldn't I. That is a good way to count the occurrences. Is there a way, having counted the occurrences, to echo an entire line, if and only if the 1st 3rd 5th and 7th field of that line all appear at least 4 times in the file? (For the smaller sample data I posted, it would find an answer if we searched for lines whose entries all appear at least twice, instead of four times.)

You are correct not to worry about repeats within a single line, this is ruled out by construction of the data.

P.s. apologies if I overreacted--I think what was irritating was not that someone would want to make sure my question wasn't homework (I agree that a forum can quickly become useless to experts if it is overrun by homework questions), but instead the order to "please explain the company you work for and the nature of the problem you are working on", not only because it is intrusive, but because it suggests that only people who work for a company with a work-related problem can legitimately ask for scripting assistance here. But: your forum, your rules, ok.

Don_Cragun · June 15, 2017, 10:52am

If you don't mind reading the file twice, it is pretty simple with awk:

awk -v cnt=2 '
FNR == NR {
	c[$1]++
	c[$3]++
	c[$5]++
	c[$7]++
	next
}
c[$1] >= cnt && c[$3] >= cnt && c[$5] >= cnt && c[$7] >= cnt' file file

With cnt set to 4, you don't get any output with your posted sample data. With cnt set to 2, this produces the output:

5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^5 + 32^2 = 8^4 + 53^1    (3125, 1024, 4096, 53)
5^5 + 60^1 = 14^3 + 21^2    (3125, 60, 2744, 441)

You haven't told us what operating system you're using... If you're using a Solaris/SunOS system, you'll need to change awk in the above to /usr/xpg4/bin/awk or nawk.

RudiC · June 15, 2017, 10:53am

I'm certain Don Cragun will accept the apologies. The forum maintainers' concern is less about the forum becoming useless (people in here REALLY like to help, even with minor problems) than about keeping up the quality of IT education. If a student fills in the homework form, including institution, course and professor, s/he will be helped to develop in the right direction and find a solution of his/her own; cf. http://www.unix.com/homework-and-coursework-questions/. By the way, a vague comment about a person's company like "chemical" or "administration" would have sufficed, or even you telling us you're a hobbyist.

Back to your problem. Outputting the entire line that satisfies a condition means either keeping ALL lines in memory (demanding for BIG files) or running through the input file twice: once for counting, once for printing. The second approach is used here:

awk 'NR == FNR {CNT[$1]++; CNT[$3]++;CNT[$5]++; CNT[$7]++; next} CNT[$1] > 1 && CNT[$3] > 1 && CNT[$5] > 1 && CNT[$7] > 1 ' file file
5^5 + 18^2 = 15^3 + 74^1    (3125, 324, 3375, 74)
5^5 + 32^2 = 8^4 + 53^1    (3125, 1024, 4096, 53)
5^5 + 60^1 = 14^3 + 21^2    (3125, 60, 2744, 441)

To raise the count limit to four, change all the 1s to 3 in the four comparisons in the second part.
And, yes, you're right: awk is a very powerful tool for text file analyses...
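For completeness, the other option (keeping all lines in memory and reading the input only once) might look roughly like the sketch below. The sample lines, the function name filter_one_pass, the variable min and the arrays LINE and f are invented for illustration; for BIG files the memory cost mentioned above still applies.

```shell
# Made-up sample: only the first two lines share all four terms.
sample=$(mktemp)
printf '%s\n' \
  '1^2 + 2^3 = 3^4 + 4^5    (a)' \
  '1^2 + 2^3 = 3^4 + 4^5    (b)' \
  '1^2 + 2^3 = 3^4 + 9^9    (c)' > "$sample"

# Print each line of file $2 whose four terms all occur at least $1 times.
filter_one_pass() {
  awk -v min="$1" '
  {
      LINE[NR] = $0                            # remember every line
      CNT[$1]++; CNT[$3]++; CNT[$5]++; CNT[$7]++
  }
  END {
      for (n = 1; n <= NR; n++) {
          split(LINE[n], f)                    # re-split the stored line into fields
          if (CNT[f[1]] >= min && CNT[f[3]] >= min &&
              CNT[f[5]] >= min && CNT[f[7]] >= min)
              print LINE[n]
      }
  }' "$2"
}

filter_one_pass 2 "$sample"
```

Because the input is read only once, this form should also work at the end of a pipeline, where reading the input a second time is not possible.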

uncleMonty · June 15, 2017, 2:08pm

Thank you Don. I corrected "cnt[$5]" to "c[5]" and it worked (on bash in OS X 10.6).

---------- Post updated at 02:08 PM ---------- Previous update was at 02:03 PM ----------

Thanks Rudi. That works great. So far I avoided learning awk except for the simplest tasks, because the flow control in snippets I've grabbed here and there is so terse (to me it appears non-existent in your and Don Cragun's solutions). But I've just got hold of Dale Dougherty's book on sed and awk, and I'll be learning how this works now. (Where's the loop for reading the file on the first pass? Where's the conditional print? Etc... I can already see from looking over Dougherty's book that the answers are there, no need to answer.)

Don_Cragun · June 15, 2017, 2:52pm

Hi uncleMonty,
I apologize for the typo. It has now been corrected in my earlier post. Note that cnt[$5] should have been changed to c[$5] (NOT c[5])! The subscript $5 is the contents of the fifth field, while 5 is just the constant 5.

The general form of an awk command (as I'm sure you will find in your book or the awk man page on your system) is:

condition { action }

If the condition is not present, the given action is applied to every input line that reaches that statement. If the action and its surrounding braces are not present, the default action, print (which prints the current input line, including any modifications made to it by earlier statements), is taken for every line for which the condition evaluates to a non-zero number or a non-empty string. So, the awk statement:

c[$1] >= cnt && c[$3] >= cnt && c[$5] >= cnt && c[$7] >= cnt

prints any line for which the contents of fields 1, 3, 5, and 7 have each been seen cnt or more times.
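To make the implied flow control visible, the same two-pass script can be spelled out with an explicit action and an explicit if, roughly as sketched below. The sample lines and the function name filter_explicit are invented for illustration; the behaviour is intended to match the pattern-only form.

```shell
# Made-up sample: only the first two lines share all four terms.
sample=$(mktemp)
printf '%s\n' \
  '1^2 + 2^3 = 3^4 + 4^5    (a)' \
  '1^2 + 2^3 = 3^4 + 4^5    (b)' \
  '1^2 + 2^3 = 3^4 + 9^9    (c)' > "$sample"

# Same logic as the pattern-only version, with every step written out.
filter_explicit() {
  awk -v cnt="$1" '
  FNR == NR {             # condition: true only while reading the first copy
      c[$1]++; c[$3]++; c[$5]++; c[$7]++
      next                # skip the remaining statements for this line
  }
  {                       # no condition: runs for every line of the second copy
      if (c[$1] >= cnt && c[$3] >= cnt && c[$5] >= cnt && c[$7] >= cnt)
          print $0        # the action awk supplies when none is written
  }' "$2" "$2"
}

filter_explicit 2 "$sample"
```

The loop over input lines never appears in either form because awk itself reads each line in turn and runs every statement against it.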

uncleMonty · June 16, 2017, 10:12am

Yes, thanks for the description. And I've learned that the loop through the file the first time to build the array of counters is kept separate from the second loop that prints the line, via the `NR==FNR` trick that I read a good account of in the "two-file processing" section of this webpage: [EDIT: I thought I posted this yesterday but apparently I don't have enough "juice" to give a url on this forum. But the helpful webpage I was just consulting can be found on the backreference.org site with the title "idiomatic awk".]
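A tiny experiment along these lines seems to show why the trick works: NR keeps counting across both copies of the file, while FNR restarts at 1 for each one, so the two are equal only during the first pass. The two-line sample, the function name show_passes and the variable pass are made up for this sketch.

```shell
# Made-up two-line sample file.
sample=$(mktemp)
printf '%s\n' 'first line' 'second line' > "$sample"

# Read the same file twice and label which pass each line belongs to.
show_passes() {
  awk '{
      pass = (NR == FNR) ? "pass 1" : "pass 2"   # equal only in the first file
      print pass, "NR=" NR, "FNR=" FNR
  }' "$1" "$1"
}

show_passes "$sample"
```

The third line printed is "pass 2 NR=3 FNR=1": the overall count has reached 3, but the per-file count has started again.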