Issue in grep

sharad.40216 · June 28, 2012, 12:08pm

I have the following pattern in a file:

s6:s2
s2:s4
s1:s2:s3:s4:s5:s6
s1
.
.

Now I want to find the number of occurrences of each record in the file. For example, s6:s2 occurs twice (once as the first record, and both fields also occur in the third record).

So the output should be:

s6:s2   2
s2:s4   2
s1:s2:s3:s4:s5:s6   1
s1   2
........

How can I achieve this? There are also millions of records, so I am looking for the best approach.

Scrutinizer · June 28, 2012, 12:16pm

Try:

sort file | uniq -c

sharad.40216 · June 28, 2012, 12:28pm

With the above command we will get a count of 1 for each record, as each record is unique in my example.

I want the number of occurrences of each combination (one to many fields) across all records in the file, irrespective of the order.

vbe · June 28, 2012, 12:35pm

I don't think there is a simple way... You will have to start by creating a file of (unique) patterns, then use a loop that reads the pattern file and counts the occurrences...
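
A rough sketch of that idea, for illustration only. It assumes colon-separated fields, no blank lines, and that "occurrence" means every field of the pattern is present in a record in any order. The function name count_by_pattern_loop is made up for this example. It rescans the whole file once per unique pattern, so it is unlikely to be practical for millions of records as written.

```shell
# Hypothetical helper: for each unique record in "$1", count how many
# records contain all of its colon-separated fields (order ignored).
count_by_pattern_loop() {
  file=$1
  # Step 1: build the list of unique patterns (C locale for a stable order).
  LC_ALL=C sort -u "$file" |
  # Step 2: loop over the patterns, scanning the whole file once per pattern.
  while IFS= read -r pattern; do
    n=$(awk -F: -v pat="$pattern" '
      BEGIN { want_n = split(pat, want, ":") }
      {
        split("", have)                      # clear the per-record set
        for (k = 1; k <= NF; k++) have[$k] = 1
        for (k = 1; k <= want_n; k++)
          if (!(want[k] in have)) next       # a field is missing: skip record
        hits++
      }
      END { print hits + 0 }' "$file")
    printf '%s %s\n' "$pattern" "$n"
  done
}
```

Called as count_by_pattern_loop file, it prints one "pattern count" line per unique record, in sorted order rather than in the original file order.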

elixir_sinari · June 28, 2012, 12:39pm

OK... let's talk about this record:

s6:s2

What if s6 occurs 2 times and s2 occurs 6 times? What would be the expected output for such a record?

alister · June 28, 2012, 1:09pm

If I understood the original post, the goal is to count, for each record in the file, how many times all of the fields in that record occur together in a record (including itself), irrespective of order. If that's correct, then the answer to your question is, "it depends on how many times s2 and s6 are part of the same record."
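
To make that reading concrete, here is a minimal single-awk sketch of it. It is only my interpretation, not a confirmed requirement: it assumes colon-separated fields and no blank lines, holds every record in memory, and compares each record against every other, so the work grows with the square of the record count. The name count_subsets is invented for the example.

```shell
# Hypothetical helper: for each record of "$1", in file order, print the
# record and the number of records (itself included) that contain all of
# its fields, irrespective of order.
count_subsets() {
  awk '
    { rec[NR] = $0 }                         # keep every record in memory
    END {
      for (i = 1; i <= NR; i++) {
        want_n = split(rec[i], want, ":")
        count = 0
        for (j = 1; j <= NR; j++) {
          split("", have)                    # clear the set for record j
          have_n = split(rec[j], f, ":")
          for (k = 1; k <= have_n; k++) have[f[k]] = 1
          ok = 1
          for (k = 1; k <= want_n; k++)
            if (!(want[k] in have)) { ok = 0; break }
          count += ok
        }
        print rec[i], count
      }
    }' "$1"
}
```

On the four sample records from the first post this gives 2, 2, 1 and 2, which matches the requested output. A repeated field inside one record is counted once here, which is exactly the case the question above asks about.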

Regards,
Alister

ctsgnb · June 28, 2012, 1:29pm

OK, so your records (your lines) are made of a combination of 'simple' elements sX (no matter the order).

Is the number of simple elements large or limited? (Can these elements be represented by letters? Prime numbers? Processors on which you parallelize tasks?)
Do these simple elements have a relationship between them?
Does a simple element ever occur more than once in a record?

methyl · June 28, 2012, 4:31pm

Can you give us some real world scenario for this process?

Is the data sample a representative example of the real data? As posted, the short field length and the limited variation (six different values) keep the file relatively small and the number of permutations relatively low. As posted, there is potential to filter out the non-numeric characters and compose a ranking from the numbers. That would be a complete waste of time if this is not representative data.

Do you have a mainstream database engine and programming effort available?
Was the data extracted from a database? Can the processing be done without extracting the data to a flat file?