Assuming that the 1st line from:
10.0.0.2 10.0.0.1
10.0.0.5 10.0.0.1 10.0.0.2
should also be removed as a subset of the 2nd line and that the 1st line from:
10.0.0.2
10.0.0.21
should not be removed as a subset of the 2nd line, I can't see how sed would be a good tool for attacking this problem. The periods in IP addresses would also need to be handled specially in sed since they have a special meaning in regular expressions. At first glance, awk seems to me to be much better suited to this problem than sed. I haven't done enough with python to compare awk and python for a problem like this.
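To make the two pitfalls above concrete, here is a minimal sketch (python is used only for illustration, not as a recommendation over awk). It assumes addresses are separated by whitespace; the helper name is_token_subset is made up for this example.

```python
import re

def is_token_subset(short_line: str, long_line: str) -> bool:
    """True if every whitespace-separated address in short_line
    also appears as a whole field in long_line."""
    return set(short_line.split()) <= set(long_line.split())

# An unescaped "." matches any character, so this pattern also
# matches text that is not an IP address at all.
unescaped_hit = re.search("10.0.0.2", "10x0y0z2") is not None

# Escaping the dots fixes that, but a plain substring search still
# finds 10.0.0.2 inside 10.0.0.21.
escaped_hit = re.search(re.escape("10.0.0.2"), "10.0.0.21") is not None

# Comparing whole fields avoids both problems.
same_fields = is_token_subset("10.0.0.2 10.0.0.1", "10.0.0.5 10.0.0.1 10.0.0.2")
prefix_only = is_token_subset("10.0.0.2", "10.0.0.21")
```

Comparing whole fields rather than matching text is the same idea awk gives you for free with its field splitting.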
And, while sort -u will get rid of duplicate lines, it won't reduce the problem you need to solve with input like:
10.0.0.5 10.0.0.1 10.0.0.2
10.0.0.5 10.0.0.2 10.0.0.1
10.0.0.1 10.0.0.5 10.0.0.2
10.0.0.1 10.0.0.2 10.0.0.5
10.0.0.2 10.0.0.1 10.0.0.5
10.0.0.2 10.0.0.5 10.0.0.1
10.0.0.1 10.0.0.2
10.0.0.1 10.0.0.5
10.0.0.2 10.0.0.1
10.0.0.2 10.0.0.5
10.0.0.5 10.0.0.1
10.0.0.5 10.0.0.2
10.0.0.1
10.0.0.2
10.0.0.5
which I assume should be reduced to a single output line that matches one of the first six lines above.
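As a rough sketch of that assumption (and only a sketch), the following treats each line as a set of addresses and drops any line whose set is contained in another line's set. The function name reduce_sequences is hypothetical, the lines are assumed to be whitespace-separated, and the approach compares every line against all lines kept so far, so it is quadratic in the worst case and may be too slow for very large inputs.

```python
def reduce_sequences(lines):
    """Keep only lines whose address set is not contained in the
    address set of another line (duplicates, permutations, and
    subsets are all dropped)."""
    kept = []  # list of (frozenset of addresses, original line text)
    for line in lines:
        addrs = frozenset(line.split())
        if not addrs:
            continue  # skip blank lines
        # Already covered by a line we kept: duplicate, permutation, or subset.
        if any(addrs <= seen for seen, _ in kept):
            continue
        # The new line may cover some previously kept, shorter lines.
        kept = [(seen, text) for seen, text in kept if not seen <= addrs]
        kept.append((addrs, line))
    return [text for _, text in kept]
```

With the 15 sample lines above as input, this keeps only the first three-address line it encounters; which of the six permutations survives depends on input order.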
But, if you have a lot of duplicate lines in your input, using sort -u as a preprocessing step may reduce your overall run time. Without a much deeper understanding of how many duplicates there are and how you process lines while looking for duplicates, we would just be making wild guesses as to whether using sort -u would increase or decrease run times.
Preprocessing your input so that all of the IP addresses in each line are in sorted order might also be an advantage. Whether or not it would help again depends on what your input looks like and how you are looking for duplicates. It might also speed up processing if your input were sorted with the longest sequences coming first.
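One possible shape for that preprocessing, offered as a sketch under assumptions rather than a tested recipe: it assumes every field is a valid IPv4 address, that a repeated address within one line can be collapsed, and that the names canonical and preprocess are just placeholders. Whether it actually saves time depends on your data.

```python
import ipaddress

def canonical(line):
    """Return the line's distinct addresses as a tuple in numeric order,
    so that permutations of the same addresses become identical."""
    return tuple(sorted(set(line.split()), key=ipaddress.ip_address))

def preprocess(lines):
    """Canonicalize each line, drop exact duplicates (similar in effect
    to sort -u after canonicalizing), and put the longest sequences first."""
    unique = {canonical(line) for line in lines if line.strip()}
    # Longest first; ties are broken by the tuple itself for a stable order.
    return sorted(unique, key=lambda seq: (-len(seq), seq))
```

With the longest sequences first, a later and shorter sequence can only be a subset of something already seen, never a superset of it, which simplifies whatever subset check follows.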
Is this a one-time problem, or is this something you're going to be doing on a regular schedule? Getting a correct result is crucial. Getting a correct result as quickly as possible might not be important if you only need to do this once.
If you show us your code, we have a MUCH better chance of helping you correct or optimize what you're doing.
Giving us more details about your input could also make a huge difference in the way some of us would approach a problem like this. What are the minimum and maximum number of IP addresses in your sequences? Do all of the IP addresses in a single sequence have the same 1st component (as in 10.0.0.1, 10.0.0.2, and 10.0.0.5)? Everything you know about your data could provide hints that could optimize code used to solve this problem.
You have more than 100K input sequences. How much "more than"? What OS are you using? What hardware are you using? How much memory do you have?