All possible combinations problem

Hi, try:

awk '
  {
    match($0,/1+/)
    b=substr($0,1,RSTART-1)
    e=substr($0,RSTART+RLENGTH,length)
    for(i=2^RLENGTH-2; i>0; i--) {
      s=x; d=i
      while(d) {
        s=(d%2==0?0:1) s
        d=int(d/2)}
        printf "%s%0" RLENGTH "s%s\n",b,s,e
    }
  }
'

Output:

b_0000000001100000000000
b_0000000000100000000000
b_0000000000100000000000
b_0000000000110000000000
b_0000000000100000000000
b_0000000000010000000000

Hello,

I have to create all possible combinations of some string data and if possible, would like to be able to do this with a script instead of a full blown cpp program.

The input data is string and looks like,
b_00000000011100000000000

For a given input string, I need to generate all of the strings that could be considered subsets of the input. This just means any combination of matching 1s up to n-1. The leading b_ is not relevant. For example,

for input string,
b_00000000011100000000000

output,
b_00000000011000000000000
b_00000000001100000000000
b_00000000010100000000000
b_00000000010000000000000
b_00000000001000000000000
b_00000000000100000000000

This is fairly straightforward for a string with three 1s, but I have strings with up to 10 1s, so this needs to be done algorithmically. If there were 10 1s, I would need to generate all of the overlapping patterns of 9 1s and smaller.

I guess if I was programming this in cpp, I would start with each single 1 and then add each other 1 to make all of the pairs, then use the pairs to make the triplets, etc. Code for this kind of thing often looks like creating paths with a breadth first search, like adding neighbor vertex numbers. I have no idea how to do that kind of thing in a scripting language.

If anyone can give me some suggestions, that would be a big help.

LMHmedchem

Somehow, my post came on top of the thread instead of the last post, whereas post #2 is the original post.. Hmm.. No sure how to correct this...
Have not got time for it now...

1 Like

I tried to reply to your post and my reply was added on to my original post as an edit and your post was simply gone.

Gremlins aside, thank you very much for your post.

I tried your code like,

#! /bin/bash

input_string='bk_00000000011100000000000'

echo $input_string | 
awk '
  {
    match($0,/1+/)
    b=substr($0,1,RSTART-1)
    e=substr($0,RSTART+RLENGTH,length)
    for(i=2^RLENGTH-2; i>0;  i--) {
      s=x; d=i
      while(d) {
        s=(d%2==0?0:1) s
        d=int(d/2)}
        printf "%s%0" RLENGTH "s%s\n",b,s,e
    }
  }
' 

The output is very close,

b_00000000011000000000000
b_00000000010100000000000
b_00000000010000000000000
b_000000000 1100000000000
b_000000000 1000000000000
b_000000000  100000000000

You can see that there are missing 0s for some of the output strings. In some cases, there is one 0 missing, in others, there are two.

I could probably hack this to replace space with 0 if necessary. I am also wondering if this would work with non-consecutive 1s such as for the example.

b_10000000000100000000001

or

b_00000000010110000000000

etc.

I tried the above script with an input string with many more 1s and there is no output at all. This is the string I tried.

b_10001010011100110101000

LMHmedchem

Hi LMHmedchem...

You do realise that this 23 bit word would give 8388607 combinations if all bits were set to 1?!

Is this a possibility along with just any single random bit which would not have a subset at all...

Or am I missing something?

1 Like

Yes, this is theoretically true. In my data, the most complex example has 11 bits set to 1. That is still quite a large number of combinations. There will never be a case with 0 on bits, but there will be some with 1 on bit that has no subsets as you point out.

It may be sufficient for me to analyze subesets up to some size, such as 5 on bits or something like that, I don't know yet.

At any rate, one of the reasons to use an automated algorithm is to be able to chew through very large numbers of permutations. At the end of the day, I need to compare the all possible subsets list to an actual list to see how many matches there are. The total number of matches will be very small, even if the list of possible matches is on the order of 10^7. I really don't care very much if this takes a long time to run (as long as it's not days), so I just have to get this automated so I can start looking at results.

LMHmedchem

If I correctly understand what you're trying to do (and I'm not sure that I do), you might want to try:

#! /bin/bash

input_string="${1:-"bk_00000000011100000000000"}"

echo "$input_string" | awk -F1 '
# Compute 2**p - 1 for p >= 1
function two_e2m1(p,	i, v) {
	v = 0
	for(i = 1; i < p; i++)
		v = 2 * v + 1
	return(v)
}
NF {	printf("%s is input to be processed.\n", $0)
	for(i = two_e2m1(NF) - 1; i > 0; i--) {
		v = i
		for(j = 1; j < NF; j++) {
			d[NF - j] = v % 2
			v /= 2
		}
		for(j = 1; j < NF; j++)
			printf("%s%d", $j, d[j])
		print $NF
	}
}'

which, when invoked with no operands produces the output:

bk_00000000011100000000000 is input to be processed.
bk_00000000011000000000000
bk_00000000010100000000000
bk_00000000010000000000000
bk_00000000001100000000000
bk_00000000001000000000000
bk_00000000000100000000000

and, when invoked with the operand bk_001001001001000000000 produces the output:

bk_001001001001000000000 is input to be processed.
bk_001001001000000000000
bk_001001000001000000000
bk_001001000000000000000
bk_001000001001000000000
bk_001000001000000000000
bk_001000000001000000000
bk_001000000000000000000
bk_000001001001000000000
bk_000001001000000000000
bk_000001000001000000000
bk_000001000000000000000
bk_000000001001000000000
bk_000000001000000000000
bk_000000000001000000000

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

1 Like

Thank you for that code, it seems to work well.

I have checked the output on some patterns up to 5 on bits (5 1s) and it looks correct. As far as I can tell, there should be 2^n output patterns where n=the number of on bits. Do I have that right? For my test pattern of 5 on bits, there are 2^5=25 output patterns.

This is the output organized for simplified analysis,

# for the input pattern
bk_00110000000000110001000

# 4 on bits (change 1 1 to 0)
bk_00010000000000110001000
bk_00100000000000110001000
bk_00110000000000010001000
bk_00110000000000100001000
bk_00110000000000110000000

# 3 on bits (change 2 1s to 0)
bk_000000000000001100011000

bk_00010000000000010001000
bk_00010000000000100001000
bk_00010000000000110000000

bk_00100000000000010001000
bk_00100000000000100001000
bk_00100000000000110000000

bk_00110000000000000001000
bk_00110000000000010000000
bk_00110000000000100000000

# 2 on bits (change 3 1s to 0)
bk_00000000000000010001000
bk_00000000000000100001000
bk_00000000000000110000000

bk_00100000000000010000000
bk_00100000000000100000000
bk_00100000000000000001000

bk_00010000000000010000000
bk_00010000000000100000000
bk_00010000000000000001000

bk_00110000000000000000000

# 1 on bit (change 4 1s to 0)
bk_00000000000000000001000
bk_00000000000000010000000
bk_00000000000000100000000
bk_00010000000000000000000
bk_00100000000000000000000

As far as I can tell, this is what the output should be. Please let me know if anyone sees anything amiss.

The next thing I need to do is to add a function to check each subset generated against a list of subsets and look for matches.

My revised script looks like,

#! /bin/bash

function check_against {

   check_string=$1
   pattern_match=0

   # declare list of patterns to check agains
   check_list=( bk_00110000000000000001000 \
                bk_00110000000000000001000 \
                bk_00010000000000000001000 \
                bk_00010000000000000001000 \
                bk_00010000000000000001000 \
                bk_00010000000000000001000 \
                bk_10001010011100000101000 \
                bk_00001010111100000101000 \
                bk_10001110000000000101010 \
                bk_10001110000000110101000 \
                bk_11001110000000110101000 \
                bk_11110011011100110000000 \
                bk_00110000000000110000010 )

   # loop through check_list and compare each element to check_string
   for check_against_string in "${check_list[@]}"
   do
      if [ "$check_string" == "$check_against_string" ]; then
         pattern_match=$((pattern_match+1))
      fi
   done
   # if any matches were found, output match and number of matches
   if [ "$pattern_match" != "0" ]; then
      echo -e "$check_string\t$pattern_match"
   fi

}

# input string
input_string="${1:-"bk_00110000000000110001000"}"

# capture output of awk into string
subsets_list=$(

echo "$input_string" | awk -F1 '
# Compute 2**p - 1 for p >= 1
function two_e2m1(p,	i, v) {
	v = 0
	for(i = 1; i < p; i++)
		v = 2 * v + 1
	return(v)
}
NF {	printf("%s is input to be processed.\n", $0)
	for(i = two_e2m1(NF) - 1; i > 0; i--) {
		v = i
		for(j = 1; j < NF; j++) {
			d[NF - j] = v % 2
			v /= 2
		}
		for(j = 1; j < NF; j++)
			printf("%s%d", $j, d[j])
		print $NF
	}
}'

)

# parse subsets_list on newline
IFS=$'\n' read -rd '' -a check_list <<<"$subsets_list"

for element in "${check_list[@]}"
do
   # pass each element to check against function
   check_against $element
done

This is a bit of a hack, but I capture the output of Don Cragun's awk code in a string variable and then parse that into an array on newline. I then iterate through the array and pass each element to a function that compares the element against an array and counts the number of matches found. If any matches are found, the matching pattern is printed along with the number of matches.

This gives me the output I need as far as I can tell. Please let me know if anyone sees problems with this approach or has other suggestions.

LMHmedchem

Please be aware that 2^5 = 32, not 25 (= 5^2).
If your input has all bits of interest set, and the all zero variant is to be excluded, you'll have 2^n - 2 patterns to work on.

1 Like

I guess my question at this point is whether or not the above code is giving my all of the combinations I am looking for. For the examples I tested, the output looks correct, meaning that I can't come up with any combinations that are missing. It also appears that I counted the output incorrectly. There are 30 subsets printed by Don Cragun's code, which matches your expected number.

My most complicated patter has 11 on bits, which corresponds to 2046 subsets according to your algorithm. Each of these will be checked against a check list of ~2000 patterns. This will likely not be lightning fast using the script above, but do you see any potential issues with memory?

LMHmedchem

Holding two arrays of ~2000 elements each in memory shouldn't be too much a problem, but - without completely understanding your problem nor approach - there might be another approach that might work. Do you just need to check if the bits set in an input value are allowed by a "mask"? Then e.g. an "OR" operation might do...

As RudiC has already stated, 2^5 is 32 (5^2 is 25). And the output my awk script produces for n "1" bits set is the input sample provided on the first line of output followed by ((2^n) - 2) lines treating the "1" bits in the input as bits in a binary number and each output line counts that binary number down from ((2^n)) - 1 to 1 . Which is a trivial arithmetic progression that guarantees that there are no duplicates in the output. This approach should not have any problems at all until you get into a binary number big enough that awk would treat it as a floating point value instead of as an integer value. With the maximum of eleven "1" bits you stated would be present in your input samples, this will NOT be a problem.

It is not arranged by decreasing number of bits turned on as you do in your sample output "organized for simplified analysis". I find analyzing it that way to be much more difficult since you have to spend more time determining which subsets of the output should be included in each set. If you just look at the "1" bits in the output lines generated, the binary countdown should be easy to follow and verify as long as you understand what you're looking for.

If you really need a script to verify that the output my script produces does not include any duplicates, a very simple way to do that would be to use:

awk '$1 in a{print $0 "duplicated"}{a[$1]}END{print NR, "lines processed"}'

to filter the output produced by my earlier script. The value printed at the end should always be (2^n) - 1 where, again, n is the number of "1"s in the input string.

If you can't visually determine that the output produced from my script does not include any "1" bits where there weren't any "1" bits in the input, you could also verify that with an awk script, but given my confidence in my code, I won't take the time to write that additional script for you.

Checking for a set bit outside a "mask" value can be easily done with "binary" operations; unfortunately, awk doesn't allow for this, afaik. bash can do it:

V1=$((2#1))
V8=$((2#01000))
MASK=$((2#11100))
echo $MASK
28
echo $((V1 | MASK))
29
echo $((V8 | MASK))
28
echo $(((V8 | MASK) == MASK))
1
echo $(((V1 | MASK) == MASK))
0

You see that bit0 is NOT in MASK, but bit7 is.