Merge two strings by overlapped region

Thanks DGPickett!
That's too comprehensive to integrate your idea into my code. What I lack is the practice, and I was wondering if anyone can correct my code. I like to learn by example, and not sure I should take some theory courses, if that would help.
For sure I have not yet caught the whole picture of "stack" and "heap" in theory, blank with them when handle situations in real C code.
Thank you again.
yifangt

Stack and heap are all just memory inside that 4-gigabyte array of bytes.

I repeat:

Fixing it to the point it will work isn't a matter of finding the bit which is "stopping it from working". The fundamental design of your program is flawed, and will continue to be flawed, even if we correct the obvious things.

I will rewrite some of it instead. Give me some time.

1 Like

Since VM can be mapped anywhere, they started by putting the environment, code, constants, initialized vaiables and unitialized variables in the bottom of memory, the heap, and the stack is started at top top of memory. growing toward each other. I think signal flags and open files are really in the kernel, and the fd is just an offset into a kernel per-process array of pointers to open files, so many processes can have the same open file on one or more fd numbers. The loader puts the arguments' pointers in an array with a terminating null pointer and starts up main(). If you wrote in C++, static objects' constructors would run before main(). Suppose you call strcmp(x,y). The code will push the adresses of x and y into the stack, and maybe allow a return value space, and call the strcpy code. Often values pushed are promoted to 4 byte integer even if short or char. If strcpy has automatic variables, they are located below the stack pointer setting after the call arguments. For calls within calls, each set of arguments and auto variables is called a stack frame. So, even though nobody else knows the address of a subroutine's static variables, the compiler/linker knows. The subroutine could pass a pointer to it, or load a global pointer with its address, to make it visible to others. Some even return it, which is not usual since it might not be MT-Safe! For instance, if you return a char* of a static char[50] with a null terminated string in it, unless it is const, someone might rewrite it, and if it is a product of the call input arguments, the next call will change the value, usually to something the first recipient of the pointer did not ask for. Some of the time libraries are like this, and have newer, safe variations.

1 Like
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void strmerg(char * const out, const char *str1, const char *str2) {
        // Position in 'str1'.  Counts backwards from end.
        // Each loop, str1+a will be 1 longer, i.e.  "t", "at", "tat", "atat", "Gatat"...
        int a=strlen(str1)-1;
        // How many bytes of overlap has been found.  0 if none.
        int found=0;
        // How many characters of overlap to check for.  Counts up from 1.
        int b=1;

        while(a >= 0) // Loop until a is negative
        {
                // Compare the last 'b' bytes of str1, to the first 'b' bytes of str2.
                // Using strncmp instead of strcmp prevents it from checking ALL of str2,
                // because strncmp takes maximum length as an argument.
                // It will return 0 if they are equal.
                if(strncmp(str1+a, str2, b) == 0) found=b;
                a--;
                b++;
        }

        strcpy(out, str1);
        // Strip off the first 'found' characters of str2.
        strcat(out, str2+found);
}

int main(int argc, char *argv[])
{
        char out[4096];
        if (argc != 3) {
                printf("Error! \nUsage: ./%s string1 string2\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        printf("string 1 is %s\n", argv[1]);
        printf("string 2 is %s\n", argv[2]);

        strmerg(out, argv[1], argv[2]);
        printf("Output is %s\n", out);

        return(0);
}

There's probably ways to make strmerg faster.

1 Like

This is very succinct code, and worked like charm. Thank you very much!
Questions: 1) As no malloc() was used to allocate memory for out, str1 and str2 in your strmerg() function. Is it because they are all declared const char* ?
2) In main() char out[4096] was used to hold the merged string, in practice, string 1 and string 2 can be as large as mega-bases. I thought dynamic allocation of the memory for out using a pointer would be better. It seems there is no such thing in C to dynamically allocate space for the merged string based on the two inputs (???). Can I ask if if there is anything inappropriate with my modification on main() function?

int main(int argc, char *argv[])
{
        char *out;
        if (argc != 3) {
                printf("Error! \nUsage: ./%s string1 string2\n", argv[0]);
                exit(EXIT_FAILURE);
        }

        out = malloc(sizeof(char)*(strlen(argv[1]) + strlen(argv[2]))+1);

        printf("string 1 is %s\n", argv[1]);
        printf("string 2 is %s\n", argv[2]);

        strmerg(out, argv[1], argv[2]);
        printf("Output is %s\n", out);

        free(out);
        return(0);
}

Thanks a lot!

Look closer, they're not all const char *.

It's nothing to do with malloc.

The two which are 'const char *' are set that way because I don't need to write to their contents.

The one which isn't 'const char *' is because I need to write to its contents.

I could have made them all plain 'char *' and it would have still worked. The 'const' is a reminder to me, the programmer, of which arguments I should be writing to and which I shouldn't. It's also a safety mechanism, so if I try to cram unwritable things into it, the compiler will complain.

...and having said so, you go ahead and do the "impossible" in the same breath :wink:

I avoided malloc because pointers still confuse you. I just thought it was more straightforward to do it that way.

Actually that looks fine. It allocates more memory than it needs to so isn't ideal, but will do what you want it to do, and won't crash.

Here's what I would do:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

// Returns newly-allocated memory which you must 'free' later.
char *strmerg(const char *str1, const char *str2) {
        // Note:  size_t and ssize_t are technically more correct for numbers holding string length,
        // particularly for LARGE strings.  I had to use ssize_t for 'a' since I need a negative number
        // for my while loop.

        // Position in 'str1'.  Counts backwards from end.
        // Each loop, str1+a will be 1 longer, i.e.  "t", "at", "tat", "atat", "Gatat"...
        ssize_t a=strlen(str1)-1;
        // How many bytes of overlap has been found.  0 if none.
        size_t found=0;
        // How many characters of overlap to check for.  Counts up from 1.
        size_t b=1;

        while(a >= 0) // Loop until a is negative
        {
                // Compare the last 'b' bytes of str1, to the first 'b' bytes of str2.
                // Using strncmp instead of strcmp prevents it from checking ALL of str2,
                // because strncmp takes maximum length as an argument.
                // It will return 0 if they are equal.
                if(str1[a] == str2[0]) // Optimization
                if(strncmp(str1+a, str2, b) == 0) found=b;
                a--;
                b++;
        }

        b=strlen(str1); // don't count 5 megabases more times than necessary

        // Code block only because some compilers require you to declare
        // all variables at the top of a code block
        {
                char * const out=malloc(sizeof(char)*(b+strlen(str2+found)+1));
                strcpy(out, str1);
                strcpy(out+b, str2+found); // Faster than strcat
                return(out);
        }
}

int main(int argc, char *argv[])
{
        if (argc != 3) {
                printf("Error! \nUsage: ./%s string1 string2\n", argv[0]);
                exit(EXIT_FAILURE);
        }
        else
        {
                char * const out=strmerg(argv[1], argv[2]);
                printf("string 1 is %s\n", argv[1]);
                printf("string 2 is %s\n", argv[2]);
                printf("Output is %s\n", out);
                free(out);
        }
        return(0);
}
1 Like

Here is a strmerg() similar to Corona688's. But, where his version checks for a match for every trailing substring of string 1; this version starts with the longest possible match based on the length of the shorter string and stops as soon as it finds a match. It also verifies that malloc() succeeded before copying data into the buffer allocated by malloc(). With multi-megabyte input strings of similar length and short matches, the speed difference is likely to be unnoticeable. But if there are frequent relatively long matching strings or if string 1 is longer than string 2, this version might be significantly faster.

This won't work on some older systems, because it uses <inttypes.h> to produce relatively portable printf() conversion specifications for objects of type size_t. And, Corona688's code makes much better use of the const qualifier, than this code does.

#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

char *
strmerg(const char *str1, const char *str2) {
	size_t	str1_l = strlen(str1),	// length of string1
		str2_l = strlen(str2),	// length of string2
		// match_l is longest possible match (length of shortest input)
		match_l = str1_l < str2_l ? str1_l : str2_l;
		// base is ptr to start of longest possible match in string1
	char	*base = (char *)str1 + str1_l - match_l,
		c1,		// 1st char in string2 possible match
		c2 = *str2,	// 1st char in string2
		*out;
	// At this point, match_l is the longest possible match (based on the
	// lengths of string1 and string2) and base points to the spot in
	// string1 where the longest possible match could start.

	// While there is a mismatch...
	while((c1 = *base) && ((c1 != c2) || strncmp(base, str2, match_l))) {
		// Decrement the longest possible match, and increment the
		// spot in string1 where the longest possible match could
		// still start.  Note that even though we aren't checking
		// the value of match_l in this loop, we are guaranteed that
		// the loop will end with match_l greater than or equal to
		// zero.  (If it is zero, the only match is the terminating
		// null byte in string1.)
		match_l--;
		base++;
	}
	// When we get to this point, match_l is the length of the longest
	// substring of the tail of string1 that matches the head of string2.

	// Allocate space to hold the resulting string:	length of string1 +
	// length of string2 - length of matched substring + 1 for the
	// terminating null byte.  NOTE:  The caller is responsible for
	// freeing the allocated space when it is no longer needed.
	out = malloc(str1_l + str2_l - match_l + 1);
	if(out) {
		// Copy string1 and unmatched portion of string2 to out.
		strcpy(out, str1);
		strcpy(out + str1_l, str2 + match_l);
	}
	return(out);
}

int
main(int argc, char *argv[]) {
	char *result;	// Space for the pointer that strmerg() will return

	if (argc != 3) {
		fprintf(stderr, "Error! Bad arg count\nUsage: %s %s\n",
			argv[0], "string1 string2");
		exit(EXIT_FAILURE);
	}

	printf("string1: \"%s\" (%"PRIuMAX" bytes)\n",
		argv[1], (intmax_t)strlen(argv[1]));
	printf("string2: \"%s\" (%"PRIuMAX" bytes)\n",
		argv[2], (intmax_t)strlen(argv[2]));

	result = strmerg(argv[1], argv[2]);
	if(result == NULL) {
		perror("strmerg() failed:");
		exit(EXIT_FAILURE);
	}
	printf("strmerg(string1, string2) output: \"%s\" (%"PRIuMAX" bytes)\n",
		result, (intmax_t)strlen(result));
	free(result);
	result = strmerg(argv[2], argv[1]);
	if(result == NULL) {
		perror("strmerg() failed:");
		exit(EXIT_FAILURE);
	}
	printf("strmerg(string2, string1) output: \"%s\" (%"PRIuMAX" bytes)\n",
		result, (intmax_t)strlen(result));
	free(result);
	return(0);
}

When compiled and linked into a.out , the command:

./a.out ACGTGCCC CCCCCGTGTGTGT

produces:

string1: "ACGTGCCC" (8 bytes)
string2: "CCCCCGTGTGTGT" (13 bytes)
strmerg(string1, string2) output: "ACGTGCCCCCGTGTGTGT" (18 bytes)
strmerg(string2, string1) output: "CCCCCGTGTGTGTACGTGCCC" (21 bytes)

and the command:

./a.out aaaaaaaaaa1 1aaaaaaaaaaaa

produces the output:

string1: "aaaaaaaaaa1" (11 bytes)
string2: "1aaaaaaaaaaaa" (13 bytes)
strmerg(string1, string2) output: "aaaaaaaaaa1aaaaaaaaaaaa" (23 bytes)
strmerg(string2, string1) output: "1aaaaaaaaaaaa1" (14 bytes)
2 Likes

Corona & Don!
These two should be added to the <string.h> as strmerg() function like the others, e.g. strcat(), strcpy() ... etc.

Don's is better. Wish I'd thought of it :slight_smile:

You're not supposed to modify string.h

You can put it in your own .h file easily enough.

#ifndef __MYSTR_H__
#define __MYSTR_H__

#ifdef __cplusplus
extern "C" {
#endif

char *strmerg(const char *str1, const char *str2);

#ifdef __cplusplus
}
#endif

#endif/*__MYSTR_H__*/
/* mystr.c */
#include <string.h>
#include <stdlib.h>

char *
strmerg(const char *str1, const char *str2) {
	size_t	str1_l = strlen(str1),	// length of string1
		str2_l = strlen(str2),	// length of string2
		// match_l is longest possible match (length of shortest input)
		match_l = str1_l < str2_l ? str1_l : str2_l;
		// base is ptr to start of longest possible match in string1
	char	*base = (char *)str1 + str1_l - match_l,
		c1,		// 1st char in string2 possible match
		c2 = *str2,	// 1st char in string2
		*out;
	// At this point, match_l is the longest possible match (based on the
	// lengths of string1 and string2) and base points to the spot in
	// string1 where the longest possible match could start.

	// While there is a mismatch...
	while((c1 = *base) && ((c1 != c2) || strncmp(base, str2, match_l))) {
		// Decrement the longest possible match, and increment the
		// spot in string1 where the longest possible match could
		// still start.  Note that even though we aren't checking
		// the value of match_l in this loop, we are guaranteed that
		// the loop will end with match_l greater than or equal to
		// zero.  (If it is zero, the only match is the terminating
		// null byte in string1.)
		match_l--;
		base++;
	}
	// When we get to this point, match_l is the length of the longest
	// substring of the tail of string1 that matches the head of string2.

	// Allocate space to hold the resulting string:	length of string1 +
	// length of string2 - length of matched substring + 1 for the
	// terminating null byte.  NOTE:  The caller is responsible for
	// freeing the allocated space when it is no longer needed.
	out = malloc(str1_l + str2_l - match_l + 1);
	if(out) {
		// Copy string1 and unmatched portion of string2 to out.
		strcpy(out, str1);
		strcpy(out + str1_l, str2 + match_l);
	}
	return(out);
}
1 Like

Thanks Corona!

Before I start my code, I searched for a while trying to find this string merge function like strcat(), strcpy() etc in the <stdlib.h>, and turned out there is no function for this purpose, surprisingly. I feel this strmerg() function is quite daily needed for bioinformatics, although this is not exactly the longest common string problem (LCS) in algorithm, that's another story. Very good answers, thank you again!

C's library of string functions isn't especially powerful. Regular expressions aren't included either (though there is an extension, -lregex). There's better languages for that.

1 Like