Merge two strings by overlapped region

Hello, I am trying to concatenate two strings by merging the overlapped region. E.g.

Seq1=ACGTGCCC
Seq2=CCCCCGTGTGTGT
Seq_merged=ACGTGCCCCCGTGTGTGT

The standard strcat(char *dest, const char *src) appends src to dest, but it does not account for an overlapping region (a prefix of src that matches a suffix of dest). After googling for a while, this seems to be related to computing the longest common substring, which is too big a problem for me.
I have tried the following code, but it always gives a wrong result: Seq_merged=ACGTGCCCCCCGTGTGTGT, which has an extra "C". What did I miss?

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLEN 4096

//strmerg was from: http://effprog.wordpress.com/2010/11/18/concatenation-of-two-strings-omitting-overlapping-string/
char *strmerg(char *dst, const char *src)
{
    size_t dstLen = strlen(dst);
    size_t srcLen = strlen(src);

    char *p = dst + dstLen + srcLen;            /* Pointer to the end of the concatenated string */
    const char *q = src + srcLen - 1;            /* Pointer to the last character of the src */
    char *r = dst + dstLen - 1;                    /* Temp Pointer to the last character of the dst */
    char *end = r;                                /* Permanent Pointer to the last character of the dst */
    *p = '\0';                                    /*terminating the concatened string with NULL character */

    while (q >= src) 
{        /*Copy src in reverse */
    if (*r == *q) {                                /*Till it matches with the src, decrement r */
        r--;
    } else {
        r = end;
        if (*r == *q) {
        r--;
        }
    }

    *p-- = *q--;
    }

    while (r >= dst)                            /*Copy dst, ending with r */
    *p-- = *r--;

    return p + 1;
}

int main(int argc, char **argv)
{
    char *str1, *str2;        //Original two strings
    char *str3;                //resulting string

    str1 = malloc(sizeof(char) * MAXLEN);    //allocate memory
    str2 = malloc(sizeof(char) * MAXLEN);    //allocate memory

    str3 = malloc(sizeof(char) * MAXLEN * 2);    //allocate memory, maximum space needed is the sum of the two original string lengths

    if (argc != 3) {
    printf("Error! \nUsage: ./arg[0]=program argv[1]=string1 argv[2]=string2\n");
    exit(EXIT_FAILURE);
    }

    strcpy(str1, argv[1]);
    strcpy(str2, argv[2]);

    printf("Input strings are: \nSeq1=%s\nSeq2=%s\n", str1, str2);

    str3=strmerg(str1, str2);
    printf("\nConcatenated string is: Seq_merged=%s\n", str3);
/*Some problem with these free(), do not know why?
free(str1);
free(str2);
free(str3);
*/
    return 0;
}

I tried more cases; the problem seems to occur when the overlapping region is repetitive.

./prog ACGTGCCC CCCCCGTGTGTGT 
Seq1=ACGTGCCC
Seq2=CCCCCGTGTGTGT 
Seq_merged=ACGTGCCCCCCGTGTGTGT 
./prog ACGTGatcg atcgCCGTGTGTGT
Seq1= ACGTGatcg
Seq2= atcgCCGTGTGTGT
Seq_merged=ACGTGatcgCCGTGTGTGT
./prog ACGTGatatat atatCCGTGTGTGT
Seq1=ACGTGatatat
Seq2=atatCCGTGTGTGT
Seq_merged=ACGTGatatatatatCCGTGTGTGT

Can anyone have a look at it for me? Thanks a lot!

How about trimming the end of string1 for as long as the concatenation (trimmed string1 + string2) still begins with the original string1?
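
For what it's worth, here is a minimal sketch of one way to do the merge. It is not the strmerg() from the linked blog; the names overlap_len and merge_overlap are made up for illustration. It tries the longest possible overlap first and works down, then returns a freshly malloc'd result that the caller must free().

```c
#include <stdlib.h>
#include <string.h>

/* Length of the longest suffix of a that is also a prefix of b.
   (Hypothetical helper, not part of any library.) */
size_t overlap_len(const char *a, const char *b)
{
    size_t alen = strlen(a), blen = strlen(b);
    size_t k = alen < blen ? alen : blen;   /* overlap cannot exceed the shorter string */

    for (; k > 0; k--)                      /* longest candidate first */
        if (memcmp(a + alen - k, b, k) == 0)
            return k;
    return 0;                               /* no overlap at all */
}

/* Returns a newly allocated merged string (caller frees), or NULL on failure.
   Neither input is modified. */
char *merge_overlap(const char *a, const char *b)
{
    size_t alen = strlen(a), blen = strlen(b);
    size_t k = overlap_len(a, b);
    char *out = malloc(alen + (blen - k) + 1);

    if (out == NULL)
        return NULL;
    memcpy(out, a, alen);                     /* all of a */
    memcpy(out + alen, b + k, blen - k + 1);  /* rest of b, including the '\0' */
    return out;
}
```

With the inputs from the first post, merge_overlap("ACGTGCCC", "CCCCCGTGTGTGT") should give ACGTGCCCCCGTGTGTGT. The scan is quadratic in the worst case, which is probably fine for strings of a few thousand characters.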

In case it matters to you, be aware that your initial strcpy's from argv are unsafe (if the command line arguments exceed your definition of MAXLEN).
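
A small sketch of what a length-checked copy could look like, assuming you keep a fixed MAXLEN-sized buffer. The helper name copy_bounded is hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Copy s into dst only if it fits, terminator included.
   Returns 0 on success, -1 if s is too long for a buffer of dstsz bytes. */
int copy_bounded(char *dst, size_t dstsz, const char *s)
{
    size_t n = strlen(s) + 1;   /* bytes needed, counting the '\0' */

    if (n > dstsz)
        return -1;              /* refuse instead of overflowing */
    memcpy(dst, s, n);
    return 0;
}
```

In main() this would be called as copy_bounded(str1, MAXLEN, argv[1]), checking the return value before going on.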

Regards,
Alister

Use the argv strings where they lie (they are already in the process's memory): just assign the char* to an identifying variable, and do not make a copy. If you must copy, malloc for the strlen+1, or go to C++ (RWCString) or Java. You just need a char* for str1 and str2, and two dynamically sized char[] buffers, "last good" and "trial", each of strlen(str1)+strlen(str2)+1 bytes. The test is memcmp(str1, trial, strlen(str1)). When you have trimmed str1 to nothing, or the test mismatches, the last good one is the answer.

You do not need to free() when you exit(), exit() does it all: fflush(), fclose(), close() (socket disconnect, TCP DB session rollback) and virtual free(). All memory for a process is released on exit(). Memory leaks are a problem for daemons, which almost never exit(), and internal processing loops.

You cannot copy a char[] with = here: str3=strmerg(str1, str2); That overwrites the value malloc placed in str3, which is the only handle free() can use, so it is a memory leak since you did not save that value for free(). Subroutines that return a char[] can use a static buffer, but that value is destroyed at the next call and is not MT-safe; or they can malloc a new buffer to return, whose free() falls on the caller; or, more usually, the caller passes the buffer in as an additional argument, together with its size if the size is not explicit, as in the improvement of sprintf() to snprintf() and of localtime() to localtime_r().
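
To make the last option concrete, here is a rough sketch of the "caller passes the buffer and its size" convention, modelled on snprintf(). The function join_into is a made-up name, and it does a plain join rather than the overlap merge, just to show the interface:

```c
#include <stdio.h>

/* Join a and b into out without ever writing more than outsz bytes.
   Like snprintf(), returns the length the full result needs, so the
   caller can detect truncation by testing (result >= outsz). */
int join_into(char *out, size_t outsz, const char *a, const char *b)
{
    return snprintf(out, outsz, "%s%s", a, b);
}
```

The caller owns the buffer the whole time, so there is nothing to leak and nothing static to be clobbered by the next call.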

Thanks for your replies!
DGPickett, could you be more specific on these two points, and please comment on my code?

  1. just assign the char* to a identifying variable, and do not make a copy. What is the correct way?
  2. You cannot copy a char[] with = here: str3=strmerg(str1, str2); You destroyed the value of str3 placed there by malloc.
    Do you mean str3 = malloc(sizeof(char) * MAXLEN * 2); this line is not needed?
    Thank you!

1) Simplest thing in the world:

const char *str1 = argv[1];
const char *str2 = argv[2];

The limitation of this, of course, is that it's just a pointer pointing to the same memory as argv[1]. That doesn't matter since argv[1] isn't going to change and str1 won't be edited.

They are 'const' so that you can't edit them by accident, it'd be a compiler error (or a lot of intentional typecasting) to try. Any string parameters your functions take which don't get edited should be 'const' too, to signify this. For example:

STRCPY(3)                  Linux Programmer's Manual                 STRCPY(3)



NAME
       strcpy, strncpy - copy a string

SYNOPSIS
       #include <string.h>

       char *strcpy(char *dest, const char *src);

'src' for strcpy will accept both constant and non-constant strings because of this, but if you try to put a constant string into 'dest', it will cause a compiler error. This is better than a crash later.
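
A tiny sketch of the same idea in your own code, assuming a helper that only reads its argument (count_char is just an illustrative name):

```c
#include <stddef.h>

/* Counts how many times c occurs in s.
   s is only read, so it is declared const: the function then accepts
   string literals, argv entries and writable buffers alike. */
size_t count_char(const char *s, char c)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (*s == c)
            n++;
    return n;
}
```

Inside the function, a line such as s[0] = 'x'; would be rejected by the compiler, which is exactly the protection const is there to give.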

Thanks Corona688!

Got the idea to use const char * for my case. However, after I changed the two lines,

char *str1 = argv[1]; 
const char *str2 = argv[2];

my code was compiled without error/warnings, but did not give any result.

Input strings are: 
Seq1=
Seq2=
Concatenated string is: Seq_merged=

I understand the direct assignment of *str1 = argv[1], *str2 = argv[2] to pass two pointers. There must be subtle things here I have missed.

Problem:

char *str1 = argv[1]; 
const char *str2 = argv[2];

Why is that one not const?

Remember, programming isn't about "fixing compiler errors". If you get one, think about what it's telling you.

I thought str1 is the destination and will be modified, i.e. appended with str2. And using two const char * I got warnings as:

prog009c2.c:73:2: warning: passing argument 1 of 'strmerg' discards 'const' qualifier from pointer target type [enabled by default]
prog009c2.c:21:7: note: expected 'char *' but argument is of type 'const char *'

Is it? I didn't think it changed, but that may be my mistake. In any case, you have to think about the code I give you, not blindly use it -- especially not make blind changes just to "fix compiler errors". Can you tell me why I said to make these variables const? And how I told you to deal with those warnings?

As for question 2 -- you're back at square one with pointers again. What, exactly, have you assumed that str3 = ... is going to do?

My understanding of "why they are const" is that it prevents modifying the parameters by accident. For my strmerg(char *dest, char *src) function, the original design is that dest and src are interchangeable, i.e. src can be appended to dest and vice versa.
I did not know at the beginning that const char * is a restriction to avoid editing the parameters.
True, at this moment I am trying to get past the stage of "making blind changes ... to fix the compiler error". When you asked me "why I said to make these variables const", I thought I understood what you were saying, but not really. Should I give up on C now? Thank you!

I declared it 'const' for a reason -- because if you try to modify them it will not work.

The compiler error happened because you tried to modify them.

Blindly removing the 'const' to fix the compiler error didn't make it work, just made the compiler error go away.

I am writing another in-depth explanation of pointers. Patience please.

I am dumping raw memory contents to show you what happens when you declare a variable on the stack, and allocate memory with malloc.

For the purposes of this, you can ignore the contents of the 'printpage' and 'spew' functions, they're convenience functions I made to dump memory and are not related otherwise.

#include <stdio.h>
#include <unistd.h>
#include <string.h>
#include <stdlib.h>

void printpage(void *pointer, FILE *fout) {
  unsigned long p=(unsigned long int) pointer;
  unsigned long size=~(unsigned long)(getpagesize()-1);
  unsigned char *pp;
  p &= size;
  pp=(unsigned char *)p;
  fwrite(pp, getpagesize(), 1, fout);
}

void spew(const char *msg, const void *ptr) {
  FILE *fp=popen("hexdump -C", "w");
  unsigned long size=(unsigned long)(getpagesize()-1);
  printf("%7s @ %08lx(%08lx+%08lx)\n", msg,
        (unsigned long)ptr,
        ((unsigned long)ptr)&(~size),
        (unsigned long)ptr&size);
  fflush(stdout);
  printpage((void *)ptr, fp);
  pclose(fp);
  fflush(stdout);
  printf("\n");
  fflush(stdout);
}

int main(int argc, char *argv[])
{
  char *mem1=malloc(16);
  char *mem2=malloc(16);
  char *mem3=malloc(16);
  char *mem4=malloc(16);
  mem1[0]='A';  mem2[0]='B';    mem3[0]='C';    mem4[0]='D';

  printf("mem1=%p mem2=%p mem3=%p mem4=%p\n", mem1, mem2, mem3, mem4);

  spew("heap", mem1);
  spew("stack", &mem1);

  // Now, what happens if I do mem1=mem3 ?
  mem1=mem3;
  spew("heap after", mem1);
  spew("stack after",&mem1);
  return(0);
}

Remember that memory can be considered to be one gigantic array of bytes, from address 00000000 all the way up to ffffffff (on a 32-bit machine). Pointers are just array indexes inside this array.

So we allocate four blocks of 16 bytes, set their first element to something, and dump memory to see where they ended up:

mem1=0x085ef008 mem2=0x085ef020 mem3=0x085ef038 mem4=0x085ef050
   heap @ 085ef008(085ef000+00000008)
00000000  00 00 00 00 19 00 00 00  41 00 00 00 00 00 00 00  |........A.......|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 19 00 00 00  |................|
00000020  42 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |B...............|
00000030  00 00 00 00 19 00 00 00  43 00 00 00 00 00 00 00  |........C.......|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 19 00 00 00  |................|
00000050  44 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |D...............|
00000060  00 00 00 00 b1 00 00 00  84 2c ad fb 00 00 8c b7  |.........,......|
00000070  00 00 8c b7 00 00 8c b7  00 00 8c b7 00 00 8c b7  |................|
00000080  00 10 8c b7 00 00 8c b7  00 10 8c b7 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 60 45 8b b7  |............`E..|
000000a0  04 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  08 f1 5e 08 ff ff ff ff  ff ff ff ff 00 00 00 00  |..^.............|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 36 8b b7  |.............6..|
00000100  5f 74 00 00 00 00 00 00  01 00 00 00 01 00 00 00  |_t..............|
00000110  c0 e6 76 b7 f1 0e 02 00  00 00 00 00 00 00 00 00  |..v.............|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000
(continued)

...but what about mem1 itself? Something, somewhere, has to remember that address of 0x85ef008, yes? And it does, on the "stack", which is a big block of memory which the processor uses as temporary space.

There's actually quite a lot of stuff on the stack. It's not just us that's using it. Every time you call a function, it uses stack to pass arguments, create local variables, and remember where to return. It gets used so much, in fact, that you can't trust that a local variable doesn't contain previously-used garbage values unless you set it to something else yourself.

  stack @ bf8596b0(bf859000+000006b0)
00000000  81 cb 7a b7 c0 44 8b b7  84 88 04 08 02 00 00 00  |..z..D..........|
00000010  00 00 00 00 88 88 04 08  05 00 00 00 01 00 00 00  |................|
00000020  c0 ff ff ff c0 ff ff ff  c0 ff ff ff 54 bb 7a b7  |............T.z.|
<lots of garbage snipped>
000006a0  50 f0 5e 08 f4 3f 8b b7  c8 96 85 bf c9 87 04 08  |P.^..?..........|
000006b0  08 f0 5e 08 20 f0 5e 08  38 f0 5e 08 50 f0 5e 08  |..^. .^.8.^.P.^.|
000006c0  b0 87 04 08 d0 84 04 08  28 97 85 bf b5 5b 78 b7  |........(....[x.|
000006d0  01 00 00 00 54 97 85 bf  5c 97 85 bf 98 2a 8e b7  |....T...\....*..|
000006e0  b0 26 8c b7 01 00 00 00  01 00 00 00 00 00 00 00  |.&..............|
000006f0  f4 3f 8b b7 b0 87 04 08  d0 84 04 08 28 97 85 bf  |.?..........(...|
...
(continued)

It's there all right. Slightly out-of-order, but it's there. That's the order an x86 processor handles all numbers, nothing weird. Humans are weird in wanting it highest-to-lowest digit order instead of something easily mechanically processable, which is why we have printf to handle that job for us.

Now, we want to put a different string into mem1. What will the statement 'mem1=mem3' do?

heap after @ 085ef038(085ef000+00000038)
00000000  00 00 00 00 19 00 00 00  41 00 00 00 00 00 00 00  |........A.......|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 19 00 00 00  |................|
00000020  42 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |B...............|
00000030  00 00 00 00 19 00 00 00  43 00 00 00 00 00 00 00  |........C.......|
00000040  00 00 00 00 00 00 00 00  00 00 00 00 19 00 00 00  |................|
00000050  44 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |D...............|
00000060  00 00 00 00 b1 00 00 00  84 2c ad fb 00 00 8c b7  |.........,......|
00000070  00 00 8c b7 00 00 8c b7  00 00 8c b7 00 00 8c b7  |................|
00000080  00 10 8c b7 00 00 8c b7  00 10 8c b7 00 00 00 00  |................|
00000090  00 00 00 00 00 00 00 00  00 00 00 00 60 45 8b b7  |............`E..|
000000a0  04 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000b0  08 f1 5e 08 ff ff ff ff  ff ff ff ff 00 00 00 00  |..^.............|
000000c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000d0  ff ff ff ff 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000e0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
000000f0  00 00 00 00 00 00 00 00  00 00 00 00 00 36 8b b7  |.............6..|
00000100  61 74 00 00 00 00 00 00  01 00 00 00 01 00 00 00  |at..............|
00000110  c0 e6 76 b7 f1 0e 02 00  00 00 00 00 00 00 00 00  |..v.............|
00000120  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

It looks like -- absolutely nothing. Except... What happened on the stack?

stack after @ bf8596b0(bf859000+000006b0)
00000000  81 cb 7a b7 c0 44 8b b7  84 88 04 08 02 00 00 00  |..z..D..........|
00000010  00 00 00 00 88 88 04 08  05 00 00 00 01 00 00 00  |................|
00000020  c0 ff ff ff c0 ff ff ff  c0 ff ff ff 54 bb 7a b7  |............T.z.|
<lots of garbage snipped>
000006a0  50 f0 5e 08 f4 3f 8b b7  c8 96 85 bf c9 87 04 08  |P.^..?..........|
000006b0  38 f0 5e 08 20 f0 5e 08  38 f0 5e 08 50 f0 5e 08  |8.^. .^.8.^.P.^.|
000006c0  b0 87 04 08 d0 84 04 08  28 97 85 bf b5 5b 78 b7  |........(....[x.|
000006d0  01 00 00 00 54 97 85 bf  5c 97 85 bf 98 2a 8e b7  |....T...\....*..|

If mem1 is a pointer to a string, mem1=... does not modify the string. It alters the pointer.
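
The same point as a small self-contained sketch, using two local arrays instead of malloc'd blocks to keep it short. The function name pointer_vs_contents is made up; it returns 1 if both observations hold:

```c
#include <string.h>

int pointer_vs_contents(void)
{
    char a[16] = "AAAA";
    char b[16] = "BBBB";
    char *p = a;

    p = b;            /* changes where p points; the bytes in a are untouched */
    int a_untouched = (strcmp(a, "AAAA") == 0);

    p = a;            /* point back at a ... */
    strcpy(p, b);     /* ... and now change the bytes p points at */
    int a_changed = (strcmp(a, "BBBB") == 0);

    return a_untouched && a_changed;
}
```

Assignment with = only moves the pointer; it takes strcpy(), memcpy() or element-by-element writes to change the string itself.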

Also: free() requires the exact same pointer that malloc() gave you. If you give it a different pointer, even a slightly different one, the behavior is undefined and it will usually crash.

If you give it the same pointer twice, it will crash. (Which means, if we did free() on all our pointers, our program would crash right now.)

If you write beyond the end of the memory you allocated, it will probably crash. (Because, as you can see in the dump, if you write beyond that you're stomping on top of something else).

We've had this conversation four or five times. You have a hard time telling when you are modifying the pointer instead of its contents. But the compiler can tell you that easily, so I have a suggestion:

Whenever you do char *mem=malloc(300); ...do this instead: char * const mem=malloc(300); This will make the mistake you keep repeating a compiler error -- "assignment of read-only variable". (It of course sets it, once, when you declare it. But thereafter it's considered 'fixed'.)

You are still free to modify its contents, like with strcpy(mem, originalstring) or mem[5]='Q' or *(mem+5)=37 or any other way you please. But if you try to alter the pointer that's an error. You can understand "assignment of read-only variable" to mean "whoops, I mixed up a pointer's value and its contents".

If you find yourself needing to do complicated circumlocutions to get around 'assignment of read only variable', you can be fairly sure you've taken a wrong turn somewhere.
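
A short sketch of that habit in use. The function name const_pointer_demo is only for illustration, and the offending assignment is left commented out so the rest compiles:

```c
#include <stdlib.h>
#include <string.h>

/* Returns 0 if everything behaved as described, -1 otherwise. */
int const_pointer_demo(void)
{
    char * const mem = malloc(32);   /* the pointer is fixed, the bytes are not */

    if (mem == NULL)
        return -1;

    strcpy(mem, "ACGT");             /* fine: modifies the contents */
    mem[1] = 'Q';                    /* fine: modifies the contents */
    /* mem = "other"; */             /* rejected: assignment of read-only variable */

    int ok = (strcmp(mem, "AQGT") == 0);
    free(mem);                       /* still the exact pointer malloc returned */
    return ok ? 0 : -1;
}
```

A side benefit is that mem is guaranteed to still hold malloc's original value when free() is called.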

When I saw your answers, I did not know what to reply; my feeling is a mixture of embarrassment (I know so little about C), disappointment (buggy code), discouragement (so much to learn) and maybe hopelessness (code that does not work). I am so far from catching the spirit of the C language compared with your grasp of it!
From your code I need to go back to bitwise operations and other things, not just my string merge anymore. I wish I could have my code corrected side by side with comments, so I would know the correct way in that situation.
Thank you very much for your time and effort, Corona688!
Yifangt

Don't worry, you can ignore everything but main() in my code like I said. They're not relevant to your problem -- they're for debugging, they print memory, I whipped them up to make those charts.

If you can start using 'char * const', that would really help I think. The compiler would catch that mistake -- just wouldn't let you do it. (Which is mostly what const is for, FYI -- a label to inform the programmer what they can and cannot do to a variable.)

You post such large programs that fixing them pretty much means rewriting them. I can update the first two lines, sure -- but if all the lines after it are written on false assumptions, that's not much help.

It'd help us if you showed your entire program again when you made changes.

I wonder if learning assembly language would help. The way pointers work is mostly because of how the CPU works. Seeing that can be revealing.

What I wanted is to concatenate two strings by joining the overlapping suffix of string1 and prefix of string2. The overlapping region is merged. Here is the entire program.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXLEN 4096                            //Line 5, at this moment no string longer than 4096 is allowed.

//strmerg was from: http://effprog.wordpress.com/2010/11/18/concatenation-of-two-strings-omitting-overlapping-string/
char *strmerg(char *dst, char *src)                    //Line 8
{
    size_t dstLen = strlen(dst);
    size_t srcLen = strlen(src);

    char *p = dst + dstLen + srcLen;            /* Line 13, Pointer to the end of the concatenated string */
    char *q = src + srcLen - 1;                    /* Line 14, Pointer to the last character of the src */
    char *r = dst + dstLen - 1;                    /* Line 15, Temp Pointer to the last character of the dst */
    char *end = r;                                /* Line 16, Permanent Pointer to the last character of the dst */
    *p = '\0';                                    /*terminating the concatened string with NULL character */

    while (q >= src) 
{                                                /*Copy src in reverse */
    if (*r == *q) {                                /*Till it matches with the src, decrement r */
        r--;
    } else {
        r = end;
        if (*r == *q) {
        r--;
        }
    }

    *p-- = *q--;
    }

    while (r >= dst)                            /*Copy dst, ending with r */
    *p-- = *r--;

    return p + 1;                                //pointer of string, i.e. the start of string
}

int main(int argc, char **argv)                    //Line 39
{
    const char *str1 = argv[1];                 //Line 41, Original two strings
    const char *str2 = argv[2];                    //Line 42, Original two strings
    char *str3;                                    //Line 43, resulting string

    str1 = malloc(sizeof(char) * MAXLEN);        //Line 45, allocate memory
    str2 = malloc(sizeof(char) * MAXLEN);        //Line 46, allocate memory

    if (argc != 3) {
    printf("Error! \nUsage: ./arg[0]=program argv[1]=string1 argv[2]=string2\n");
    exit(EXIT_FAILURE);
    }

    strcpy(str1, argv[1]);                        //Line 53, 
    strcpy(str2, argv[2]);                        //Line 54, 

    str3=strmerg(str1, str2);                    //Line 56,
    
    printf("Input strings are: \nSeq1=%s\nSeq2=%s\n", str1, str2);
    printf("\nConcatenated string is: Seq_merged=%s\n", str3);
    printf("\nConcatenated string is: Seq_merged=%s\n", strmerg(str1, str2));
    return 0;
}

Two questions:
1a) Syntax related: according to the replies, what is the right way to write Lines 8, 41, 42, & 43 (and probably within the strmerg() function, Lines 13, 14, 15 may need to change too)?

./prog ACGTGatatat  atatGTGTGTGT  
Input strings are: 
Seq1=ACGTGACGTGatatatGTGTGTGT       //Extra ACGTG
Seq2=atatGTGTGTGT

Concatenated string is: Seq_merged=ACGTGatatatGTGTGTGT
Concatenated string is: Seq_merged=ACGTGACGTGatatatGTGTGTGT           //Extra ACGTG

1b) Similar to 1a), syntax related: Lines 53, 54, & 56 should be changed accordingly, especially Line 56;
2) Algorithm related: if the repetitive overlapping region in src is longer than the one in dest, the merged string is NOT correct!

./prog ACGTGatat   atatatGTGTGTGT  
Input strings are: 
Seq1=ACGTGACGTGatatatGTGTGTGT       //Extra ACGTG
Seq2=atatGTGTGTGT

Concatenated string is: Seq_merged=ACGTGatatatatGTGTGTGT
Concatenated string is: Seq_merged=ACGTGACGTGatatatatGTGTGTGT           //Extra ACGTG

This is a bug with the strmerg() function.
The extra ACGTG part in Seq1 seems related to the const char * str1 declaration, but I am not sure.

Thank you again!

I repeat: You've written your program around assumptions which don't hold water -- like the very idea of "str3" being a separate entity from "str1" and "str2" just because you threw pointers to the same memory into a different function. You're still modifying the memory pointed to by str1 whenever you do that!

Also, you are calling strmerg twice before you print its output -- considering str1 gets modified each time, I'm not surprised its output is doubly strange.

I suggest rewriting it from scratch with this new knowledge in mind, using that pointer suggestion I told you about so you're forced to break these habits.

Since all you need to reconstruct the last try is the length of str1 you used, it is easy to reconstruct the winner even though you need to go too far to find the limit. You might even use bisection to find the right number! If str1 has 6 bytes, try 3, then 1 or 5, then 0, 2, 4, 6 to find the highest substring that works. Use the shorter length of str1 and str2, as the max overlap is that length and the min is always 0.
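
One caveat worth checking before relying on bisection: it assumes that if an overlap of length k works, every shorter length works too, and with repetitive sequences that does not seem to hold. A minimal sketch of the per-length test (is_overlap is a hypothetical helper) makes it easy to try:

```c
#include <string.h>

/* True (1) if the last k characters of a equal the first k characters of b. */
int is_overlap(const char *a, const char *b, size_t k)
{
    size_t alen = strlen(a), blen = strlen(b);

    if (k > alen || k > blen)
        return 0;                           /* k is longer than one of the strings */
    return memcmp(a + alen - k, b, k) == 0; /* k == 0 always matches */
}
```

For a = ACGTGatatat and b = atatCCGTGTGTGT from earlier in the thread, lengths 4 and 2 match but 3 and 1 do not, so the valid lengths have gaps. Under that observation, a plain scan from min(strlen(a), strlen(b)) down to 0 that stops at the first match looks like the safer way to find the longest overlap.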

It helps in C to know what is going on, at least in a sensible model. Imagine a char* in a 32 bit system is a 4 byte unsigned integer offset from the bottom of memory. Values in the environment are in the heap below free memory. The heap grows up like stalagmites. Subroutine arguments and auto variables are on the stack at the top of memory growing down like stalactites. When you call, automatics may have whatever old RAM content in them, so write before you read. When you exit a subroutine, the stack pointer rises above the automatics and passed variables, and the space is reused/overwritten on the next call.

When you create a static or global variable, the compiler/linker allocates it on the heap. When you malloc, that is on the heap, too. Since you can free, someone has to keep track of the holes and try to reuse them. Fancy items like structs and such may be allocated on a mod 4 or 8 address, for speed. Some CPUs need aligned variables -- 4 byte integers have to have mod-4 addresses. So, space can be wasted when little and big items are mixed.

If you mmap() a file region, address space is allocated, but rather than setting it up to swap to swap space, it is tied to the file. Some people do not like calling everything at the bottom 'the heap'. Dynamically linked code is mmap'd into several places. It might have initialized variables and constants with it, which are put in different areas, sometimes because code is on executable pages but data may be not executable, even not writable. If you run a command under truss, tusc or strace, you can see all this going on -- very educational.
