Help on digestion of C header files for a short program.

I found a very short but very efficient program to handle big sequence file (>30GB), but could not understand it.

Wrote to the author but no reply, probably because the program needs comprehensive knowledge of C.
Can any C expert "walk me thru" the two header files (attached) before I go to the program (not attached). I know it is quite tedious to explain line by line, but I felt desperate with C. Hesitated for some time, can't hold asking for help here. Thanks a lot in advance!

What they're doing is a mess of defines to declare different structures depending on what types and sizes you ask for, to make an efficient hardcoded hashtable. It's like C++ templates done by hand with nothing but messes of defines calling defines calling defines.

The header includes an example which I will try and comment and clarify for you.

#include "khash.h"

// Declares a hash map with integer keys and character values.
// "32" seems to be a name, not a size.  It'd actually become kh_struct_something_something_32 or such.
KHASH_MAP_INIT_INT(32, char)

int main() {
        int ret, is_missing;
        khiter_t k;
        // Declares a pointer to a hash table, and allocates it
        khash_t(32) *h = kh_init(32);
        // Insert key 5 into the table, get index in k
        k = kh_put(32, h, 5, &ret);
        // Set index k to value 10
        kh_value(h, k) = 10;
        // Look up key 10 in the hash table
        // We inserted key 5, so it ought to be missing.
        k = kh_get(32, h, 10);
        // If it's missing, it will equal kh_end.         
        is_missing = (k == kh_end(h));
        // Loopk up key 5, which we inserted earlier.       
        k = kh_get(32, h, 5);
        // Delete it.
        kh_del(32, h, k);
        // Loop from beginning to end.
        for (k = kh_begin(h); k != kh_end(h); ++k)
                // Set all existing values to 1.
                if (kh_exist(h, k)) kh_value(h, k) = 1;

        // Free the hash table.
        kh_destroy(32, h);
        return 0;
}

Thanks for the explanations!
Seems I understand your comments, but still quite abscure about the code----what/where are the functions behind.
1) in the khash.h file line 29: #include "khash.h" Can the header include itself?
2) There are so many backslash to escape the newline. Is that only the author's preference or must-do?
3) The functions or struct khiter_t, khash_t, khinit(), kh_put ... Where to find their declaration/prototype ?
4) I thought some of those functions may come from the C standard library, but the prefix kh_ seems not the case here, right? Thanks again!

All of that entire block of code is inside /* C-style comments */. It is not compiled, it is informational.

It's because he assembled these functions using text-replacement macros. If you don't want a macro to be on one huge line, you have to escape it with \.

khiter is just an integer. They name it khiter so you can tell the iteration part of the code easily apart from the rest.

khash_t is another macro. khash_t(32) becomes kh_32_t .

The type is actually declared by KHASH_MAP_INIT_INT(32, char)

They do it like this because the internals of the structure are different, depending on what you feed into KHASH_MAP_INIT_INT.

In the end it actually calls KHASH_TYPE(), which looks like this:

#define __KHASH_TYPE(name, khkey_t, khval_t) \
typedef struct { \
khint_t n_buckets, size, n_occupied, upper_bound; \
khint32_t *flags; \
khkey_t *keys; \
khval_t *vals; \
} kh_##name##_t;

...so after substitution you would end up with the structure

typedef struct {
khint_t n_buckets, size, n_occupied, upper_bound;
khint32_t *flags;
int *keys;
char *vals;
} kh_32_t;

kh_put is declared when you call KHASH_INIT, which calls KHASH_INIT2, which plunks down code like this:

SCOPE khint_t kh_put_32(kh_32_t *h, khkey_t key, int *ret) \
{ \
khint_t x; \
if (h->n_occupied >= h->upper_bound) { /* update the hash table */ \
if (h->n_buckets > (h->size<<1)) kh_resize_32(h, h->n_buckets - 1);
else kh_resize_##name(h, h->n_buckets + 1); /* expand the hash table */ \
} /* TODO: to implement automatically shrinking; resize() already support shrin$
{ \
khint_t inc, k, i, site, last, mask = h->n_buckets - 1; \
x = site = h->n_buckets; k = __hash_func(key); i = k & mask; \
if (__ac_isempty(h->flags, i)) x = i; /* for speed up */        \
else { \
inc = __ac_inc(k, mask); last = i; \
while (!__ac_isempty(h->flags, i) && (__ac_isdel(h->flags, i) || !__hash_equal($
if (__ac_isdel(h->flags, i)) site = i; \
i = (i + inc) & mask; \
if (i == last) { x = site; break; } \
} \
if (x == h->n_buckets) { \
if (__ac_isempty(h->flags, i) && site != h->n_buckets) x = site; \
else x = i; \
} \
} \
} \
if (__ac_isempty(h->flags, x)) { /* not present at all */       \
h->keys[x] = key; \
__ac_set_isboth_false(h->flags, x); \
++h->size; ++h->n_occupied; \
__ac_set_isboth_false(h->flags, x); \
++h->size; ++h->n_occupied; \
*ret = 1; \
} else if (__ac_isdel(h->flags, x)) { /* deleted */     \
h->keys[x] = key; \
__ac_set_isboth_false(h->flags, x); \
++h->size; \
*ret = 2; \
} else *ret = 0; /* Don't touch h->keys[x] if present and not deleted */ \
return x; \
} \

In other words, the reason this code is so convoluted is because it's a code generator. The header file doesn't contain functions, it contains macros that declare different functions depending on what you feed them.

None of this is standard. Except maybe SCOPE. Maybe. I'm not sure what that is.

Thank you so much! Now I understand more.
Can I say the code becomes very efficient (in speed) to handle large file because of those marcos?
I will read more and trying to catch the whole picture, hope I could make use of these two headers eventually. By the way, how much do I owe you? Wish you are close so that I can buy you beer after work.

It is a hash table, with all the benefits and problems that come with hash tables. It's very good at finding things based on keys. Whether it's "efficient at dealing with large files" depends entirely on how you use it and what for.

Whether it's better or worse than other hash tables is hard to say. Compilers can do neat things these days.

As for using it in your own code, see the example I first posted, which was included in the header file itself.

1 Like

I have some unclear catch for the following lines in the "khash.h" file:

#ifndef __AC_KHASH_H
#define __AC_KHASH_H
#define AC_VERSION_KHASH_H "0.2.6"

I understand #ifndef and #ifdef to avoid redundant inclusion of the headers/definition etc, but not sure the rules to name the content (here the __AC_KHASH_H) as I was recommended capitalized file name KHASH_H or __KHASH_H in general because the file is "khash.h". The author used __AC_KHASH_H. It seems to me anything is fine once it is unique. Did I miss anything? My unclear parts are:
1) What are the rules to name the content of #ifndefine, if any, in general for software developer?
2) Here AC_VERSION_KHASH_H "0.2.6" seems does not impact the program but for version record of the code, is this correct?
Thank you!

There are no rules (Except that it must be a valid name for a symbol), just traditions. He could have called it SLARTIBARTFAST and it would have worked.

The version does nothing but note the version. You can consider it an unused variable.

1 Like

The name #define d doesn't matter much as long as every header that you will be using together uses a different name. A simple convention like __HeaderName (with _ replacing . in the header's name [sometimes after converting the header's name to uppercase]) is a common convention using namespace that is reserved by the C and POSIX standards for system implementor's use (not application writer's use). If you are writing application code (not part of the system an implementation delivers to customers), you need to choose a naming convention outside of this namespace that is coordinated with code from any other third party vendors whose code may be used with your code. (And if you're writing code you expect other's to use with their own code, you need to document the convention you use so they can avoid naming conflicts when they use your code.) You won't be adhering to the standards, but if you look at your system's headers you can probably easily figure out the conventions used by your OS vendor and use a similar, but slightly different, convention that will be unilkely to conflict with future vendor updates.

The standards reserve namespaces for the standards and namespaces for system implementers; everything else is available to application writers. People who write 3rd party software and end users are both considered "application writers" by the standards. So if you are in that category, you are stuck with finding a naming convention that doesn't trample on the namespace reserved for standards and system implementors (or your code might start behaving in strange ways when the next update to the system software is installed) and is "unique" compared to the namespaces used by other 3rd party code writers and end users.

Talk to the experts at your employer for the language you're using and follow their coding conventions. (Most medium and large sized companies will have these conventions on-line and give new hire programmers a pointer to them as part of their first day on the job "Here's How We Do Things at XYZ Corp." training.)

I will disagree slightly with Corona688 on the version information. If something isn't working for you with code you get from a 3rd party, they may ask for that version number so that can figure out which version of their product you're using. If you expect to ever make changes to headers that you write that will affect programmers using your code, including version information in your headers may save you or your successors many headaches later. :wink:

1 Like

Thanks Don!
I am not a professional coder, and this code discouraged me very much to catch C by myself. I do not want to give up yet. In the original "kseq.h" header line 220-224, 226 and 228 - 233. After I removed the escape backslash the three #define become:

#define KSEQ_INIT2(SCOPE, type_t, __read) KSTREAM_INIT(type_t, __read, 16384) __KSEQ_TYPE(type_t) __KSEQ_BASIC(SCOPE, type_t) __KSEQ_READ(SCOPE)
#define KSEQ_INIT(type_t, __read) KSEQ_INIT2(static, type_t, __read)
#define KSEQ_DECLARE(type_t) __KS_TYPE(type_t) __KSEQ_TYPE(type_t) extern kseq_t *kseq_init(type_t fd); void kseq_destroy(kseq_t *ks); int kseq_read(kseq_t *seq);

For the first and third #define, are they just auto-expanding to eventually become single #define (like the second #define), or very specific/complicated reason here? This is my first time to see #define "multiple?" things in a single row. Thanks a lot!

Yes, they are using multiple defines. One invokes others when the program is compiled. They're just using one define to invoke a big mess of things.

Remember, they are text replacement. They get called when the program is compiled. Imagine this simpler example:

#define declare_add(TYPE)  type add_##type##(type a, type b) { return(a+b); }

When you use it with, say, int -- it plunks down this text:

int add_int(int a, int b) { return(a+b); }

...for you to call in your program later.

You are digging far, far, far deeper than needed, IMHO, you already have an example of how to use it. If you want to reverse engineer it from scratch, why not just make a new one?

1 Like

OK. I'm confused. When i use the macro definition above, the definition above with the TYPE converted to a single case, and the definition above with mixed case type but only the first ## , I get various complaints from the compiler.

If we're going to give yifangt a simple example, I think we need to provide one that compiles.

Shouldn't the macro be:

#define declare_add(type)  type add_##type(type a, type b) { return(a+b); }

or:

#define declare_add(TYPE)  type add_##TYPE(TYPE a, TYPE b) { return(a+b); }

(i.e., TYPE all uppercase or all lowercase and only one ## )?
The following code:

#include <stdio.h>
#include <stdlib.h>

#define declare_add(type)  type add_##type(type a, type b) { return(a+b); }
declare_add(int)
declare_add(float)
declare_add(double)

int
main(int argc, char *argv[]) {
        printf("%s + %s = %d\n%s + %s = %.15f\n%s + %s = %.15f\n",
                argv[1], argv[2], add_int(atoi(argv[1]), atoi(argv[2])),
                argv[3], argv[4], (double)add_float(atof(argv[3]), atof(argv[4])),
                argv[3], argv[4], add_double(atof(argv[3]), atof(argv[4])));
        return 0;

saved in a file named tester.c and built with gcc on Mac OS X using:

make tester

and invoked as:

tester 123 765 123.456e6 123.456e-10

produces the output:

123 + 765 = 888
123.456e6 + 123.456e-10 = 123456000.000000000000000
123.456e6 + 123.456e-10 = 123456000.000000014901161

which is what I would expect when adding the corresponding pairs of operands using types int, float, and double.

Note that the C standards say that atof() returns a double and the double will be cast to a float when it is passed to add_float() since add_float() is declared to take arguments of type float and return a float result. The float result is cast to a double because the operand to the printf() %f conversion specification is supposed to be a double (not a float). Some compilers are smart enough to cast varargs arguments to printf() to a type appropriate for the given conversion specification, but the standards don't require that level of complexity. Portable applications should cast arguments to printf() to types appropriate for the conversion specification and not assume that a compiler will fix your code for you nor that a compiler will give you a warning or an error if you don't pass an argument of the appropriate type to a function that has varying numbers of arguments.

With Mac OS X's gcc, the C preprocessor will convert the macro calls above to:

int add_##int(int a, int b) { return(a+b); }
float add_##float(float a, float b) { return(a+b); }
double add_##double(double a, double b) { return(a+b); }

and a later phase of the compiler will convert that to:

int add_int(int a, int b) { return(a+b); }
float add_float(float a, float b) { return(a+b); }
double add_double(double a, double b) { return(a+b); }

Note that I fully agree with Corona688 that taking a look at very complex macros and typedefs in a header is not an easy way to learn C. You need to start with a more basic tutorial on the language working through examples that you can easily understand. Then work yourself up to more complex constructs when you need them as you write C code to do tasks you want to perform.

If you're going to use this system of headers, libraries, and utilities, you should look at the documentation on how to use the tools that are provided rather than digging through the headers trying to figure out how it is implemented. (It may be that the examples in the headers are a major part of the documentation, but that doesn't mean you need to understand how the macros work to figure out how to use them.)

If you look at the system headers on a Mac OS X or a Solaris system, there is a good chance that you'll feel lost trying to figure out what all of the #ifdef's are doing and what all of the __name functions and variables mean. They are there to allow a single header to be used on a variety of CPU architectures; providing various programming environments with different sizes of integers, sizes of pointers, maximum sizes of file offsets; and to conform to various revisions of various standards. They are meant to be used as a black box to work as specified by the environment you tell the compiler you want to use in your program. If you want to know what the header is supposed to provide to you, look in the standard that defines the programming environment you want to use or in your system's man pages. Don't assume that they should be understood by novice C programmers. (Many expert C programmers can need weeks of concerted study to learn how some of these system headers work.)

3 Likes

Thank you so much, Don and Corona668!
I really appreciate your explanation. I was trying catch the details of C. The header files were chosen as I could not understand them at all when I found them, which are not suitable for study but to start with. Next I will try to learn how to use them, exactly as suggested by Corona688. Besides the speed, another reason I was trying to go thru them is to modify them for some scenarios of my use---I am always warned "Do not re-invent the wheel!" Of course! It would be a long way to catch C to a novice level. I have learned a lot from this post, and thanks a lot again!

There are plenty of reasons to re-invent the wheel. It helps you understand what you're looking at, for one thing.

It is a library, so hopefully you shouldn't need to modify it too much for your own use. The included example code would be a better place to start than the innards...