Segment fault for C++ program when return vector

I am trying to reverse complement DNA sequence (string) with a short c++ code using boost library. Code was compiled without any warning/error, but ran into Segmentation fault.
My guess is the function to return a vector, but not sure.

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/assign.hpp>

using namespace std;
using namespace boost::assign;

vector <char> revcomp(string str) 
{
vector <char> str_rc;
map <char, char> m = map_list_of('A','T')('T','A')('G','C')('C','G');

for (unsigned i = str.size(); i >= 0 ; --i)
{
str_rc.push_back(m[str]);        //get all the chars except str[0]
}
str_rc.push_back(m[str[0]]);        //get the first char 

return str_rc;
}

int main()
{
vector <char> str2;
string str1 = "ATCGTTTCCC";
str2 = revcomp(str1);

vector<char>::iterator itr2;
for (itr2 = str2.begin(); itr2 != str2.end(); ++itr2)
{
cout << *itr2;
}
cout << endl;

return 0;
}
$ g++ -Wall revcomp.cpp 
$ ./a.out
Segmentation fault

Can anyone look into it? Thanks a lot!

This is the error:

for (unsigned i = str.size()-1; i >= 0 ; --i)

An unsigned number is never going to be less than zero, making this an infinite loop. Some compilers will warn you about this.

That said, this code looks suspect in a number of ways. Why are you using strings as the input and vector as the output -- why not vector or string for both? Why return giant vectors since it means it'll be copying them again and again and again and again and again? Pass one in as a reference.

Above all, why use boost? You will be chained to that depleted uranium lawn gnome forever. There's plenty of ways to initialize a map.

const char *i[]={"AG", "TC"};
for(int n=0; i[0][n]; n++)  { m[i[0][n]]=m[i[1][n]]; m[i[1][n]]=m[i[0][n]]; }

Or don't even use a map at all:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

static const unsigned char map[256]={
        ['A']='T',      ['T']='A',      ['G']='C',      ['C']='G'       };

int main(int argc, char *argv[])
{
        int n;
        const char *input="ATCGTTTCCC";
        char *output=strdup(input);

        for(n=0; input[n]; n++) output[n]=map[input[n]];

        printf("%s = %s\n", input, output);
}

Vectors and maps have made this problem more difficult, not less.

1 Like

LOL

Above all, why use boost?
1) It's simply a learning trial by "copy-paste" and still at stage of "try and error" to understand more, hopefully;
2) The purpose is to make a function which will be combined to other .cpp code. Modified from a snippet online as was told "ultra-fast"!

// http://www.bashthat.com/2014/09/30/reverse-complement-response.html
#include <boost/assign.hpp>

using namespace std;
using namespace boost::assign;

//compile with g++                                                          

int main( int argc,const char *argv[] )
{
  string str = argv[1];
  map<char, char> m = map_list_of ('A', 'T') ('T', 'A') ('G', 'C') ('C', 'G');

  string::iterator it;
  for ( it = str.begin() ; it != str.end(); ++it )
  {
    cout <<  m[*it] ;
  }
  cout << endl;
}

Am I stubborn and offending if I insist C++, as I need modify your code to reverse the output and then integrate it into other .cpp code?
Thanks a lot in any way!

You are learning from suspicious sources. It is not "fast" to create and initialize a new map every time you call that function. It is especially not fast to return a vector by value.

Frankly?

My "pure C" code is one cout away from being "pure C++" code. That it looks so mysterious means you've skipped learning big, important chunks of both. It's like trying to avoid using variables of type 'int', or doing everything by calling system("this"); system("that");

Learning how arrays, strings, structures, and pointers actually work will explain almost everything that's baffling you right now. Then you'll be equipped enough to actually troubleshoot your problems and learn more, and take proper advantage of the standard template library -- most of its design imitates them, after all. Maps and vectors imitate arrays, map[index]. Iterators imitate pointers to arrays(and crash all the same ways as them): iterator->content, iterator++... If you don't understand those, you'll never know what you're looking at.

And STL strings imitate real strings (which are themselves, arrays). Notice how little really changes when I substitute one for the other.

#include <iostream>
#include <string>

using namespace std;

unsigned char m[256];

int main() {
        int n;
        string input="ATCGTTTCCC";
        string output=input;
        m['A']='T';     m['T']='A';     m['G']='C';     m['C']='G';

        for(n=0; n<input.length(); n++) output[n]=m[input[n]];

        cout << input << " " << output << "\n";
        return(0);
}
1 Like

Learning how arrays, strings, structures, and pointers actually work will explain almost everything that's baffling you right now.
I am very aware of that, which reminds me of my early posts in this forum. Learning by "practice" related to real problem is better, like this one and the strtok() function ......
Am I stubborn and offending if I insist C++
As I have the same feeling like most of the beginners that sometimes C++ is easier to catch than C, and I am not able to mix C with C++ code together in one program at this moment. Neither do I want get lost when trying to solve one problem, so that I asked for C++.
OK, here is a modified version of the code, and I hope it is a improved version from "professional views"

#include <iostream>
#include <string>
#include <algorithm>

using namespace std;
string revcomp(string);

int main(int argc, char *argv[]) 
{
string str = argv[1];
string str2 = revcomp(str);

cout << str << " = " << str2 << endl;

 return 0;
}

string revcomp(string input)
{
unsigned char m[256];
int n;
string output = input;

m['A'] = 'T'; m['T'] = 'A'; m['C'] = 'G'; m['G'] = 'C';
m['a'] = 'T'; m['t'] = 'A'; m['c'] = 'G'; m['g'] = 'C';

for (n = 0; n < input.length(); n++)
{
output[n] = m[input[n]];
}
string result = output;
reverse(result.begin(), result.end());

return result;
}
$ g++ -Wall test.cpp 
$ ./a.out aCTGCTACCC
aCTGCTACCC = GGGTAGCAGT

Which is what I need.
Thank you so much again!

You have learned one way to avoid your blind spots, but your blind spots are still there.

Would you say you could program Perl if you just had line after line of system("echo hello world"), system("mv file1 file2"), system("echo $variable") ?

You understand integers and for-loops, and that's about all. Everything else you've just "shelled out" to an external library. Your programming comfort zone is very, very small.

If you don't recognize something, that doesn't make it "C". You've thrown out big parts of both languages along the way.

Labeling everything you don't understand as "C" and therefore "out of the curriculum" is part of the problem. You've found one C++ library you like and decided that must be what the language is all about. It's not. You've overlooked something important.

I seriously think you should try and program without STL vector, string, and map for a while.

1 Like

A better translation of my program:

#include <iostream>
#include <string>
#include <ctype.h>

using namespace std;

// Making it global is important if you want unused elements
// to default to 0 and not garbage!
unsigned char m[256];

// Passing by reference avoids making a copy of a copy of a copy.
void revcomp(const string &input, string &output) {
        int n=input.length(), p=-1;

        output=input;
        while(n > 0) output[++p]=m[toupper(input[--n])];
}

int main(int argc, char *argv[])
{
        string str = argv[1], str2;
        // Do this once, and you can call revcmp as many times as you want without repeating it.
        m['A'] = 'T'; m['T'] = 'A'; m['C'] = 'G'; m['G'] = 'C';

        revcomp(str, str2);

        cout << str << " = " << str2 << endl;

        return 0;
}
1 Like

Thanks!
Agree with what you said: I seriously think you should try and program without STL vector, string, and map for a while.
One thing new and I never found in any book I've read is Using char as subscript for char array m[256], m['A']; m['T']...;
Two more questions:
1) Any specific reason to have m[256] instead of m[4], which should be enough for this case?
2) in revcomp() function, what's the reason using output = input; . I would normally allocate new memory for output. (---That's the difference between pro and novice!)
Thanks!

What is a character but a smaller integer? :slight_smile: Single-character constants count as integers.

4 is not enough: printf("%d %d %d %d\n", 'A', 'T', 'C', 'G'); prints 65 84 67 71 You will note these correspond exactly to the ASCII character table.

The array is a look-up table, like a Caesar cipher, big enough to cover the entire ASCII range. This is very simple to do and lets you translate with a single instruction.

It's also one reason it should be a global variable. Global variables are guaranteed to have any contents you don't initialize, set to binary 0. Not that we're using any other contents, but if garbage gets in, it's reassuring to know garbage won't come back out.

The same reason I used strdup -- to make a new string of correct length. I don't really care what's in it as long as it's long enough. I think this is more efficient than calling string.push_back() 570 times.

But strings have a lot of properties of vectors, so I could also have done this:

void revcomp(const string &input, string &output) {
        int n=input.length(), p=-1;
        output.resize(n);
        while(n > 0) output[++p]=m[toupper(input[--n])];
}

...which is probably better, especially if input happens to be large. No point copying it then just overwriting it.

1 Like

Single-character constants count as integers. I realized that after I posted my question. So the subscripts are still integers but in CHAR format as 'A', 'C', 'G', 'T' corresponding to 65, 67, 71 & 84. Although the other elements of the array are empty (or, 0), the sacrifice is worth of it. Am I right?
... same reason I used strdup -- to make a new string of correct length. Got it now!
Thank you so much again!

They are integers in integer format. (I double-checked -- they are the same size as integers.) They're no more or less numbers than 39 or 0xc0 -- none of those are how the machine really thinks of these numbers, just convenient human-readable representations.

The compiled program actually has no distinction between characters and numbers. In scripting languages and such, it's sometimes difficult to get the ASCII value of a character. In C, the ASCII value is all you ever have. Or arrays full of ASCII values.

That's my opinion, yes. Since a map is stored as a list, it has to search one-by-one, which amounts to doing this:

char trans(char c) {
        if(c == 'C') return('G');
        else if(c == 'c') return('G');
        else if(c == 'G') return('C');
        else if(c == 'g') return('C');
        else if(c == 'A') return('T');
        else if(c == 'a') return('T');
        else if(c == 'T') return('A');
        else if(c == 't') return('A');
}

...Which doesn't look that fast to me.

I could make the array even faster by removing the toupper(), and just storing m['c']='G', etc in the array too.

The first program I showed you in this thread used designated initializers:

char m[256]={ ['A']='T' };

Which is especially nice since it lets you set the values in advance without having to tell it all 256 of them. But that turned out to not be C++ compatible so I dropped it.

1 Like

I'd recommend filling the array with ' ' (space) or '?' characters instead of zero values. A zero value will cause your output string to be prematurely terminated if you get garbage input.

1 Like

I'd recommend filling the array with ' ' (space) or '?' characters instead of zero values.
How to do it? Could you please post your code to fill up the rest 252 element of the array? Thanks!

memset(m, ' ', 256);
m['C']='G'; m['G']='C' etc.
1 Like