Find & replace text using two non-unique delimiters

I can find and replace text when the delimiters are unique. What I cannot do is replace text using two NON-unique delimiters:


"This html code <text blah >contains <garbage blah blah >. All tags must go,<text > but some must be replaced with <garbage blah blah > without erasing other info."
delimiter1: '<garbage'
delimiter2: '>'
replace with: 'important info'

delimiter3: '<'
delimiter4: '>'
replace with: ''

I get this:

This html code contains important info

And I want this:

This html code contains important info. All tags must go, but some must be replaced with important info without erasing other info.

The issue is that the program keeps finding the '>' that belongs to the '<text >' tag and uses it, instead of the '>' that belongs to '<garbage'.

In my real-world scenario, these tags are much more complicated: they contain a variety of text in between, differ in size, and have different endings. Also, certain tags must be deleted first, others second, and so on, so changing the order will not help.

I want to write code that understands that the '>' delimiter I want to use as the end position for the '<garbage' tag can only be the one that comes closest AFTER the '<garbage' tag (if it understands that, it cannot make a mistake), but I do not know how to do this. I have it working perfectly in an awk program, but not in C++. And I will not use Boost; in that case I'd rather just stick with awk.

Here is my code:

// Compile and run with:
// g++ -O -Wall replace.cpp -o replace


#include <iostream>
#include <string>

using namespace std;

string replaceText (string text, string tStart, string tStop, string tReplace)
{
	long int begPos;
	long int endPos;
	int found=1;

	while ((text.find(tStart) != std::string::npos) && (found == 1)) {

		found = 0;

		begPos = text.find(tStart);
		// This searches from the beginning of the string, so it can return
		// a tStop that lies BEFORE begPos (the problem described above).
		endPos = text.find(tStop);

		if (tStop != "") {
			text.replace(begPos, endPos - begPos + tStop.length(), tReplace);
			found = 1;
		} else {
			text.replace(begPos, tStart.length(), tReplace );
			found = 1;
		}

		// Used for testing to see positions of replaced text:
		std::cout << "Replacing from: " << tStart << " " << tStop << " at Start Pos: " << begPos << " Stop Pos: " << endPos << " with " << tReplace << " \n" << endl;
	}

	return text;
}

int main(int argc, char* argv[])
{
	string keyFound="This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info.";

	// Run this code twice: once with the below line of code commented, and once without:
	keyFound=replaceText(keyFound, "<garbage", ">", "important info");
	keyFound=replaceText(keyFound, "<", ">", "");

	std::cout << keyFound << endl;

	return 0;
}
I am not expecting a complete answer, but perhaps someone could point me to a resource that covers this. I've been looking all around and cannot find anything. Also, I am new to C++.

I understand that this is an incredibly complicated thing with no simple answer.

Thank you.

When trying to match the end of a tag with its start, you need to look for the entire tag in a single search. Since there are several tags on the line, the way you are searching for the end tag may well find a > that comes before <garbage within the text string that you are searching.

To match the string starting with <garbage and ending with the closest matching > after that, try a single BRE or ERE: <garbage[^>]*>

Woot! So easy to do having been told this! Thank you! :slight_smile:

#include <string>
#include <iostream>
#include <regex>
using namespace std;

int main(int argc, char * argv[]) {

	string test;

	test="This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info.";

	regex reg("<garbage[^>]*>");
	test = regex_replace(test, reg, "important info");

	cout << test << endl;

	return 0;
}

This html code <text blah >contains important info. All tags must go, <text > but some must be replaced with important info without easing other info.

Something tells me this regex function is going to be a lifesaver! :slight_smile:

Okay, now to do what I do best...

... sleep. :smiley:

We assume that you know that exactly the same thing works in awk:

echo "This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info." |
    awk '{sub("<garbage[^>]*>", "important info")}1'

to make a substitution for the first occurrence producing the output:

This html code <text blah >contains important info. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info.

and:

echo "This html code <text blah >contains <garbage blah blah >. All tags must go, <text > but some must be replaced with <garbage blah blah > without easing other info." |
    awk '{gsub("<garbage[^>]*>", "important info")}1'

to make a substitution for all occurrences producing the output:

This html code <text blah >contains important info. All tags must go, <text > but some must be replaced with important info without easing other info.

Good to see you found another improvement for your code. Applying what you learned in some of your other threads (gsub(tagIn "[^" tagOut "]*" tagOut, ""), post7, post2), you'd get what you requested in post #1 by setting tagIn first to <garbage, then to just <.

Yes, I found that out. How satisfying it is to just drag and drop the regex parameters into the C++ code and have them work! :slight_smile:

The code has been updated, but please do not feel obligated to respond, though I do very much appreciate and welcome the advice of all of you! There is no urgent desire to fix anything; I'm just 'putting it out there.' :slight_smile:

If anyone would like to peruse and comment, they are welcome to:

// This program parses an XML dictionary file and prints a formatted result.
// NOTE: The required XML dictionary (16mb) will be downloaded to this
//       machine if it is not found! It will be stored in: ~/.config/latin/
// The goals of this project:
//	1. < 100 lines code
//	2. Simple & elegant coding
//	3. Fast & efficient execution.
//		"Do one thing,
//		 and do it well."
//		-- Linux Credo
// Compile with:
// $ g++ -O -Wall lat.cpp -o lat
// Run with:
// $ lat amo sum totus
// Where 'amo', 'sum', and 'totus' are the words to be searched
// Gather online possibilities and pipe output into 'less'
// ('latc' script required for this functionality!!!):
// $ lat $(latc quam totus amor)
// Where 'quam', 'totus', and 'amor' are your search terms
// For testing. Completely clear terminal to not confuse with other text.
// $ reset; g++ -O -Wall lat.cpp -o lat; sleep 2; lat amo sum totus | less


#include <iostream>
#include <fstream>
#include <string>
#include <regex>
#include <cstdlib>	// system()
#include <unistd.h>	// getuid()
#include <pwd.h>	// getpwuid()

using namespace std;

int main(int argc, char* argv[])
{
	// No search term entered. Bye!
	if (!argv[1]) return 1;

	std::string line;					// Used for file input
	std::string charToStr(argv[1]);				// Cannot use char with strings
	std::string keyStart	("key=\"" + charToStr + "\"");	// Key tags which word in XML file is surrounded
	std::string keyEnd	("</entry>");
	std::string text;
	struct passwd *pw = getpwuid(getuid());			// Set up to get ~/
	std::string homeDir = pw->pw_dir;
	std::string XMLfile	(homeDir + "/.config/latin/Perseus_text_1999.04.0060.xml");
	std::string XMLfileDlURL="";

	//ifstream myFileTest (XMLfile);
	ifstream myFile(XMLfile);

	// Download dictionary if not found
	if (!myFile.is_open()) {

		std::cout << "\nNote: The XML dictionary file " << XMLfile << " has not been found.\n\nDownloading and preparing XML file...\n\n";

		string dlCmd=("mkdir -p " + homeDir  + "/.config/latin/ && cd " + homeDir + "/.config/latin/ && wget -O- " +  XMLfileDlURL  +  " | tr -d '\\r' > " +  XMLfile);

		// system() won't accept a string
		const char * sysCharCmd = dlCmd.c_str();
		system(sysCharCmd);

		// Check again to see if the file was created and can be found
		myFile.open(XMLfile);
		if (!myFile.is_open()) {
			std::cout << "Could not download or find file!\n\nExiting...\n\n";
			return 2;
		} else {
			std::cout << "Finished downloading!\n\nRestart program to use new dictionary.\n\n";
			return 0;
		}
	}

	// Go through all given keys from command line parameters
	for (int keyNum = 1; keyNum < argc; keyNum++) {
		charToStr=argv[keyNum];				// Convert char* to std::string
		keyStart="key=\"" + charToStr + "\"";
		text="";					// Do not append text

		myFile.clear();					// Go to beginning of file
		myFile.seekg(0, ios::beg);

		// Find search key and save result in 'text' string
		while (getline (myFile,line) && text == "")
			if (line.find(keyStart) != std::string::npos)	// We found a key!
				do					// Grab keys text
					text += line;
				while (getline (myFile,line) && line.find(keyEnd) == std::string::npos);

		// Don't waste time; go to next iteration!
		if (text == "") {
			std::cout << "Search key '" << charToStr << "' not found.\n" << endl;
			continue;
		}

		/* User may want to define an entire paragraph of words
		   at one time, so do string modification right after
		   each key to allow first results to be shown instantly. */

		// Replace regex pattern in slot #1 with the text in slot #2.
		// \u2014 (em dash) is assumed wherever the posted code showed an unreadable character.
		std::string tReplace[] = {
			"<orth>", "[",
			"</orth>", ",",
			"</gen>", ".",
			"<sense id.*><etym lang=\"la\" opt=\"n\">", "[",
			"<etym lang=\"la\" opt=\"n\">", "[",
			"</etym>, <trans opt=\"n\">|</etym>\\.\u2014", "]\n\n \u2014 ",
			"(</etym>\\. \u2014</sense>|</etym>\\.)", "]",
			"</etym>\\. </sense>", "",
			"(\\.|</usg>) ?\u2014 ?</sense>", ".",
			"<sense[^>]*>", "\n\n",
			"<[^>]*>", "",
			" \u2014 ", "\n\n \u2014 ",
			"\\. ?+\u2014", ".\n\n \u2014 ",
			" +", " ",
			". ?\u2014", "\n\n",
			" ,", ",",
			" \\.", ".",
			" :", ":",
			"\u2014 ", "\u2014",
			" \u2014", "\u2014",
			"^ ", "",
			"\\( ", "(",		// Replacement text is not a regex, so no escaping
			" \\)", ")"
		};

		// Now manipulate that text string and make it pretty.
		signed int repSize = (sizeof(tReplace) / sizeof(tReplace[0]));
		for (signed int i = 0; i < repSize; i += 2) {
			regex reg(tReplace[i]);
			text = regex_replace(text, reg, tReplace[i + 1]);
		}

		// Give lots of space to easily distinguish between definitions
		std::cout << text << "\n\n\n";
	}

	return 0;
}
