I have a huge text file, about 52 GB. In the file, there are patterns like these:
[[Renaissance humanism|renaissance]]
[[Taoism|Taoist]]
[[Foundation for Economic Education]]
[http://www.theanarchistlibrary.org/HTML/Daniel_Guerin__Anarchism__From_Theory_to_Practice.html#toc2 ''Anarchism: From Theory to Practice'']
[[self-governance|self-governed]]
As you can see, text appears inside both [] and [[]] patterns, and I am only interested in [[]]. There is also text before and after these patterns, for example:
'''Anarchism''' is a [[political philosophy]] that advocates [[self-governance|self-governed]] societies with voluntary institutions.
My aim is to replace every space with a hyphen in the text enclosed within [[]]. For example, the expected output is:
[[Renaissance-humanism|renaissance]]
[[Taoism|Taoist]]
[[Foundation-for-Economic-Education]]
[http://www.theanarchistlibrary.org/HTML/Daniel_Guerin__Anarchism__From_Theory_to_Practice.html#toc2 ''Anarchism: From Theory to Practice'']
[[self-governance|self-governed]]
I have written a C program (pasted below) to solve this problem, but there seem to be two issues with it: first, it does not place the hyphens correctly, and second, it seems to crash when it encounters a non-ASCII character. Is there a way to solve the same problem using sed, awk, or a similar tool in Bash? I would rather move to a regular-expression tool because, even if I spend a lot of time editing and perfecting this code, it might still crash on non-ASCII characters, and I have found it difficult to strip all of them from the file.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    FILE *text_file = NULL;
    text_file = fopen(argv[1], "r");
    if (text_file == NULL)
    {
        fprintf(stderr, "file open error\n");
        exit(1);
    }
    char ch = '\0';
    while (!feof(text_file))
    {
        ch = fgetc(text_file);
        printf("%c", ch);
        if (ch == '[')
        {
            ch = fgetc(text_file);
            if (ch == '[')
            {
                ch = fgetc(text_file);
                while (ch != ']')
                {
                    printf("%c", ch);
                    if (isspace(ch))
                    {
                        ch = '-';
                    }
                }
            }
            else
            {
                ungetc(ch, text_file);
            }
        }
    }
    fclose(text_file);
    return (EXIT_SUCCESS);
}
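For comparison, here is a minimal sketch of how the same idea could be restructured as a small state machine. In the program above, the inner while loop never reads another character, the second '[' is never printed, and ch is a plain char, so bytes above 0x7F can reach isspace() as negative values (undefined behaviour, which may explain the crash). The sketch keeps each byte in an int and only compares it against ASCII brackets, space and newline, so multi-byte UTF-8 sequences pass through unchanged. The names link_state, link_step and hyphenate_links are made up for illustration. It assumes that links do not nest, that a link never spans a newline (the state resets at each newline so that an unclosed [[ cannot affect the rest of the file), and that every space between [[ and ]] should be replaced, including any after the | character.

```c
#include <stdio.h>

/* Position of the scanner relative to a [[...]] link (hypothetical names). */
enum link_state { OUTSIDE, ONE_OPEN, INSIDE, ONE_CLOSE };

/* Feed one byte; returns the byte to emit and updates *st. */
int link_step(enum link_state *st, int c)
{
    if (c == '\n') {            /* assumption: links never span lines */
        *st = OUTSIDE;
        return c;
    }
    switch (*st) {
    case OUTSIDE:               /* plain text, or inside a single [...] */
        if (c == '[')
            *st = ONE_OPEN;
        break;
    case ONE_OPEN:              /* seen one '[', need a second to open a link */
        *st = (c == '[') ? INSIDE : OUTSIDE;
        break;
    case INSIDE:                /* between [[ and ]] */
        if (c == ']')
            *st = ONE_CLOSE;
        else if (c == ' ')
            c = '-';
        break;
    case ONE_CLOSE:             /* seen one ']' inside a link */
        if (c == ']') {
            *st = OUTSIDE;
        } else {
            *st = INSIDE;       /* lone ']' does not close the link */
            if (c == ' ')
                c = '-';
        }
        break;
    }
    return c;
}

/* Stream the input to the output one byte at a time, in constant memory. */
void hyphenate_links(FILE *in, FILE *out)
{
    enum link_state st = OUTSIDE;
    int c;                      /* int, so EOF is distinct from byte 0xFF */

    while ((c = getc(in)) != EOF)
        putc(link_step(&st, c), out);
}
```

A main() that opens argv[1] and calls hyphenate_links(fp, stdout) would complete the program, with the output redirected to a new file. Because each input byte maps to exactly one output byte, the file is never held in memory, which matters at 52 GB.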