Replacing space with hyphen in a pattern.

I have a huge text file, about 52 GB. In the file, there are patterns like these:

[[Renaissance humanism|renaissance]]
[[Taoism|Taoist]]
[[Foundation for Economic Education]]
[http://www.theanarchistlibrary.org/HTML/Daniel_Guerin__Anarchism__From_Theory_to_Practice.html#toc2 ''Anarchism: From Theory to Practice'']
[[self-governance|self-governed]]

As you can see, text appears inside both [] and [[]], but I am only interested in [[]]. There is also text before and after these patterns, for example:

''Anarchism''' is a [[political philosophy]] that advocates [[self-governance|self-governed]] societies with voluntary institutions.

My aim is to replace every space with a hyphen in the text enclosed within [[]]. For example, the expected output that I am aiming for is:

[[Renaissance-humanism|renaissance]]
[[Taoism|Taoist]]
[[Foundation-for-Economic-Education]]
[http://www.theanarchistlibrary.org/HTML/Daniel_Guerin__Anarchism__From_Theory_to_Practice.html#toc2 ''Anarchism: From Theory to Practice'']
[[self-governance|self-governed]]

I have written a C program (pasted below) to solve this problem, but there seem to be two issues with it: first, it does not place the hyphens correctly, and second, it seems to crash when it encounters a non-ASCII character. I was wondering whether there is a way to solve the same problem using sed, awk, or something similar in Bash. The reason I want to move to a regular-expression tool is that even if I spend a lot of time editing and perfecting this code, it might still crash on non-ASCII characters, as I found it difficult to remove all of them from the file.

#include<stdio.h>
#include<string.h>
#include<stdlib.h>

int main ( int argc , char ** argv )
{
    FILE *text_file = NULL;
    text_file = fopen ( argv [ 1 ] , "r" );
    if(text_file == NULL )
    {
        fprintf(stderr,"file open error\n");
        exit(1);
    }
    char ch = '\0';
    
    while ( !feof ( text_file ))
    {
        ch = fgetc ( text_file );
        printf ("%c" , ch );
        if ( ch == '[')
        {
            ch = fgetc ( text_file );
            if ( ch == '[')
            {
                ch = fgetc ( text_file );
                while ( ch != ']' )
                {
                    printf("%c" , ch );
                    if ( isspace(ch))
                    {
                        ch='-';
                    }
                }
            }
            else
            {
                ungetc(ch,text_file);
            }
        }
    }
    
    fclose(text_file);
    return ( EXIT_SUCCESS);
}

How about

sed ':L;s/\(\[\[[^ ]*\) \([^]]*\]\)/\1-\2/;tL' file
[[Renaissance-humanism|renaissance]]
[[Taoism|Taoist]]
[[Foundation-for-Economic-Education]]
[http://www.theanarchistlibrary.org/HTML/Daniel_Guerin__Anarchism__From_Theory_to_Practice.html#toc2 ''Anarchism: From Theory to Practice'']
[[self-governance|self-governed]]
''Anarchism''' is a [[political-philosophy]]-that-advocates-[[self-governance|self-governed]] societies with voluntary institutions.

Thanks, and sorry for not being very clear in my initial post. The output that I am trying to get is:

'''Anarchism''' is a [[political-philosophy]] that advocates [[self-governance|self-governed]] societies with voluntary institutions.

Another example,

The first political philosopher to call himself an anarchist was [[Pierre-Joseph-Proudhon]], marking the formal birth of anarchism in the mid-nineteenth century. 

Some noisy text:

[[Online-etymology-dictionary]].</ref> The first known use of this word was in 1539.<ref>"Origin of ANARCHY

Therefore, I only want to replace a space with a '-' inside the pattern "[[ ]]" and nowhere else in the text file. The outputs that I get from the above two suggested solutions put hyphens everywhere.

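Add a closing square bracket to the first bracket expression in the search pattern ([^ ] becomes [^] ]), so that the match cannot run past the closing ]] into the surrounding text: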

sed ':L;s/\(\[\[[^] ]*\) \([^]]*\]\)/\1-\2/;tL' file

and try again.
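
In case it helps, here is a small sketch of the same command wrapped in a function, so it can be tried on a few sample lines before being run on the big file. The function name hyphenate_links and the sample lines are made up for illustration. Two things differ from the one-liner above, both assumptions on my part rather than requirements: the script is split into separate -e expressions, because some sed implementations read a label up to the end of the line instead of stopping at the semicolon, and LC_ALL=C is set so that sed treats the input as plain bytes, which should keep stray non-ASCII sequences from interfering with the bracket expressions.

```shell
#!/bin/sh
# hyphenate_links: hypothetical helper name, not part of the original thread.
# Reads text on stdin and replaces spaces with hyphens only inside [[...]].
hyphenate_links() {
    # LC_ALL=C makes sed work on raw bytes instead of multibyte characters.
    # Each pass of s/// replaces one space that follows "[[" with no "]" or
    # other space in between; "tL" jumps back to ":L" for as long as a
    # substitution succeeded on the current line.
    LC_ALL=C sed -e ':L' \
        -e 's/\(\[\[[^] ]*\) \([^]]*\]\)/\1-\2/' \
        -e 'tL'
}

# Sample input: two [[...]] links, surrounding prose, and a single-bracket link.
printf '%s\n' \
    "is a [[political philosophy]] that advocates [[self-governance|self-governed]] societies" \
    "[[Foundation for Economic Education]]" \
    "[http://example.org/page ''Some Title'']" |
    hyphenate_links
```

On these three lines it prints:

is a [[political-philosophy]] that advocates [[self-governance|self-governed]] societies
[[Foundation-for-Economic-Education]]
[http://example.org/page ''Some Title'']

Note that spaces after a | inside [[ ]] are hyphenated as well, so [[a|b c]] becomes [[a|b-c]]. Since sed works line by line, the real file can simply be streamed through the function with the output redirected to a new file.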