Replace a multi-line sequence of strings or numbers

khaled79 · May 20, 2013, 2:54pm

Hi

I have no experience with Unix, so any help would be appreciated.

I have the following text.

I need to find this sequence in file A:

45654
199
225

and replace it with this sequence in file B:

45654
258

so that the new file B contains the replaced sequence.

Any help?

Yoda · May 20, 2013, 3:04pm

Here is a solution using awk:

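awk '
        {
                # Load every line into the indexed array A
                A[++c] = $1
        }
        END {
                for ( i = 1; i <= c; i++ )
                {
                        if ( A[i] == 45654 && A[i+1] == 199 && A[i+2] == 225 )
                        {
                                A[i+1] = 258
                                A[i+2] = 0      # mark the third line for removal
                        }
                        # Note: this also skips input lines that are empty or 0
                        if ( A[i] )
                                print A[i]
                }
        }
' file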

khaled79 · May 20, 2013, 3:09pm

Thanks Yoda

But what if I want to search for a variable sequence instead of a known one? For example,

"variable number"
199
225

would become

"variable number"
258

Thanks

Yoda · May 20, 2013, 3:24pm

I didn't quite understand what you mean by variable sequence.

The program that I posted replaces 45654 199 225 with 45654 258.

You just have to modify it as per your requirement.

khaled79 · May 20, 2013, 3:50pm

Thanks again, Yoda.

Sorry for not being clear.

What I meant was:

What if I want to find the sequence that follows "XXX"

XXX
199
225

and replace it with

XXX
258

XXX could be any number between 1 and 260.

So every time, the sequence that follows that XXX should be replaced.

Thank you

Yoda · May 20, 2013, 4:10pm

So I assume that you are going to define the starting sequence in a variable.

In that case you can pass any shell variable to awk with -v and use it inside the program.

You can code something like:

SEQ=XXX

awk -v S="$SEQ" '
        {
                A[++c] = $1
        }
        END {
                for ( i = 1; i <= c; i++ )
                {
                        if ( A[i] == S && A[i+1] == 199 && A[i+2] == 225 )
                        {
                                A[i+1] = 258
                                A[i+2] = 0
                        }
                        if ( A[i] )
                                print A[i]
                }
        }
' file

I used the shell variable SEQ; replace the value XXX with the number of your choice. I hope this helps.
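
For the case where XXX may be any number from 1 to 260, a minimal sketch could test a numeric range instead of one fixed value. The function name replace_after_range and the variables LO and HI are assumptions made for illustration. The sketch keeps the array-based approach of the code above, but skips the replaced lines with i += 2 rather than marking them with 0, so input lines that are empty or 0 are kept.

```shell
# Hypothetical helper: replace "199, 225" with "258" after any lead value
# in the numeric range LO..HI (for example 1..260).
replace_after_range() {
    # $1 = lowest lead value, $2 = highest lead value, $3 = input file
    awk -v LO="$1" -v HI="$2" '
        {
            A[++c] = $1
        }
        END {
            for ( i = 1; i <= c; i++ )
            {
                # A lead line must be a plain integer inside LO..HI
                lead = ( A[i] ~ /^[0-9]+$/ && A[i] + 0 >= LO + 0 && A[i] + 0 <= HI + 0 )
                if ( lead && A[i+1] == 199 && A[i+2] == 225 )
                {
                    print A[i]
                    print 258
                    i += 2      # skip the two lines that were replaced
                }
                else
                    print A[i]
            }
        }
    ' "$3"
}
```

It could be called as replace_after_range 1 260 file > newfile. Like the original, it still reads the whole file into memory.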

rveri · May 20, 2013, 5:06pm

Khaled79,
Check this out:

v=45654 perl -0777 -pe 's/$ENV{v}\n199\n225/$ENV{v}\n258/igs' file

khaled79 · May 21, 2013, 9:14pm

rveri

It won't work!

 
# v=45654;perl -0777  -pe 's/$ENV{v}\n199\n225/$ENV{v}\n258/igs' file

---------- Post updated at 08:14 PM ---------- Previous update was at 07:45 PM ----------

Yoda

For small files awk works with no problem,
but with large files it shows this error:

 
awk: cmd. line:3: (FILENAME=a.txt FNR=18498251) fatal: more_nodes: nextfree: can't allocate 4000 bytes of memory (Cannot allocate memory)

This error was also printed in the Cygwin terminal:

line 3: 7488 Aborted (core dumped) awk -v S="$SEQ" '
{
A[++c] = $1
}
END {
for ( i = 1; i <= c; i++ )
{
if ( A[i] == S && A[i+1] == 199 && A[i+2] == 225 )
{
A[i+1] = 258
A[i+2] = 0
}
if ( A[i] )
print A[i]
}
}
' ascii.txt >pre.txt

Any help with this?

Thanks a lot

Yoda · May 21, 2013, 9:56pm

How about this awk code?

SEQ=45654

awk -v S="$SEQ" '
        $0 == S {
                V = $0
                getline
                if ( $0 == 199 )
                {
                        getline
                        if ( $0 == 225 )
                        {
                                print V RS "258"
                                next
                        }
                        else
                        {
                                print V RS "199" RS $0
                                next
                        }
                }
                else
                {
                        print V RS $0
                        next
                }
        }
        $0 != S {
                print $0
        }
' file

khaled79 · May 21, 2013, 10:25pm

Dear Yoda

I will test it now for large files I have

Thanks

---------- Post updated at 09:25 PM ---------- Previous update was at 08:58 PM ----------

Thanks Yoda

It works well for both large and small files.

Thank you very much.

Could you please tell me what the difference is between the two programs?

Khaled

Yoda · May 21, 2013, 10:34pm

The first awk program that I posted loads all the records of your file into an indexed array (A[++c] = $1) and performs the required operation only at the end.

This is what caused the program to throw the "Cannot allocate memory" error for large files.

The second awk program does not load the records into any variable or array; instead it checks record by record and performs the required operation as it goes.

Hence it is not memory intensive and works for large files.
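
The record-by-record idea can also be written as a small sliding window, which keeps at most three lines in memory at any time and flushes whatever is left at the end of the input. This is only a sketch under assumptions: the function name replace_seq and the buffer array buf are made up for illustration, and it is not the program posted above.

```shell
# Hypothetical helper: replace "S, 199, 225" with "S, 258" while holding
# no more than three lines in memory.
replace_seq() {
    # $1 = lead value, $2 = input file
    awk -v S="$1" '
        {
            buf[++n] = $0
            if ( n < 3 )
                next                    # window not full yet
            if ( buf[1] == S && buf[2] == 199 && buf[3] == 225 )
            {
                print buf[1]
                print 258
                n = 0                   # the whole window was consumed
            }
            else
            {
                print buf[1]            # oldest line can no longer match
                buf[1] = buf[2]
                buf[2] = buf[3]
                n = 2
            }
        }
        END {
            # Flush the lines still waiting in the window
            for ( i = 1; i <= n; i++ )
                print buf[i]
        }
    ' "$2"
}
```

A call such as replace_seq 45654 ascii.txt > pre.txt would mirror the earlier command line. Because lines are only printed once they leave the window, a lead value on the last line of the file is printed exactly once.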

khaled79 · May 22, 2013, 6:31pm

Yoda

Can I make it search for any of two or more numbers and, if one is found, replace the sequence? For example:

 
SEQ=45654 | 234567  |57899 

awk -v S="$SEQ" '
        $0 == S {
                V = $0
                getline
                if ( $0 == 199 )
                {
                        getline
                        if ( $0 == 225 )
                        {
                                print V RS "258"
                                next
                        }
                        else
                        {
                                print V RS "199" RS $0
                                next
                        }
                }
                else
                {
                        print V RS $0
                        next
                }
        }
        $0 != S {
                print $0
        }
' file

Thanks a lot

Khaled

Yoda · May 22, 2013, 6:40pm

Yes, you can. To implement this change, use the regular expression comparison operators ~ and !~ instead. The pattern is anchored so that only whole lines match:

SEQ="45654|234567|57899"

awk -v S="$SEQ" '
        BEGIN {
                # Anchor the alternation so that only whole lines match
                R = "^(" S ")$"
        }
        $0 ~ R {
                V = $0
                getline
                if ( $0 == 199 )
                {
                        getline
                        if ( $0 == 225 )
                        {
                                print V RS 258
                                next
                        }
                        else
                        {
                                print V RS "199" RS $0
                                next
                        }
                }
                else
                {
                        print V RS $0
                        next
                }
        }
        $0 !~ R {
                print $0
        }
' file
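
One thing to keep in mind is that an unanchored $0 ~ S also selects lines that merely contain one of the values, for example 1234567 for the value 234567. An alternative to a regular expression is an exact lookup table built with split. The sketch below compares the two selections; the function names match_whole_lines and match_substring are hypothetical and exist only for this comparison.

```shell
# Hypothetical helper: print the lines that are exactly one of the
# "|"-separated values given in $1.
match_whole_lines() {
    # $1 = values such as "45654|234567|57899", $2 = input file
    awk -v S="$1" '
        BEGIN {
            # Build a lookup table keyed by each listed value
            m = split(S, part, "[|]")
            for ( j = 1; j <= m; j++ )
                lead[part[j]]
        }
        $0 in lead
    ' "$2"
}

# Same selection with the unanchored regular expression, for comparison.
match_substring() {
    awk -v S="$1" '$0 ~ S' "$2"
}
```

The lookup compares strings exactly, so a line such as 045654 or one with trailing spaces would not be selected.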

khaled79 · May 24, 2013, 5:02pm

Yoda

I have used this code to print the numbers that come before and after a specific number:

SEQ=200
awk -v S="$SEQ" '
        {
                A[++c] = $1
        }
        END {
                for ( i = 1; i <= c; i++ )
                {
                        if ( A[i] == S )
                        {
                                print " the letter is " A[i]
                                print " followed by " A[i+1]
                                print " comes after " A[i-1]
                        }
                }
        }
' file

The result looked like the following:

 
the letter is 200
 followed by 202
 comes after 211
 the letter is 200
 followed by 223
 comes after 212
 the letter is 200
 followed by 202
 comes after 211

I need it to print a count of how many times each result occurred, rather than printing the same result repeatedly, so the output should be something like this:

 
2 times 
the letter is 200
 followed by 202
 comes after 211
 
1 times 
the letter is 200
 followed by 223
 comes after 212

How can I do it?

Thanks a lot
Khaled

Yoda · May 24, 2013, 5:11pm

You can code something like:

SEQ=200
awk -v S="$SEQ" '
        {
                A[++c] = $1
        }
        END {
                for ( i = 1; i <= c; i++ )
                {
                        if ( A[i] == S )
                        {
                                # Count each (value, next, previous) combination
                                C[A[i],A[i+1],A[i-1]]++
                                V[A[i],A[i+1],A[i-1]] = "the letter is " A[i] RS "followed by " A[i+1] RS "comes after " A[i-1]
                        }
                }
                for ( k in V )
                {
                        print C[k] " times"
                        print V[k]
                }
        }
' file

khaled79 · May 24, 2013, 5:23pm

Thanks Yoda

You are awesome!

How can I sort the output in descending order?

Thanks

Yoda · May 24, 2013, 5:40pm

By default, the order in which a for (i in array) loop scans an array is not defined; it generally depends on the internal implementation of arrays inside awk.

So you have to use an indexed array to preserve the order. Try this modified code:

SEQ=200
awk -v S="$SEQ" '
        {
                A[++c] = $1
        }
        END {
                for ( i = 1; i <= c; i++ )
                {
                        if ( A[i] == S )
                        {
                                # SUBSEP keeps the three parts of the key apart
                                K = A[i] SUBSEP A[i+1] SUBSEP A[i-1]
                                if ( !(K in V) )
                                        T[++j] = K      # remember first-seen order
                                C[K]++
                                V[K] = "the letter is " A[i] RS "followed by " A[i+1] RS "comes after " A[i-1]
                        }
                }
                for ( i = 1; i <= j; i++ )
                {
                        print C[T[i]] " times"
                        print V[T[i]]
                }
        }
' file

khaled79 · May 24, 2013, 5:50pm

The result wasn't sorted in descending order of the number of times:

 
25 times
the letter is 200
followed by 202
comes after 211
36 times
the letter is 200
followed by 223
comes after 212

It should print the 36 times result followed by the 25 times result.

Thanks

Yoda · May 24, 2013, 6:00pm

I'm sorry, I misread your requirement. I thought you wanted to print the records in the order in which they appear in your input file.

If you have gawk, you can use the code below to print in descending order:

SEQ=200
gawk -v S="$SEQ" '
        {
                A[++c] = $1
        }
        END {
                for ( i = 1; i <= c; i++ )
                {
                        if ( A[i] == S )
                        {
                                C[A[i],A[i+1],A[i-1]]++
                                V[A[i],A[i+1],A[i-1]] = "the letter is " A[i] RS "followed by " A[i+1] RS "comes after " A[i-1]
                        }
                }
                for ( k in V )
                {
                        T[++j] = C[k]
                }
                # asort is a gawk extension; it sorts the counts in ascending order
                n = asort(T)
                for ( i = n; i >= 1; i-- )
                {
                        # Skip a count that was already printed, so ties are not repeated
                        if ( i < n && T[i] == T[i+1] )
                                continue
                        for ( k in V )
                        {
                                if ( C[k] == T[i] )
                                {
                                        print C[k] " times"
                                        print V[k]
                                }
                        }
                }
        }
' file
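
Where gawk's asort is not available, one portable option is to let awk do the counting and hand the ordering to the external sort utility. The following is a minimal sketch under assumptions: the function name count_neighbours is hypothetical, and each result is printed on a single line (unlike the multi-line format above) so that sort can order it. It also counts with a window of previous lines instead of loading the whole file.

```shell
# Hypothetical helper: count each (value, next line, previous line)
# combination and print the counts in descending order.
count_neighbours() {
    # $1 = value to look for, $2 = input file
    awk -v S="$1" '
        function note(cur, nxt, prv) { C[cur SUBSEP nxt SUBSEP prv]++ }

        # prev1 is the previous line and prev2 the one before it, so a match
        # on prev1 can be counted as soon as the line after it is known.
        NR >= 2 && prev1 == S { note(prev1, $1, prev2) }
        { prev2 = prev1; prev1 = $1 }

        END {
            # A match on the very last line has no following line
            if ( NR >= 1 && prev1 == S )
                note(prev1, "", prev2)
            for ( k in C )
            {
                split(k, f, SUBSEP)
                printf "%d times: the letter is %s, followed by %s, comes after %s\n", C[k], f[1], f[2], f[3]
            }
        }
    ' "$2" | sort -k1,1nr
}
```

It could be called as count_neighbours 200 file. Results with equal counts come out in whatever order sort gives them.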

khaled79 · May 24, 2013, 6:08pm

Thanks Yoda
Suppose that I have texts in different languages.

How could I print the n most frequent words,

such as the 10 or 100 most frequent?

It is easy to do for A-Z, which is English,

but how can I do it regardless of the language of the input text?

Thanks