How to remove everything after a word containing a string?

baris35 · January 25, 2019, 12:33pm

Hello,
I want to find the word that contains a search string and remove every word that comes after it on the line.

source*.txt

#!bin/bash
#test1
http://www.aa.bb.cc http://www.xx.yy http://www.11.22.44
#test2
http://www.11.rr.cd http://www.01.yy http://www.yy.22.tt
#test3
http://www.22.qq.fc http://www.0x.yy http://www.t1.22.pk

readfile

aa.bb
11.rr.cd
qq.fc

Expected output:

#!bin/bash
#test1
http://www.aa.bb.cc
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc

I only know how to delete everything that comes after a string, like this:

while read COL1
do
sed 's/$COL1.*//' source
done <readfile

I'd appreciate it if you could help me.

Thank You
Boris

RudiC · January 25, 2019, 1:08pm

I remember telling you a few hours ago that variable expansion doesn't work within single quotes, so $COL1 will never be expanded. On top of that, even IF $COL1 were expanded, there still is a logical error in your sed command: it would remove $COL1 together with the rest of the line. And that sed command is executed on the entire source file for every single line in readfile.
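A minimal sketch of both points, assuming a scratch directory and a single hard-coded pattern (the paths and the variable names single and double are made up for the demo). Double quotes let the shell expand $COL1, and capturing the whole word keeps it instead of deleting it. The pattern is still treated as a regular expression, so each dot matches any character.

```shell
#!/bin/sh
# Demo file in a scratch directory (hypothetical path).
dir=$(mktemp -d)
printf '%s\n' '#test1' \
  'http://www.aa.bb.cc http://www.xx.yy http://www.11.22.44' > "$dir/source.txt"

COL1='aa.bb'

# Single quotes: sed receives the text $COL1 literally, so nothing matches
# and the file comes back unchanged.
single=$(sed 's/$COL1.*//' "$dir/source.txt")

# Double quotes: the shell expands $COL1 first. The group captures the
# whole word containing the pattern; everything after it is dropped.
double=$(sed "s/\([^ ]*$COL1[^ ]*\).*/\1/" "$dir/source.txt")

printf '%s\n' "$double"
rm -rf "$dir"
```

This still handles only one pattern per sed call, which is why the awk approaches in the following posts read all patterns first.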

I think a better approach is awk. Try

awk 'NR==FNR {T[$1]; next} {for (t in T) if ($1 ~ t) $0 = $1} 1' file2 file1
#!bin/bash
#test1
http://www.aa.bb.cc
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc

nezabudka · January 25, 2019, 1:23pm

Just in case the matches are not in the first field.

awk 'NR==FNR {T[$0]; next} {for(i=1;i<=NF;i++){for(t in T) if ($i ~ t) $0=$i}} 1' readfile source

baris35 · January 25, 2019, 1:48pm

I am sorry. As I have multiple files to process, I tried the following but could not get the output:

for file in source*.txt
do
awk 'NR==FNR {T[$0]; next} {for(i=1;i<=NF;i++){for(t in T){if ($i ~ t) $0=$i}}} 1' readfile $file \
> report_$(basename "${file/.txt}").txt
done

Thank you
Boris

RudiC · January 25, 2019, 2:13pm

With 250 posts in these forums, you may have learned that without detailed error information, analysis and debugging is close to impossible. So please provide:

Any error messages?
Are the output files created?
What are their contents?
Did you run the script with the -x (xtrace) option set?

RudiC · January 25, 2019, 2:17pm

@nezabudka: good approach, but you'll lose the leading fields if the pattern is found in later fields... does this come closer?

awk 'NR==FNR {T[$1]; next} {for (t in T) if ($0 ~ t) sub (t".*", t)} 1' file2 file1

EDIT: Indeed, you can drop the if:

awk 'NR==FNR {T[$1]; next} {for (t in T) sub (t".*", t)} 1' file2 file1
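For readers new to the idiom, here is a small sketch of how the command above behaves on part of the sample data from the first post. The scratch directory and the variable name out are assumptions for the demo.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 'aa.bb' '11.rr.cd' 'qq.fc' > "$dir/readfile"
printf '%s\n' \
  '#test1' \
  'http://www.aa.bb.cc http://www.xx.yy http://www.11.22.44' \
  '#test2' \
  'http://www.11.rr.cd http://www.01.yy http://www.yy.22.tt' > "$dir/source"

# NR==FNR is true only while awk reads the first file (readfile), so the
# first block stores each pattern as an index of T and skips to the next
# line. For the second file, sub() replaces "pattern plus the rest of the
# line" with the pattern itself; the trailing 1 prints every line.
out=$(awk 'NR==FNR {T[$1]; next} {for (t in T) sub(t ".*", t)} 1' \
      "$dir/readfile" "$dir/source")

printf '%s\n' "$out"
rm -rf "$dir"
```

Because the replacement is the pattern and not the whole word, the first URL comes out as http://www.aa.bb, without the trailing .cc.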

nezabudka · January 25, 2019, 2:32pm

Boris,
Why do you remove the file extension and then immediately add it back?

> report_$(basename "${file/.txt}").txt

And why run it in a loop? That only costs time. Try this first:

awk 'NR==FNR {T[$0]; next} {for(i=1;i<=NF;i++){for(t in T) if ($i ~ t) $0=$i}} 1' readfile source*.txt

To localize the error, try adding a filter, for example

awk 'NR==FNR {T[$0]; next} NF > 1 && NF < 4 {for(i=1;i<=NF;i++){for(t in T) if ($i ~ t) $0=$i}} 1' readfile source*.txt
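On the extension question, a short sketch shows that stripping .txt and appending it again yields the same name as using basename alone. The path data/source1.txt is a made-up example, and ${file%.txt} is used here as the POSIX spelling of the bash-style ${file/.txt}.

```shell
#!/bin/sh
file='data/source1.txt'   # hypothetical input path

# Original form: strip ".txt", take the basename, append ".txt" again.
long="report_$(basename "${file%.txt}").txt"

# Shorter equivalent: basename keeps the extension by itself.
short="report_$(basename "$file")"

printf '%s\n%s\n' "$long" "$short"
```

Both variables end up holding report_source1.txt for this input.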

RudiC · January 25, 2019, 2:47pm

@nezabudka: The shell loop is used to create the respective output files. If you do it all in awk, redirect the output directly within it, like this:

awk '
FNR == NR       {T[$1]
                 next
                }
FNR == 1        {if (FN) close (FN)
                 FN = "report_" FILENAME
                }
                {for (t in T) sub (t".*", t)
                 print > FN
                }
'   readfile source*

baris35 · January 25, 2019, 11:20pm

PS: readfile and source are the same as I posted under this thread.

Hello,
I am sorry for the headache.
I only mentioned in my follow-up script that I have several files to process.
So, I have edited the first post.
What I typed in the first post gives the expected result.
I need time to check what went wrong on my end.

Thank you
Boris

--- Post updated at 05:09 PM ---

Hello Again,
Here is the output:

root@house:~/test# awk 'NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1' readfile source
awk: cmd. line:1: NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1
awk: cmd. line:1:            ^ syntax error
awk: cmd. line:1: error: invalid subscript expression
awk: cmd. line:1: NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1
awk: cmd. line:1:                                         ^ syntax error
awk: cmd. line:1: NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1
awk: cmd. line:1:                                              ^ 1 is invalid as number of arguments for sub

Thank you
Boris

--- Post updated at 11:20 PM ---

Hello,
As I ran into problems with awk, I sorted it out with the following steps, in short:

- Remove the first line of the source file
- Grep all lines containing COL1 in the source file > output1
- Grep all lines not containing COL1 in the source file > output2
- Paste -d '\n' both output files

I am sorry for the trouble I caused.

Thank you
Boris

RudiC · January 26, 2019, 4:14am

Wouldn't it make serious sense to read and try to understand the error message(s)? And, compare your code to the proposal given in post #6 ... see the difference?

baris35 · January 26, 2019, 6:31am

No, I do not see a difference when I run both separately.
As I do not understand awk, I wrapped the code given in #6 in echo to see what it prints:
s2.sh

while read COL1 COL2
do
echo " awk 'NR==FNR {T[$1]; next} {for (t in T) if ($1 ~ t) $0 = $1} 1' readfile source "
done<readfile

output

awk 'NR==FNR {T[]; next} {for (t in T) if ( ~ t) ./s2.sh = } 1' readfile source
awk 'NR==FNR {T[]; next} {for (t in T) if ( ~ t) ./s2.sh = } 1' readfile source
awk 'NR==FNR {T[]; next} {for (t in T) if ( ~ t) ./s2.sh = } 1' readfile source

When I run without echo :

#!bin/bash
#test1
http://www.aa.bb.cc
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc
#!bin/bash
#test1
http://www.aa.bb.cc
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc
#!bin/bash
#test1
http://www.aa.bb.cc
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc

When I use the second code from #6:

s2.sh

while read COL1 COL2
do
echo " awk 'NR==FNR {T[$1]; next} {for (t in T) sub (t".*", t)} 1' readfile source "
done<readfile

output:

 awk 'NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1' readfile source
 awk 'NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1' readfile source
 awk 'NR==FNR {T[]; next} {for (t in T) sub (t.*, t)} 1' readfile source

When I run s2.sh without echo:

#!bin/bash
#test1
http://www.aa.bb
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc
#!bin/bash
#test1
http://www.aa.bb
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc
#!bin/bash
#test1
http://www.aa.bb
#test2
http://www.11.rr.cd
#test3
http://www.22.qq.fc

I sorted it out with grep + sed + paste commands. Awk is more complicated for me.

Thank you
Boris

RudiC · January 26, 2019, 7:01am

What in the shown result of s2.sh does not satisfy your needs? Looks perfect to me, considering the code you presented.

The difference is that the T array index is $1 in the working code, and empty in your error case. I desperately try to understand why "Awk is more complicated for" you and why you forgo the efficient, complete solutions presented to you in posts #7 or #8, falling back to highly inefficient band-aid pseudo solutions.

Consider the case the bumper has fallen off your car. The professional repair shop grabs their MIG welder, welds the screw nuts back to the carrier beam, and with a ratchet screws the bumper back on. Amateurs use chewing gum to fill the gaps, and scotch tape to glue the bumper back.

awk (or perl, sed, etc.) is the MIG welder and the ratchet at your fingertips.
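The empty index can be reproduced in isolation. A minimal sketch, assuming the script is run without arguments (the variable names expanded, literal, pat and word are made up for the demo): inside double quotes the shell replaces $1 before echo runs, while single quotes leave it for awk. When a shell value really is needed inside awk, -v passes it in without breaking the single quotes.

```shell
#!/bin/sh
set --                       # clear positional parameters for the demo

# Double quotes: the shell expands $1 (empty here) before echo sees it.
expanded=$(echo "T[$1]")
# Single quotes: the text reaches the command untouched.
literal=$(echo 'T[$1]')

# Passing a shell value into awk: -v assigns it to an awk variable.
pat='aa.bb'
word=$(printf '%s\n' 'http://www.aa.bb.cc http://www.xx.yy' |
       awk -v pat="$pat" '{for (i = 1; i <= NF; i++) if ($i ~ pat) print $i}')

printf '%s\n%s\n%s\n' "$expanded" "$literal" "$word"
```

The first line printed is T[], which is exactly the subscript awk complained about in the error output.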

baris35 · January 26, 2019, 7:16am

So, should I tell awk which column to search? Sed + grep are like medium-frequency welding technology for me. So far, awk seems incomprehensible to me, even after your detailed explanation. I need to read more and more.

Thank you for your time.
Boris

--- Post updated at 07:16 AM ---

Hello RudiC,
It requires:

perl -i -ne 'print unless ${$_}++' output

Don't worry, the problem is solved, just in a slightly longer way.

Kind regards
Boris

nezabudka · January 27, 2019, 1:56pm

I apologize to the thread author for the digression.
I thought it was too beautiful to work correctly.

I have simplified:

awk 'NR==FNR {T[$1]; next} {for (t in T) sub (t".*", t)} 1' readfile source

The task as stated is to remove the rest of the line after the match.
Logically, though, the whole address in which the match is found should be kept.
Example
result: http://www.aa.bb
result as I imagine it: http://www.aa.bb.cc

awk 'NR==FNR {T[$1]; next} {for (t in T) sub(t, "\r"t)
$0 = gensub(/\r([^ ]*).*/, "\\1", 1)} 1' readfile source

Regards to RudiC; the solution is still very beautiful and concise.
There was something to learn from it.
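Since gensub() is a GNU awk extension, here is a hedged alternative sketch that keeps the whole matching word using only match() and substr(). It is not the method from the thread; the scratch directory and the variable name out are assumptions, and the result depends on the unspecified for-in order if several patterns match the same line.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 'aa.bb' '11.rr.cd' 'qq.fc' > "$dir/readfile"
printf '%s\n' \
  '#!bin/bash' \
  '#test1' \
  'http://www.aa.bb.cc http://www.xx.yy http://www.11.22.44' \
  '#test2' \
  'http://www.11.rr.cd http://www.01.yy http://www.yy.22.tt' \
  '#test3' \
  'http://www.22.qq.fc http://www.0x.yy http://www.t1.22.pk' > "$dir/source"

# match() sets RSTART and RLENGTH for the whole word that contains the
# pattern; substr() then keeps the line up to the end of that word.
out=$(awk 'NR==FNR {T[$1]; next}
  {for (t in T)
     if (match($0, "[^ ]*" t "[^ ]*")) {
       $0 = substr($0, 1, RSTART + RLENGTH - 1)
       break
     }
  } 1' "$dir/readfile" "$dir/source")

printf '%s\n' "$out"
rm -rf "$dir"
```

On the sample data this reproduces the expected output from the first post, including the full http://www.aa.bb.cc.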

nezabudka · January 29, 2019, 1:52am

I found another interesting solution: truncate the record at the matching field by setting NF=i.

awk 'NR==FNR {T[$0]; next} {for(t in T) {for(i=1;i<=NF;i++) if ($i ~ t) NF=i}} 1' readfile source*
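One caveat applies to all the ~ and sub() variants in this thread: the entries in readfile are used as regular expressions, so each dot matches any character. The sketch below contrasts that with a literal comparison via index(). The host aaXbb, the scratch directory and the variable names are made up for the demo, and truncating a record by assigning NF is assumed to behave as it does in gawk and mawk.

```shell
#!/bin/sh
dir=$(mktemp -d)
printf '%s\n' 'aa.bb' > "$dir/readfile"
# "aaXbb" is a made-up host that only a regex match would accept.
printf '%s\n' 'http://www.aaXbb.cc http://www.xx.yy' > "$dir/source"

# Regex match: "." in the pattern matches any character, so aaXbb hits
# and the record is cut after the first field.
regex=$(awk 'NR==FNR {T[$0]; next}
  {for (i = 1; i <= NF; i++) for (t in T) if ($i ~ t) {NF = i; break}} 1' \
  "$dir/readfile" "$dir/source")

# Literal match: index() returns 0 unless the exact text occurs, so the
# line is left alone.
literal=$(awk 'NR==FNR {T[$0]; next}
  {for (i = 1; i <= NF; i++) for (t in T) if (index($i, t)) {NF = i; break}} 1' \
  "$dir/readfile" "$dir/source")

printf '%s\n%s\n' "$regex" "$literal"
rm -rf "$dir"
```

Whether the regex or the literal behaviour is wanted depends on the data; for plain host names the literal form is the more cautious choice.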

baris35 · February 22, 2019, 7:22pm

Hello,
I was away. I have just tested the code and it works nicely.
Thank you so much

Kind regards
Boris