Matching only the strings I provide - sed

jozo95 · January 14, 2016, 5:07am

Hello..

I am currently learning sed and have found myself in some trouble..

I wrote this command:

sed -ne 's/[^-<>]*\([-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*\).*\([-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*\).*/\1\2/p'

and some of the output I get is:

->stockholm->paris<-stockholmpi<-tokyo->paris<-stockholmpi
->stockholm<-stockholm->tokyo<-tokyo<-paris->stockholmtao
->paris<-stockholm<-tokyo<-paris<-tokyo->stockholm
<-tokyo<-stockholm->tokyo<-tokyo->stockholm->paris

As you can see, the output does not always end with stockholm/paris/tokyo, because my pattern still matches those extra letters. How would I change my pattern to avoid this?

I tried (stockholm|tokyo|paris), but then I don't get the last city (stockholmpi, for example, should be stockholm only).
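
One possible reason the attempt failed, assuming the parentheses were typed unescaped and sed was run without -E: in that mode sed reads the pattern as a basic regular expression, where ( | ) are ordinary characters. A small sketch of the difference (the variable names bre and ere are only for illustration):

```shell
# Without -E the pattern is a basic regex, so ( | ) are literal characters
# and the input "tokyo" does not match.
bre=$(printf '%s\n' 'tokyo' | sed -n 's/(stockholm|tokyo|paris)/CITY/p')

# With -E (extended regex) the same text is a real alternation.
ere=$(printf '%s\n' 'tokyo' | sed -n -E 's/(stockholm|tokyo|paris)/CITY/p')

printf 'BRE: "%s"\nERE: "%s"\n' "$bre" "$ere"
# BRE: ""
# ERE: "CITY"
```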

EDIT: Here is some of the data I use:

Wed3.14153<-paris<-stockholm->tokyo'->paris<-stockholm->parisphi$$
fubartao<-tokyo<-stockholm<-tokyoJul->paris->tokyo<-parisRed3.14153
$chi<-tokyo<-paris<-stockholmMar->tokyo<-stockholm->tokyoGreen 
Feb3.14153<-tokyo->tokyo<-parisBLACK<-paris<-tokyo->tokyoMar 
1011102.8<-stockholm<-tokyo<-tokyoblah<-stockholm<-stockholm<-tokyo3.14153001111
taoBLACK<-tokyo->paris->paris ->stockholm<-paris->stockholmThu3.14153
MayJun<-paris->paris<-stockholmSun->stockholm->tokyo->stockholm011011Green
NILLNULL->tokyo<-paris<-parisSep->stockholm->tokyo<-parisJunFri
AugFeb->stockholm<-stockholm->parisBLACK<-tokyo<-paris<-tokyoVOIDpi
 <-paris->paris->parisfoo->stockholm->paris->stockholm$NULL
chi3.14153<-paris<-paris<-tokyofoo<-stockholm<-paris->stockholm`100110
foo$$<-tokyo<-stockholm<-stockholm101101<-paris<-tokyo<-tokyo"Purple
fubarPurple->tokyo<-paris->paris ->tokyo<-paris<-tokyo`3.14
BlueMay->paris->stockholm<-stockholmVOID->stockholm->paris<-tokyoYellowphi
0101002.8<-tokyo->paris<-tokyotao<-tokyo<-tokyo->stockholmfooNULL
RedWed->paris->paris<-stockholmNILL<-tokyo<-paris->tokyoPurple 
100100$$$->paris->paris<-tokyo001011<-paris->paris->tokyoMonSep
Jan010001->paris->paris<-stockholmAug->tokyo<-paris->stockholmPurpleSep
->paris->paris<-tokyoblah<-stockholm<-stockholm<-paris010001tao
Purplefubar->stockholm<-paris->tokyoDec->paris->stockholm->tokyo$3.1415
010001->paris<-stockholm->tokyoVOID->tokyo<-stockholm<-tokyoMarFeb
SunFri->tokyo->paris<-tokyoJan->paris<-stockholm->tokyoWHITEMon

EDIT After RudiC's post:

Okay, so the logic behind this pattern is:

It starts with a '->' or a '<-' followed by a city, for example ->tokyo.
After the city comes another arrow followed by another city, for example ->tokyo->paris.
Then again an arrow followed by a city, for example ->tokyo->paris<-tokyo.
Then some random text comes in between. If you look at the last line in the data I've posted, you can see that after "->tokyo->paris<-tokyo" comes "Jan", which is random text; we don't want this.
Then we meet our pattern again, the same pattern as before.

This is the ideal result: ->tokyo->paris<-tokyo->paris<-stockholm->tokyo
I do get that on this specific line, but on some other lines I get output like this:

 ->stockholm->paris<-stockholmpi<-tokyo->paris<-stockholmpi

Here the third city and the last city each have two extra letters (pi). That is because in my pattern I write:

[stockholm,paris,tokyo]*

which in turn also matches a stray 'p' and 'i', since both letters occur in paris.
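
A small sketch of why that happens, using made-up test strings: the square brackets define a set of single characters (commas included), so any run of those characters matches, in any order. The variable names full and junk are only for illustration.

```shell
# [stockholm,paris,tokyo] is one bracket expression: a set of single
# characters, not a list of three words. & in the replacement is the match.
full=$(printf '%s\n' 'stockholmpi' | sed 's/[stockholm,paris,tokyo]*/[&]/')
junk=$(printf '%s\n' 'pi,hat' | sed 's/[stockholm,paris,tokyo]*/[&]/')

printf '%s\n%s\n' "$full" "$junk"
# [stockholmpi]
# [pi,hat]
```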

Now, how would I force sed to choose only between the exact strings I provided, which are stockholm, paris and tokyo?

EDIT: Solved it by using parentheses. Here is the solution:

sed -ne 's/[^-<>]*\([-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*\).*\([-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}[stockholm,paris,tokyo]*[-><]\{2\}\(stockholm\|paris\|tokyo\)\{1\}\).*/{Phil}2053,\1{5872Phil}\2[->->]/p' datasets/q14target.txt
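
Note that this command uses the alternation only for the final city. The other five positions still use the bracket expression, so stray letters such as the "tao" in "tokyotao" can still slip through there; also, \| in a basic regex is a GNU extension. A minimal sketch that applies the alternation to all six cities, assuming a sed that accepts -E (GNU and BSD sed do). The variable names arrow, city and triple are only for illustration, the sample lines come from the data posted above, and the replacement is the plain \1\5 rather than the decorated output format of the command above.

```shell
arrow='(->|<-)'
city='(stockholm|paris|tokyo)'
# One triple = three arrow+city pairs. Each triple opens four groups:
# the outer triple, the repeated pair, the arrow and the city.
# So the first triple is \1 and the second triple is \5.
triple="(($arrow$city){3})"

extract() {
    printf '%s\n' "$1" | sed -n -E "s/[^-<>]*$triple.*$triple.*/\1\5/p"
}

r1=$(extract '0101002.8<-tokyo->paris<-tokyotao<-tokyo<-tokyo->stockholmfooNULL')
r2=$(extract 'SunFri->tokyo->paris<-tokyoJan->paris<-stockholm->tokyoWHITEMon')

printf '%s\n%s\n' "$r1" "$r2"
# <-tokyo->paris<-tokyo<-tokyo<-tokyo->stockholm
# ->tokyo->paris<-tokyo->paris<-stockholm->tokyo
```

Lines that do not contain two complete triples print nothing, because of -n together with the p flag.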

RudiC · January 14, 2016, 5:27am

That spec is not too helpful. Please describe the desired results and the logic/algorithm to achieve them.

jozo95 · January 14, 2016, 5:38am

Edited my post, thanks.

RudiC · January 14, 2016, 6:10am

Why that? Is "Jan" acceptable?

I'm afraid that without a list of acceptable cities, there's no chance to remove random text.

jozo95 · January 14, 2016, 6:12am

Sorry, only "stockholm", "tokyo" and "paris" are acceptable. That was my mistake, sorry about that.

RudiC · January 14, 2016, 6:44am

I don't think sed can solve this efficiently. How about awk? Try

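awk '
# Build C, an array whose keys are the accepted city names.
BEGIN   {for (n=split("stockholm paris tokyo", T); n>0; n--) C[T[n]] = n
        }
# Fields are split on "-"; V ends up as the city found in field i, or "".
        {for (i=1; i<=NF; i++)  {V = ""
                                 for (c in C) if ($i ~ c) V = c
                                 # Replace the first run of characters other than < and > with V.
                                 sub (/[^<>]+/, V, $i)
                                 printf "%s%s", $i, (i<NF)?FS:""
                                }
         printf "\n"
        }
' FS="-" file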
<-paris<-stockholm->tokyo->paris<-stockholm->paris
<-tokyo<-stockholm<-tokyo->paris->tokyo<-paris
<-tokyo<-paris<-stockholm->tokyo<-stockholm->tokyo
<-tokyo->tokyo<-paris<-paris<-tokyo->tokyo
<-stockholm<-tokyo<-tokyo<-stockholm<-stockholm<-tokyo
<-tokyo->paris->paris->stockholm<-paris->stockholm
.
.
.

---------- Post updated at 12:44 ---------- Previous update was at 12:42 ----------

Or even

awk '
BEGIN   {for (n=split("stockholm paris tokyo", T); n>0; n--) C[T[n]] = n
        }
        {for (i=1; i<=NF; i++)  {V = ""
                                 for (c in C) if ($i ~ c) V = c
                                 sub (/[^<>]+/, V, $i)
                                }
        }
# The bare 1 is an always-true pattern, so awk prints each record, rebuilt with OFS.
1
' FS="-" OFS="-" file

jozo95 · January 14, 2016, 6:53am

I kinda need to use sed for this assignment.

RudiC · January 14, 2016, 6:56am

Homework and coursework questions can only be posted in the Homework & Coursework forum under special homework rules.

Please review the rules, which you agreed to when you registered, if you have not already done so.

Please repost your problem in the correct forum with a completely filled out homework template.

jozo95 · January 14, 2016, 6:58am

Oh, sorry, I saw that this section had sed in the description, so I thought it would be appropriate.
I'll repost my thread in the proper section.