Shell Scripting help

I have a text file with many lines, and I need to extract a few details from it and write them to another file in CSV format.

I am using this command to fetch the values.

grep 'pName="vin'  temp.txt | sed -n 's:.*<mis>\(.*\)</mis>.*<seg>\(.*\)</seg>.*:\1\,\2:p' 

and I get this output:

11111111,Pit
333333,zit

But how can I get dateTime in the output as well (in the format shown below)?

For Ex:
temp.txt (file)

<l:ev dateTime="2019-06-14 08:30" pName="vin"> <mis>11111111</mis><seg>Pit</seg> </l:ev>
<l:ev dateTime="2019-06-14 09:30" pName="sin"> <mis>222222</mis><seg>sit</seg> </l:ev>
<l:ev dateTime="2019-06-14 10:30" pName="vin"> <mis>333333</mis><seg>zit</seg> </l:ev>

output expected:

2019-06-14 08:30,11111111,Pit
2019-06-14 10:30,333333,zit

Also, one more thing: if the file changes and contains prefixed tags like <val:mis>, as shown below, then the above command does not work.

<l:ev dateTime="2019-06-14 08:30" pName="vin"> <val:mis>11111111</val:mis><val:seg>Pit</val:seg> </l:ev>

How about a brute-force approach - YMMV:

 awk -F'["<>]' '$5=="vin" {print $3, $9, $13}' OFS=, myFile.txt

Or an adaptation of your sed command (used in place of the sed in your pipeline):

sed -n 's:.*dateTime="\([^"]*\).*<mis>\(.*\)</mis>.*<seg>\(.*\)</seg>.*:\1\,\2,\3:p'

Thanks

Different approach, including your val: case:

sed -n 's/" pName="vin">//; T; s/^.*dateTime="//; s/<[^>]*>/,/g; s/[ ,]\{2,\}/,/gp' file
2019-06-14 08:30,11111111,Pit,
2019-06-14 10:30,333333,zit,
2019-06-14 08:30,11111111,Pit,
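
Note that the T command in the one-liner above is a GNU sed extension, and the output keeps a trailing comma. A minimal sketch of a portable variant, assuming every wanted line contains the literal text " pName="vin"> (the function name vin_to_csv is made up for illustration):

```shell
# vin_to_csv FILE: hypothetical helper, prints dateTime plus all tag values as CSV.
# Steps, only on lines containing " pName="vin">:
#   1. drop the pName attribute and the end of the opening tag
#   2. drop everything up to and including dateTime="
#   3. turn every remaining tag into a comma
#   4. squeeze runs of spaces/commas into a single comma
#   5. drop the trailing comma, then print
vin_to_csv() {
    sed -n '/" pName="vin">/{
        s/" pName="vin">//
        s/^.*dateTime="//
        s/<[^>]*>/,/g
        s/[ ,]\{2,\}/,/g
        s/,$//
        p
    }' "$1"
}
```

Usage: vin_to_csv temp.txt. Like the original, this prints every element found on the line, so an extra element still adds an extra column.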

Thanks for this.......I can use awk, but in a few lines the sequence of elements may change or an extra element might be added, and that would break the expected output.

--- Post updated at 05:41 PM ---

Thanks for this ......but this is not working when a prefix like val is added. For example, if your file contains these lines:

<l:ev dateTime="2019-06-14 08:30" pName="vin"> <mis>11111111</mis><seg>Pit</seg> </l:ev>
<l:ev dateTime="2019-06-14 09:30" pName="vin"> <val:mis>222</val:mis><val:seg>sit</val:seg> </l:ev>
<l:ev dateTime="2019-06-14 09:30" pName="vin"> <nim:mis>222</nim:mis><nim:seg>sit</nim:seg> </l:ev>

Yes, it doesn't work, because your search string is "<mis>" and what the file contains is "<val:mis>" (and "<seg>" instead of "<val:seg>", etc.). It is rather obvious that you find only what you search for, nothing else. No?

But isn't it obvious how the command above must be changed to reflect the changes in your input? I am convinced that a brilliant young man like you can do that, can't you? Just show us what you tried.

bakunin

Hi Bakunin,

I was using this command and was getting the expected output, but then I found that in a few lines the tags contain a prefix as well.

 grep 'pName="vin'  temp.txt | sed -n 's:.*<mis>\(.*\)</mis>.*<seg>\(.*\)</seg>.*:\1\,\2:p' 

So the prefix differed in a few lines (like <s:mis> or <t:mis>), but the element name to grep for is always the same (mis or seg). I was trying to make the prefix optional using an asterisk in the command, like this:

 ( grep 'pName="vin'  temp.txt | sed -n 's:.*<*mis>\(.*\)</*mis>.*<*seg>\(.*\)</*seg>.*:\1\,\2:p' ) 

but it seems an asterisk can't be used inside tags to make the prefix optional, or I am not aware of how.

--- Post updated at 03:47 AM ---

Thanks RudiC....This works perfectly, but if the sequence changes or a new element appears in a line, that also gets printed. I was thinking of grepping for specific element values only (like the mis and seg values), and for this I was using the command below:

grep 'pName="vin'  temp.txt | sed -n 's:.*<mis>\(.*\)</mis>.*<seg>\(.*\)</seg>.*:\1\,\2:p'

To allow for prefixes before mis or seg as well, I was trying an asterisk to make them optional, but it seems the way I am using it is not correct.

grep 'pName="vin'  temp.txt | sed -n 's:.*<*mis>\(.*\)</*mis>.*<*seg>\(.*\)</*seg>.*:\1\,\2:p'

Sequence or prefix may change this way.

<l:ev dateTime="2019-06-14 08:30" pName="vin"> <mis>11111111</mis><seg>Pit</seg> </l:ev>
<l:ev dateTime="2019-06-14 09:30" pName="vin"> <val:xyz>4444</val:xyz><val:seg>sit</val:seg><val:mis>222</val:mis>< </l:ev>
<l:ev dateTime="2019-06-14 09:30" pName="vin"> <n:mis>222</n:mis><n:seg>sit</n:seg> </l:ev>

That is not the reason at all. In fact you should read about (POSIX basic) regular expressions, because you obviously don't correctly understand how they work:

The asterisk ("*") makes the previous expression optional, but it doesn't match anything in itself. It means "zero or more occurrences of what comes before". Here is an example:

The regular expression "abcd" matches a fixed string, "a", followed by "b", followed by "c", followed by "d". Now, if you change it to "abc*d" its meaning changes to: "a", followed by "b" followed by zero or more occurrences of "c", followed by "d". Here is an example list of strings that would be matched by this expression:

"abd"
"abcd"
"abccd"
"abccccccd"
etc.
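
You can check this yourself with grep. A small sketch (the two extra test strings "abxd" and "acd" are my own additions, to show what does not match):

```shell
# Keep only the strings that the BRE "abc*d" matches as a whole line (-x).
matches=$(printf '%s\n' abd abcd abccd abccccccd abxd acd | grep -x 'abc*d')
printf '%s\n' "$matches"
# prints abd, abcd, abccd and abccccccd, one per line
```

"abxd" is rejected because "x" is not a "c", and "acd" because the "b" is not optional.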

Now, in light of this, read your regexp again:

s:.*<*mis>...

What you did by inserting the "*" after the "<" was to make the "<" optional. Instead of exactly one "<" you now match any number of "<", including zero (that makes it optional). But what you want is to match the "<", then anything that might precede a ":" including the ":" itself. To phrase it differently: a "<", then zero or more occurrences of "something, followed by a ":"", then what you already matched.

So, let us take your original regexp:

<mis>

and change it to the specification above. First: something, followed by a ":" - or, more robustly, any number of any character save for a ":", followed by a ":" is:

[^:]*:

Let us put that in:

<[^:]*:mis>

Next, we need "zero or more" occurrences of this whole group, and therefore we need to first group it to be able to address it with a single asterisk, hence:

<\([^:]*:\)*mis>

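Note that groups are numbered automatically, in the order of their opening "\(", so you may need to replace "\1", "\2", etc. in your replacement string with other numbers. Also note that the regexp now contains a ":". Your s command uses ":" as its delimiter, and an unescaped ":" inside the pattern would end the pattern early, so it is easiest to pick a delimiter that does not occur in the pattern, e.g. "#".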
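
Putting it together, here is a sketch of how the whole command could look, under a few assumptions of mine: "#" is used as the delimiter of s, the character class is tightened to [^/:>] so the prefix cannot run across a closing tag or a tag boundary, the values are captured with [^<]* instead of .*, and extract_vin is just an illustrative name:

```shell
# extract_vin FILE: hypothetical helper name.
# Groups: \1 dateTime, \2 optional mis prefix, \3 mis value,
#         \4 optional seg prefix, \5 seg value.
extract_vin() {
    sed -n '/pName="vin"/ s#.*dateTime="\([^"]*\)".*<\([^/:>]*:\)*mis>\([^<]*\)</.*<\([^/:>]*:\)*seg>\([^<]*\)</.*#\1,\3,\5#p' "$1"
}
```

Usage: extract_vin temp.txt. This still expects mis to come before seg on the line; a line with the elements in the opposite order is silently skipped.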

On a side note: you don't need the grep at all because sed can do that itself:

Change:

 grep 'pName="vin'  temp.txt | sed -n 's:.*<mis>\(.*\)</mis>.*<seg>\(.*\)</seg>.*:\1\,\2:p'

to

sed -n '/pName="vin/ s:.*<mis>\(.*\)</mis>.*<seg>\(.*\)</seg>.*:\1\,\2:p' temp.txt

I hope this helps.

bakunin

Thank you bakunin for such a great explanation....This is really going to help me in learning shell.

One thing: if I use

<\([^:]*:\)*mis>

then the result will exclude the lines which don't have a colon (<mis>11111111</mis>). So can you help me get both results, and show how I can use this in the command?

I was trying this, but I am not sure how to use it with a group:

sed -n 's/<[^:>]*:mis>/,/g; s/[,]\{1,\}/,/gp' temp2.txt

Actually: no. Analyse the regexp carefully, i will put in a few extra spaces for emphasis:

<    \([^:]*:\)*  mis>

So you have: < , which is simply a (fixed string of one) character and at the end mis> , which is also a fixed string. This matches <{something}mis> , yes?

Now, let us get to the interesting part, the middle expression, which will match the {something} : inside the grouping we have [^:]*: . That means: zero or more non-":" characters, followed by a ":". So, it would match (list of examples):

:
t:
bla:
something:
a list of words:
etc....

Now, as we have grouped that and put an asterisk at the end, we can have OR can not have such an expression before the "mis". Hence we match (putting it all together):

<mis>              # in this case the expression \([^:]*:\)* occurs simply zero times - not at all
<t:mis>            # [^:]* covers the "t", the ":" covers the ":" and the whole \([^:]*:\) occurs one time
<bla-foo:mis>      # [^:]* covers the "bla-foo", the ":" covers the ":" and the whole \([^:]*:\) occurs one time
<bla:foo:mis>      # [^:]* covers the "bla" (first) and "foo" (second), the ":" covers the ":" and the whole \([^:]*:\) occurs two times

You see from the last example that there is still room for making the regexp more specific, but i didn't want to confuse you with too much information at once. Maybe this is all the precision you need anyway - only you know your data and can know that. If you need the additional precision to not match the last example, you can do that:

<\([^:]*:\)\{0,1\}mis>

The \{0,1\} works similarly to the asterisk, but instead of "zero or more" occurrences it allows only zero or one occurrence. In general \{m,n\} means "at least m and at most n occurrences", so you can change the numbers to require other ranges.
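
To see the difference between the two variants, you can count how many of the four sample tags from above each regexp matches. A small sketch using grep -c (the variable names are arbitrary):

```shell
tags='<mis>
<t:mis>
<bla-foo:mis>
<bla:foo:mis>'

# Any number of "prefix:" groups: all four sample tags match.
star=$(printf '%s\n' "$tags" | grep -c '<\([^:]*:\)*mis>')
# At most one "prefix:" group: <bla:foo:mis> no longer matches.
opt=$(printf '%s\n' "$tags" | grep -c '<\([^:]*:\)\{0,1\}mis>')
printf 'star: %s, at most one: %s\n' "$star" "$opt"
# prints: star: 4, at most one: 3
```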

I hope this helps.

bakunin

Thank you so much......Such a nice explanation and i am really learning from these detailed explanation.

How can this regex be used in the sed command to fetch the values? Earlier I was using the command below, and when I change it to use the new regex I get syntax errors.

sed -n '/pName="vin/ s:.*<mis>\(.*\)</mis>..*:\1:p' temp2.txt

Also, one more thing: if a single line contains the same element twice, or an unknown number of times, how can I get all its values separated by some delimiter?

<l:ev dateTime="2019-06-14 09:30" pName="vin"> <val:mis>222</val:mis><val:seg>sit</val:seg> <val:mis>333</val:mis> </l:ev>
output
222-333

How about perl?

perl -lne 'm#pName="vin"# and m#dateTime="(.*?)".*?<(.*?:)?mis>(.*?)</# and print "$1 $3"' temp.txt

The .*? is a minimum match, as opposed to the .* greedy match.
The m (match) operator lets you set the delimiter, here # . /expr/ is default i.e. like m/expr/ .
grouping works with ( ) in ERE style (like egrep or grep -E). Each group can be referred as $1 $2 ...
(.*?:)? is an optional prefix.

This is really good but i am not able to get this.

If a single line contains the same element twice or more (the number is not known), how can I get all its values separated by some delimiter?
for ex:

<l:ev dateTime="2019-06-14 09:30" pName="vin"> <nim:mis>222</nim:mis><nim:seg>sit</nim:seg> </l:ev>
<l:ev dateTime="2019-06-14 09:30" pName="vin"> <val:mis>4444</val:mis><val:seg>sit</val:seg> <val:mis>333</val:mis> </l:ev>

output using perl command:

perl -lne 'm#pName="vin"# and m#dateTime="(.*?)".*?<(.*?:)?mis>(.*?)</# and print "$1 $3"' temp2.txt
2019-06-14 09:30 222
2019-06-14 09:30 4444

Expected Output:

2019-06-14 09:30,222
2019-06-14 09:30,4444 & 333

It's all doable in perl.

perl -lne 'm#pName="vin"# and @dt=m#dateTime="(.*?)"# and @mis=m#mis>(.*?)</.*?mis>#g and print $dt[0], ",", join(" & ", @mis)' temp2.txt

One can store the output of a match in an array. And when printing the array let the join function insert separators.

To also fulfill your earlier requirement, here is an addon that also joins multiple <seg> values:

perl -lne 'm#pName="vin"# and @dt=m#dateTime="(.*?)"# and @mis=m#mis>(.*?)</.*?mis>#g and @seg=m#seg>(.*?)</.*?seg>#g and print $dt[0], ",", join(" & ", @mis), ",", join(" & ", @seg)' temp2.txt

Now I think you should be able to further extend this yourself...

Many thanks for this.

I was trying to extract values from this payload as well using perl, but I am not able to get the expected result.

<l:ev dateTime="2019-06-14 09:30" pName="vin"> <n:mis>222</n:mis><n:seg>sit</n:seg> <Response><a><ErrorD>Err-116</ErrorD></a><a><ErrorD>Err-117</ErrorD></a>><a><ErrorD>Err-116</ErrorD></a></Response> <Logging><a><ErrorD>Err-116</ErrorD></a><a><ErrorD>Err-117</ErrorD></a>><a><ErrorD>Err-118</ErrorD></a></Logging></l:ev>

Here I expect each ErrorD value to appear only once, and only from the Response section, not from the Logging section.
Expected Output:
222,sit,Err-116 & Err-117 & Err-118

For this I was trying these, but they do not seem to be correct.

perl -lne 'm#pName="vin"# and @dt=m#dateTime="(.*?)"# and @mis=m#mis>(.*?)</.*?mis>#g and @seg=m#seg>(.*?)</.*?seg>#g and  @MError=m#Response.*?>.*?<.*?ErrorD.*?>(.*?)</.*?ErrorD>.*?</.*?Response>#g  and print $dt[0], ",", join(" & ", @mis), ",", join(" & ", @seg)', ",", join(" & ", @MError)' temp2.txt

perl -lne 'm#pName="vin"# and @dt=m#dateTime="(.*?)"# and @mis=m#mis>(.*?)</.*?mis>#g and @seg=m#seg>(.*?)</.*?seg>#g and  @MError=m#ErrorD.*?>(.*?)</.*?ErrorD>#g  and print $dt[0], ",", join(" & ", @mis), ",", join(" & ", @seg)', ",", join(" & ", @MError)' temp2.txt

Thanks,
Nitish