Grab text after pattern on the same line

SkySmart · March 19, 2017, 6:10am

data:

8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+
8nwR15UzfeZafaf2bGr8akx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjp
lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX
8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPAC
Ib5fafafEU24f3EOOjp

I have a huge data file that contains text similar to the above. The text is one very, very long line that wraps around several times.

What I want to do is grab only one specific piece of text.

From the above, I want a command that will grab the following text:

BUNLOES="343433434343mgg3383028383983mgg383827173494"

and will turn the "mgg" text into new lines, so that the final output looks like this:

343433434343
3383028383983
383827173494

I'm looking for something efficient and portable, so naturally that would be sed. I found the following:

sed -n -e 's/^.*\(BUNLOES=.*#BGLOFa\)/\1/p' data.txt

This sed command tries to grab everything from BUNLOES up to #BGLOFa.

The command almost works, but not exactly. I'm trying to do everything with one command instead of piping to awk.
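One likely reason the command only almost works: its pattern stops at #BGLOFa, so whatever follows on the line is never matched and survives the substitution, and the BUNLOES= prefix and the quotes are kept as well. A minimal sketch of narrowing the capture to the quoted value is below. The `sample` variable and the `extract_value` name are made up for illustration, and the approach assumes the value itself contains no double quotes.

```shell
# Hypothetical one-line sample modelled on the data above.
sample='lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX'

# Capture only what sits between the quotes after BUNLOES=,
# and match the rest of the line (.*$) so that it is discarded too.
extract_value() {
    sed -n 's/^.*BUNLOES="\([^"]*\)".*$/\1/p'
}

printf '%s\n' "$sample" | extract_value
```

This still leaves the mgg separators in place; splitting them into lines is a separate step.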

RudiC · March 19, 2017, 6:18am

How about

sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\n/g;p}' file
343433434343
3383028383983
383827173494

SkySmart · March 19, 2017, 6:33am

I get this error:

sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\n/g;p}' data.txt
sed: 1: "/^.*BUNLOES="/{s///; ...": extra characters at the end of p command

RudiC · March 19, 2017, 6:38am

Try a semicolon after the p command.

RavinderSingh13 · March 19, 2017, 8:05am

Hello SkySmart,

Could you please try the following and let me know if it helps.
1st code:

awk '{match($0,/BUNLOES.*[^"]/);VAL=substr($0,RSTART,RLENGTH);if(VAL){gsub(/BUNLOES="|"  #.*/,"",VAL);gsub(/mgg/,RS,VAL);print VAL}}'  Input_file
OR
awk '{
        match($0,/BUNLOES.*[^"]/);
        VAL=substr($0,RSTART,RLENGTH);
        if(VAL){
                gsub(/BUNLOES="|"  #.*/,"",VAL);
                num=split(VAL, a,"mgg");
                for(i=1;i<=num;i++){
                                        print a[i]
                                   }
               }
     }
    '   Input_file

2nd code:

awk '{match($0,/BUNLOES.*[^"]/);VAL=substr($0,RSTART,RLENGTH);if(VAL){gsub(/BUNLOES="|"  #.*/,"",VAL);num=split(VAL, a,"mgg");for(i=1;i<=num;i++){print a[i]}}}'   Input_file
OR
awk '{
        match($0,/BUNLOES.*[^"]/);
        VAL=substr($0,RSTART,RLENGTH);
        if(VAL){
                gsub(/BUNLOES="|"  #.*/,"",VAL);
                gsub(/mgg/,RS,VAL);
                print VAL
               }
     }
    '   Input_file

Thanks,
R. Singh

jim_mcnamara · March 19, 2017, 10:18am

Is bash portable, under your definition?

 regex='BUNLOES=\"([0-9]+)mgg([0-9]+)mgg([0-9]+)\"'
 while IFS= read -r line
 do
      if [[ $line =~ $regex ]]; then
         echo "${BASH_REMATCH[1]}"
         echo "${BASH_REMATCH[2]}"
         echo "${BASH_REMATCH[3]}"
      fi
 done < filename
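
The loop above is bash-only and hard-codes exactly three numeric fields. Where bash is not available, roughly the same idea can be sketched with POSIX parameter expansion alone, for any number of mgg-separated fields. `split_bunloes` is a name invented here, its variables are global because POSIX sh has no `local`, and it assumes the value sits between double quotes with no quotes inside.

```shell
# split_bunloes: hypothetical helper; reads lines on stdin and prints each
# mgg-separated field of the first BUNLOES="..." value on its own line.
split_bunloes() {
    # "|| [ -n ... ]" also handles a final line without a trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *'BUNLOES="'*) ;;      # marker present, carry on below
            *) continue ;;         # no marker, skip this line
        esac
        val=${line#*'BUNLOES="'}   # drop everything up to the opening quote
        val=${val%%'"'*}           # drop the closing quote and what follows
        while :; do
            case $val in
                *mgg*)
                    printf '%s\n' "${val%%mgg*}"   # text before the first mgg
                    val=${val#*mgg}                # continue after that mgg
                    ;;
                *)
                    printf '%s\n' "$val"           # last (or only) field
                    break
                    ;;
            esac
        done
    done
}

split_bunloes <<'EOF'
lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX
EOF
```

Reading a very long line with a shell loop may be slow compared with sed or awk, so this is more of a fallback than a recommendation.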

Scrutinizer · March 19, 2017, 11:30am

awk -F\" 'gsub(/mgg/,RS,$2){print $2}' file

or

awk -F\" '/BUNLOES/ && gsub(/mgg/,RS,$2){print $2}' file

---
Note: On Solaris use /usr/xpg4/bin/awk rather than awk
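
For readers unfamiliar with the idiom: with the double quote as field separator, `$2` is whatever lies between the first pair of quotes on a line, and `gsub` returns the number of replacements it made, so it doubles as the condition. A small demonstration on made-up input follows; the two sample lines and the `quoted_fields` name are assumptions, not the real data.

```shell
# quoted_fields: hypothetical wrapper around the second command above.
quoted_fields() {
    awk -F'"' '
        # Only lines mentioning BUNLOES are considered. gsub() returns how
        # many "mgg" it replaced in field 2, so a line with no replacement
        # yields 0 (false) and is skipped.
        /BUNLOES/ && gsub(/mgg/, RS, $2) { print $2 }
    '
}

# Two hypothetical input lines; only the second carries a quoted value.
printf '%s\n' \
    'PACIb5fafafEU24f3EOOjp' \
    'lVpOoMLXJ   BUNLOES="11mgg22mgg33"  #BGLOFa' |
    quoted_fields
```

One consequence of using `gsub` as the condition is that a value with a single number and no mgg at all prints nothing.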

RavinderSingh13 · March 19, 2017, 11:41am

Hello SkySmart,

Taking some inspiration from Scrutinizer's solution, this could help too.

awk '{gsub(/.*   BUNLOES="|"  #.*/,"");gsub(/mgg/,ORS);print}' RS=""   Input_file

If you also want to check that both strings, BUNLOES and mgg, are present as in the sample Input_file, the following adds that check.

awk '{n=gsub(/.*   BUNLOES="|"  #.*/,"");n+=gsub(/mgg/,ORS);if(n==4){print}}' RS=""   Input_file

Thanks,
R. Singh

SkySmart · March 19, 2017, 12:48pm

This seems to work, but it doesn't put the numbers on new lines. I get them all on one line:

time sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\n/g;p;}' data.txt

'269118084457'n'3626086549312632'

I've tried replacing the

\n

with

\\n

but that still didn't work.

RudiC · March 19, 2017, 1:31pm

We can't see your terminal, nor do we know the versions of your OS, shell, or sed. So be creative. Look into your sed's man or info page on how to escape control characters (mine accepts \n for a <new line> char). Try bash's (if available) $'string' mechanism. Try entering CTRL-V CTRL-J at the command line, but don't forget to escape this with a backslash.
If nothing helps, use one of the other proposals.

drl · March 19, 2017, 2:23pm

Hi, SkySmart .

When you ask for portability, I think it's important to tell us the class of systems on which you need to run this.

For example, some versions of sed don't seem to run the suggested code. So what is the earliest version of sed that you have available, and so on?

Best wishes ... cheers, drl

Scrutinizer · March 19, 2017, 2:49pm

Note: the use of \n in the replacement part of sed's substitute command is a GNU extension.
The absence of a semicolon between p and } is also a GNU extension. Accepting a semicolon there, as in p;}, is a permitted but not required extension to the POSIX specification, so even that may not be supported by all implementations.
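
Putting both points together, a form that avoids these extensions writes every command on its own line and uses a backslash followed by a literal newline in the replacement. This is only a sketch: `extract_fields` is a made-up name, and the sample line is a shortened copy of the data posted earlier.

```shell
# extract_fields: hypothetical wrapper; reads the data on stdin.
extract_fields() {
    # Commands inside { } are separated by newlines rather than semicolons,
    # and "\" followed by a real newline inserts a newline in the replacement.
    sed -n '/^.*BUNLOES="/{
s///
s/".*$//
s/mgg/\
/g
p
}'
}

printf '%s\n' 'lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFa' |
    extract_fields
```

The empty pattern in `s///` reuses the address regex, so the first substitution deletes everything up to and including the opening quote.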

RudiC · March 19, 2017, 2:58pm

This (escaped explicit <NL> char) works on FreeBSD (which insists on the ; following p ):

sed -n '/^.*BUNLOES="/{s///; s/".*$//; s/mgg/\
/g; p; }' file
343433434343
3383028383983
383827173494

durden_tyler · March 19, 2017, 3:31pm

RudiC's solution above works on my Solaris as well, which I think has non-GNU sed.

$ 
$ uname -a
SunOS solaris11-3 5.11 11.3 i86pc i386 i86pc
$ 
$ cat -n data.txt
     1    8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
     2    PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+
     3    8nwR15UzfeZafaf2bGr8akx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjp
     4    lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX
     5    8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
     6    PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPAC
     7    Ib5fafafEU24f3EOOjp
$ 
$ /usr/bin/sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\
/g;p;}' data.txt
343433434343
3383028383983
383827173494
$ 
$ /usr/xpg4/bin/sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\
/g;p;}' data.txt
343433434343
3383028383983
383827173494
$ 
$

I agree with drl that the OP should take stock of the systems and tools (s)he will be working with, in order to determine the portability of the solution.

SkySmart · March 19, 2017, 8:10pm

This is the command I ended up using, and it works across all necessary platforms:

export LC_ALL=C ; time tr '`' '\n' < data.txt | sed -n '/BUNLOES=/p' | awk -F"BUNLOES=" '{print $2}' | awk '{print $1}' | sed -e 's_"__g' -e 's_ __g' -e '/^$/d' 2>/dev/null | awk '{gsub("mgg","\n");printf"%s",$0}' | sed -e "s/'/ /g" -e 's~ ~~g' | sed '/^$/d'

Can someone please help me combine this into one command, if possible?

The content of data.txt is one very, very long line:

8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+ 8nwR15UzfeZafaf2bGr8akx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjp lVpOoMLXJ   BUNLOES="'269118084457'mgg'3626086549312632'"mgg1659344516312337mgg1659344516304657mgg5851430858050896mgg2968137013313563  #BGLOFaakx6VwxBX+NafafxJMWX 8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPAC Ib5fafafEU24f3EOOjp8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxfafakJMWX8iW5i

When I run the command from this post, I get this:

269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563

Now I just want to combine all the commands into one.

durden_tyler · March 19, 2017, 9:45pm

If you have Perl on all your platforms, then here's one way:

$ 
$ perl -lne '/BUNLOES=(.*?)\s+/ and do{$x=$1; $x=~s/[\x{27}" ]//g; $x=~s/mgg/\n/g; print $x}' data.txt
269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563
$ 
$

Aia · March 19, 2017, 10:33pm

perl -nle 'BEGIN{$,="\n"}/BUNLOES=(.+?)\s/ and print $1=~/(\d+)/g' skysmart.file
269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563

or

perl -nle '/BUNLOES=(.+?)\s/ and print join "\n", $1=~/(\d+)/g' skysmart.file

or

perl -nle 'map{s/\D+/\n/g and print} /BUNLOES\D+(.+?)\s/' skysmart.file

SkySmart · March 20, 2017, 12:04am

Perl won't be present on some of the systems I have to run the command on. A lot of them are Docker systems with the bare minimum.

Scrutinizer · March 20, 2017, 12:32am

Try :

awk 'p==s{gsub(qr,x); gsub(fs,ORS,$1); print $1}{p=$NF}' s=BUNLOES fs=mgg qr="'|\"" RS== file
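
How this works, as far as can be read from the command: with RS set to =, the single long line is cut into records at every equals sign, p remembers the last field of the previous record, and when that field is BUNLOES the current record starts with the wanted value. A more verbose sketch of the same idea is below. The names `expand_bunloes`, `key`, `sep`, `quotes` and `prev` are invented here, and it assumes the value contains no blanks and that the quotes are the only characters to strip.

```shell
# expand_bunloes: hypothetical wrapper; reads the data on stdin.
expand_bunloes() {
    awk -v RS='=' -v key='BUNLOES' -v sep='mgg' -v quotes="['\"]" '
        # prev holds the last field of the previous record, that is, the
        # word that stood immediately before the "=" that ended it.
        prev == key {
            gsub(quotes, "")       # strip single and double quotes from $0
            gsub(sep, ORS, $1)     # turn each separator into a newline
            print $1               # $1 is the value up to the first blank
        }
        { prev = $NF }
    '
}

expand_bunloes <<'EOF'
PACIb5 lVpOoMLXJ   BUNLOES="'269118084457'mgg'3626086549312632'"mgg1659344516312337  #BGLOFaakx Ib5faf
EOF
```

Because records are split at every =, any other equals sign in the data simply starts a new record and is ignored unless the word before it is BUNLOES.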

--
@SkySmart: you just changed the specification in post #15, by introducing random single quotes and double quotes into the data, plus by mentioning that your data is just one very long line.

That means that all the people who reacted earlier, trying to help you, were not using the right data and therefore were unable to produce adequate code and were basically working for no purpose.

Please do not do that. Have the complete data specification ready when you create the thread.

Aia · March 20, 2017, 12:36am

Out of curiosity, would any of these work?

grep -Eo '[0-9]{4,}' skysmart.file

grep -Po '\d{4,}' skysmart.file

Output:

269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563
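
Both commands print every run of four or more digits anywhere in the file, so they give the right answer only while no other part of the data contains such a run. A hedged sketch that first narrows the match to the BUNLOES field is below. `bunloes_numbers` is a made-up name, the value is assumed to contain no blanks, and `-o` is not part of POSIX grep, although GNU and BSD grep provide it.

```shell
# bunloes_numbers: hypothetical helper; reads the data on stdin.
bunloes_numbers() {
    # The first grep keeps only "BUNLOES=" plus the non-blank text after it;
    # the second pulls each run of digits out of that fragment.
    grep -o 'BUNLOES=[^ ]*' | grep -Eo '[0-9]+'
}

# The sample has a decoy digit run (98765) outside the BUNLOES field.
printf '%s\n' "x98765y   BUNLOES=\"'11'mgg'22'\"mgg33  #BGLOFa 4444z" |
    bunloes_numbers
```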