Grab text after pattern on the same line

data:

8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+
8nwR15UzfeZafaf2bGr8akx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjp
lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX
8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPAC
Ib5fafafEU24f3EOOjp

I have a huge data file that contains text similar to the above. The text is one very, very long line, and it wraps around several times.

What I want to do is grab only a specific piece of text.

From the above, I want a command that will grab the following text:

BUNLOES="343433434343mgg3383028383983mgg383827173494"

and turn each "mgg" into a newline, so that the final output looks like this:

343433434343
3383028383983
383827173494

I'm looking for something efficient and portable, so naturally that would be sed. I found the following:

sed -n -e 's/^.*\(BUNLOES=.*#BGLOFa\)/\1/p' data.txt

This sed command tries to grab everything from BUNLOES up to #BGLOFa.

The command almost works, but not exactly. I'm trying to do everything with one command instead of piping to awk.
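
A likely reason the attempt only "almost" works: the pattern stops at #BGLOFa, so whatever follows it on the line is never matched and is printed along with the capture. The sketch below compares the original command with a variant that captures only the quoted value and consumes the rest of the line. The sample variable is a shortened stand-in for the relevant part of data.txt, not the real file.

```shell
# Shortened stand-in for the line in data.txt that holds the field.
sample='lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX'

# Original attempt: the text after #BGLOFa is not part of the match, so it is kept.
before=$(printf '%s\n' "$sample" | sed -n -e 's/^.*\(BUNLOES=.*#BGLOFa\)/\1/p')

# Variant: capture only what is between the quotes and swallow the tail with .*$
after=$(printf '%s\n' "$sample" | sed -n -e 's/^.*BUNLOES="\([^"]*\)".*$/\1/p')

printf '%s\n' "$before" "$after"
```

The second form still leaves the mgg separators in place; turning them into newlines is a separate substitution.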

How about

sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\n/g;p}' file
343433434343
3383028383983
383827173494

I get this error:

sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\n/g;p}' data.txt
sed: 1: "/^.*BUNLOES="/{s///; ...": extra characters at the end of p command

Try a semicolon after the p command.


Hello SkySmart,

Could you please try the following and let me know if it helps?
1st code:

awk '{match($0,/BUNLOES.*[^"]/);VAL=substr($0,RSTART,RLENGTH);if(VAL){gsub(/BUNLOES="|"  #.*/,"",VAL);gsub(/mgg/,RS,VAL);print VAL}}'  Input_file
OR
awk '{
        match($0,/BUNLOES.*[^"]/);
        VAL=substr($0,RSTART,RLENGTH);
        if(VAL){
                gsub(/BUNLOES="|"  #.*/,"",VAL);
                num=split(VAL, a,"mgg");
                for(i=1;i<=num;i++){
                        print a[i]
                }
               }
     }
    '   Input_file

2nd code:

awk '{match($0,/BUNLOES.*[^"]/);VAL=substr($0,RSTART,RLENGTH);if(VAL){gsub(/BUNLOES="|"  #.*/,"",VAL);num=split(VAL, a,"mgg");for(i=1;i<=num;i++){print a[i]}}}'   Input_file
OR
awk '{
        match($0,/BUNLOES.*[^"]/);
        VAL=substr($0,RSTART,RLENGTH);
        if(VAL){
                gsub(/BUNLOES="|"  #.*/,"",VAL);
                gsub(/mgg/,RS,VAL);
                print VAL
               }
     }
    '   Input_file

Thanks,
R. Singh


is bash portable, under your definition?

 regex='BUNLOES=\"([0-9]+)mgg([0-9]+)mgg([0-9]+)\"'
 while read line
 do
      if [[ $line =~ $regex ]]; then
         echo "${BASH_REMATCH[1]}"
         echo "${BASH_REMATCH[2]}"
         echo "${BASH_REMATCH[3]}"
      fi
 done < filename

awk -F\" 'gsub(/mgg/,RS,$2){print $2}' file

or

awk -F\" '/BUNLOES/ && gsub(/mgg/,RS,$2){print $2}' file

---
Note: On Solaris use /usr/xpg4/bin/awk rather than awk


Hello SkySmart,

Taking some inspiration from S, this could be helpful too:

awk '{gsub(/.*   BUNLOES="|"  #.*/,"");gsub(/mgg/,ORS);print}' RS=""   Input_file

If you also want to verify that both BUNLOES and mgg are present, as in the sample Input_file, add a check on the substitution count (two for the surrounding text plus two for mgg gives n==4):

awk '{n=gsub(/.*   BUNLOES="|"  #.*/,"");n+=gsub(/mgg/,ORS);if(n==4){print}}' RS=""   Input_file

Thanks,
R. Singh


This seems to work, but it doesn't put the numbers on new lines. I get them all on one line:

time sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\n/g;p;}' data.txt
'269118084457'n'3626086549312632'

I've tried replacing the

\n 

with

\\n

but that still didn't work.

We can't see your terminal, nor do we know the versions of your OS, shell, or sed. So be creative: look in your sed's man or info page for how to escape control characters (mine accepts \n for a <newline> char). Try bash's (if available) $'string' mechanism. Try entering CTRL-V CTRL-J at the command line, but don't forget to escape it with a backslash.
If nothing helps, use one of the other proposals.
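
One more way to get a literal newline into the replacement without depending on \n or $'string' support is to keep the newline in a shell variable and splice it into the script. This is only a sketch: nl and sample are names made up for the illustration, and sample is a shortened stand-in for data.txt.

```shell
# Shortened stand-in for the line in data.txt that holds the field.
sample='lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX'

# A variable holding exactly one newline character.
nl='
'

# Close the single-quoted script after the backslash, splice in the newline,
# then reopen the quotes; sed sees backslash + newline in the replacement.
out=$(printf '%s\n' "$sample" |
    sed -n '/^.*BUNLOES="/{s///;s/".*$//;s/mgg/\'"$nl"'/g;p;}')

printf '%s\n' "$out"
```

The script sed receives is the same as one typed with an escaped literal newline, so it should behave the same wherever that form is accepted.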


Hi, SkySmart .

When you ask for portability, I think it's important to tell us the class of systems on which you need to run this.

For example, some versions of sed don't seem to run the suggested code. So what would be the earliest version of sed that you have to support, etc., etc.?

Best wishes ... cheers, drl


Note: the use of \n in the replacement part of sed's substitute command is a GNU extension.
The absence of a semicolon between p and } is a GNU extension. The use of a semicolon, as in p;}, is a permitted, but not required, extension to the POSIX specification, so even that may not be supported by all implementations.
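
To sidestep both points, the script can be written with one command per line and the closing brace on its own line, which relies only on newline-separated commands and an escaped literal newline in the replacement. A minimal sketch, with sample again being a shortened stand-in for data.txt:

```shell
# Shortened stand-in for the line in data.txt that holds the field.
sample='lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX'

# One sed command per line: no semicolons, no \n escape in the replacement.
out=$(printf '%s\n' "$sample" | sed -n '/^.*BUNLOES="/{
s///
s/".*$//
s/mgg/\
/g
p
}')

printf '%s\n' "$out"
```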


This (escaped explicit <NL> char) works on FreeBSD (which insists on the ; following p):

sed -n '/^.*BUNLOES="/{s///; s/".*$//; s/mgg/\
/g; p; }' file
343433434343
3383028383983
383827173494

RudiC's solution above works on my Solaris as well, which I think has non-GNU sed.

$ 
$ uname -a
SunOS solaris11-3 5.11 11.3 i86pc i386 i86pc
$ 
$ cat -n data.txt
     1    8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
     2    PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+
     3    8nwR15UzfeZafaf2bGr8akx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjp
     4    lVpOoMLXJ   BUNLOES="343433434343mgg3383028383983mgg383827173494"  #BGLOFaakx6VwxBX+NafafxJMWX
     5    8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i
     6    PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPAC
     7    Ib5fafafEU24f3EOOjp
$ 
$ /usr/bin/sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\
/g;p;}' data.txt
343433434343
3383028383983
383827173494
$ 
$ /usr/xpg4/bin/sed -n '/^.*BUNLOES="/{s///;s/".*$//; s/mgg/\
/g;p;}' data.txt
343433434343
3383028383983
383827173494
$ 
$ 

I agree with drl that the OP should take stock of the systems and tools (s)he will be working with, in order to determine the portability of the solution.


This is the command I ended up using, and it works across all necessary platforms:

export LC_ALL=C ; time tr '`' '\n' < data.txt | sed -n '/BUNLOES=/p' | awk -F"BUNLOES=" '{print $2}' | awk '{print $1}' | sed -e 's_"__g' -e 's_ __g' -e '/^$/d' 2>/dev/null | awk '{gsub("mgg","\n");printf"%s",$0}' | sed -e "s/'/ /g" -e 's~ ~~g' | sed '/^$/d'

Can someone please help me combine this into one command, if possible?

The content of data.txt is one very, very long line:

8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+ 8nwR15UzfeZafaf2bGr8akx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjp lVpOoMLXJ   BUNLOES="'269118084457'mgg'3626086549312632'"mgg1659344516312337mgg1659344516304657mgg5851430858050896mgg2968137013313563  #BGLOFaakx6VwxBX+NafafxJMWX 8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPAC Ib5fafafEU24f3EOOjp8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5iPACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxJMWX8iW5i PACIb5fafafEU24f3EOOjpakx6VwxBX+NafafxfafakJMWX8iW5i

When I run the command from this post, I get this:

269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563

Now I just want to combine all the commands into one.
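
One possible way to fold the pipeline into a single awk call is sketched below. It assumes the value after BUNLOES= contains no spaces, that every wanted item is purely numeric (so anything that is not a digit, such as the stray quotes, can be dropped), and that the local awk can hold the long line in one record. The sample variable is a shortened stand-in for data.txt, and part is just an illustrative array name.

```shell
# Shortened stand-in for the single long line in data.txt.
sample="8iW5iPACIb5 lVpOoMLXJ   BUNLOES=\"'269118084457'mgg'3626086549312632'\"mgg1659344516312337mgg2968137013313563  #BGLOFaakx6VwxBX+NafafxJMWX 8iW5iPACIb5"

out=$(printf '%s\n' "$sample" | awk '
match($0, /BUNLOES=[^ ]*/) {
    # Skip the 8 characters of "BUNLOES=" and keep the rest of the match.
    v = substr($0, RSTART + 8, RLENGTH - 8)
    n = split(v, part, "mgg")
    for (i = 1; i <= n; i++) {
        gsub(/[^0-9]/, "", part[i])      # drop quotes; assumes numeric values
        if (part[i] != "") print part[i]
    }
}')

printf '%s\n' "$out"
```

Against the real file the same program would be run as awk '...' data.txt rather than through printf.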

If you have Perl on all your platforms, then here's one way:

$ 
$ perl -lne '/BUNLOES=(.*?)\s+/ and do{$x=$1; $x=~s/[\x{27}" ]//g; $x=~s/mgg/\n/g; print $x}' data.txt
269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563
$ 
$ 

perl -nle 'BEGIN{$,="\n"}/BUNLOES=(.+?)\s/ and print $1=~/(\d+)/g' skysmart.file
269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563

or

perl -nle '/BUNLOES=(.+?)\s/ and print join "\n", $1=~/(\d+)/g' skysmart.file

or

perl -nle 'map{s/\D+/\n/g and print} /BUNLOES\D+(.+?)\s/' skysmart.file

Perl won't be present on some of the systems I have to run the command on. A lot of them are Docker systems with the bare minimum.

Try :

awk 'p==s{gsub(qr,x); gsub(fs,ORS,$1); print $1}{p=$NF}' s=BUNLOES fs=mgg qr="'|\"" RS== file
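
For readers unpicking the one-liner, here is an annotated sketch of the same idea. It swaps the trailing var=value arguments for -v options and sets RS in a BEGIN block, which should behave the same here; sample is a shortened stand-in for data.txt.

```shell
# Shortened stand-in for the single long line in data.txt.
sample="8iW5iPACIb5 lVpOoMLXJ   BUNLOES=\"'269118084457'mgg'3626086549312632'\"mgg1659344516312337mgg2968137013313563  #BGLOFaakx6VwxBX+NafafxJMWX 8iW5iPACIb5"

out=$(printf '%s\n' "$sample" | awk -v s=BUNLOES -v fs=mgg -v qr="'|\"" '
BEGIN { RS = "=" }          # split the input into records at each "="
p == s {                    # the previous record ended with the word BUNLOES
    gsub(qr, "")            # strip single and double quotes from the record
    gsub(fs, ORS, $1)       # turn each mgg in the first field into a newline
    print $1
}
{ p = $NF }                 # remember the last field for the next record
')

printf '%s\n' "$out"
```

Because the record separator is "=", the approach depends on the text before the value containing no other "=" that is directly preceded by the word BUNLOES.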

--
@SkySmart: you just changed the specification in post #15, by introducing random single quotes and double quotes into the data, plus by mentioning that your data is just one very long line.

That means that all the people who reacted earlier, trying to help you, were not using the right data and therefore were unable to produce adequate code and were basically working for no purpose.

Please do not do that. Have the complete data specification ready when you create the thread.


Out of curiosity. Would any of these work?

grep -Eo '[0-9]{4,}' skysmart.file
grep -Po '\d{4,}' skysmart.file

Output:

269118084457
3626086549312632
1659344516312337
1659344516304657
5851430858050896
2968137013313563
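
A caveat on the grep idea: the pattern matches any run of four or more digits anywhere in the file, not just inside the BUNLOES field, and -o is not required by older POSIX specifications even though GNU, BSD and BusyBox grep provide it (-P is, as far as I know, specific to GNU grep). The sketch below narrows the search to the field first. The x98765x token in sample is a made-up decoy added only to show the difference.

```shell
# Stand-in data; "x98765x" is a hypothetical decoy outside the field.
sample="x98765x lVpOoMLXJ   BUNLOES=\"'269118084457'mgg'3626086549312632'\"mgg1659344516312337  #BGLOFaakx6VwxBX+NafafxJMWX"

# Whole-file match: also picks up the decoy digits.
wide=$(printf '%s\n' "$sample" | grep -Eo '[0-9]{4,}')

# Cut out the BUNLOES field first, then extract the digit runs from it.
narrow=$(printf '%s\n' "$sample" | grep -o 'BUNLOES=[^ ]*' | grep -Eo '[0-9]+')

printf 'wide:\n%s\nnarrow:\n%s\n' "$wide" "$narrow"
```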