Split content based on keywords

Jairaj · March 12, 2019, 3:38am

I need to split the file contents into multiple rows based on patterns.

Sample:
Input:

ABC101testXYZ102UKMNO1092testing
ABC999testKMNValid

Output:

ABC101test
XYZ102U
KMN1092testing
ABC999test
KMNValid

Here ABC, XYZ and KMN are the patterns.

nezabudka · March 12, 2019, 6:45am

My last example was not entirely correct:

sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//' file

--- Post updated at 13:45 ---

And the first one is better corrected like this:

sed -r 's/\B(ABC|XYZ|KMN)/\n&/g' file

Jairaj · March 12, 2019, 8:16pm

It's working. Thanks!

Can you tell me how this statement (command) works?

RavinderSingh13 · March 12, 2019, 10:46pm

Hello Jairaj,

In awk, could you please try the following:

awk '{gsub("ABC|XYZ|MNO|KMN",ORS"&");sub(/^\n/,"")} 1'  Input_file

Thanks,
R. Singh

Jairaj · March 12, 2019, 11:28pm

It's working. Thanks!

Can you tell me how this statement (command) works?

nezabudka · March 12, 2019, 11:44pm

Hi Jairaj,
I'm sorry, I have problems with English, I can not.
Enter this command in the terminal

LESS=+/" *s/regexp/replacement/" man sed

bakunin · March 14, 2019, 6:57am

If i may try?

sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//' file

This sed program consists of two statements, which are applied one after the other to every line:

s/ABC\|XYZ\|KMN/\n&/g
s/^\n//

Let us start with the second one as it is easier: it is a "replacement" command and replaces one expression with another. Actually the "s" stands for "substitute":

s/<something to match>/<something that replaces what was matched>/

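What does it replace? It replaces a start-of-line ( ^ ) followed by a newline character ( \n ) with nothing. The start-of-line is not really a character but a position, so effectively it deletes a newline character if that newline is the very first thing on the line; all other newline characters stay untouched. Such a leading newline appears whenever the first statement has matched a keyword right at the beginning of the line.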

The first line is a bit more complicated: basically it is a replacement command too and works the same way as the second line. Now, what does it replace?

/ABC\|XYZ\|KMN/

This matches one of the strings separated by the escaped pipe-characters, so effectively it matches either "ABC" or "XYZ" or "KMN". Now, what will these strings be replaced with?

/\n&/

The first part is \n, which means a newline character. The second part, &, stands for whatever has been matched by the pattern. As i said, the pattern will match one of three different strings. The string that was actually matched is put here, so effectively the statement replaces the string with itself plus a newline character up front.
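
To see "&" on its own, here is a small sketch with a made-up pattern (a run of digits instead of the keywords) that should behave the same in any sed. The variable name is only for illustration:

```shell
# "&" in the replacement stands for whatever the pattern matched,
# so each run of digits is wrapped in angle brackets.
wrapped=$(printf 'ABC101testXYZ102U\n' | sed 's/[0-9][0-9]*/<&>/g')
printf '%s\n' "$wrapped"    # ABC<101>testXYZ<102>U
```

Each match is put back unchanged, with the extra characters around it. The thread's command does the same thing with a newline in front instead of brackets.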

The final g is just an option and says that the substitution should be applied to every match on the line and not only to the first one. If you have a substitution command like:

s/a/b/

It will replace "a" with "b", but only the first occurrence of "a". An input string of "aaa" will become "baa", but with the "g" in place it will become "bbb" because all the "a"s will be replaced, not only the first one. So, to put it all together, this is what will happen to an input string:

# input string:
ABC101testXYZ102UKMNO1092testing

# after first command (newlines are encoded as "\n" for better understanding):
\nABC101test\nXYZ102U\nKMNO1092testing

# after the second command:
ABC101test\nXYZ102U\nKMNO1092testing

# what will really be written (newlines not encoded any more):
ABC101test
XYZ102U
KMNO1092testing
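
The walkthrough above can be checked in a terminal. This is only a sketch and assumes GNU sed, since "\|" and "\n" in the replacement are used exactly as in the original command; the variable names are made up:

```shell
line='ABC101testXYZ102UKMNO1092testing'

# The "g" flag on its own: first match only vs. every match.
first_only=$(printf 'aaa\n' | sed 's/a/b/')      # baa
every_match=$(printf 'aaa\n' | sed 's/a/b/g')    # bbb

# After the first statement alone there are 4 output lines,
# the first of them empty because of the leading newline.
stage1_lines=$(printf '%s\n' "$line" | sed 's/ABC\|XYZ\|KMN/\n&/g' | grep -c '')

# Both statements together: the leading newline is removed again.
result=$(printf '%s\n' "$line" | sed 's/ABC\|XYZ\|KMN/\n&/g;s/^\n//')

printf '%s\n' "$first_only" "$every_match" "$stage1_lines" "$result"
```

The last printf should show baa, bbb, 4 and then the three lines from the walkthrough.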

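Notice that neither the alternation operator "\|" in a basic regular expression nor "\n" as a newline character in the replacement is covered by the POSIX standard for sed; both are GNU extensions, so a strictly standard-conforming sed may not accept this command. The "-r" switch used in the other variant is likewise not part of the standard.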
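
If the command has to run on a sed without these extensions, one possible workaround (a sketch, not the only way) is one substitution per keyword, with the newline in the replacement written as a backslash followed by a real line break. The function name split_portable is made up for this example:

```shell
# One substitution per keyword. The replacement is a backslash, a real line
# break, then "&". The continuation lines must start in column one, because
# any leading blanks would become part of the replacement.
split_portable() {
  sed -e 's/ABC/\
&/g' -e 's/XYZ/\
&/g' -e 's/KMN/\
&/g' -e 's/^\n//'
}

portable=$(printf '%s\n' 'ABC101testXYZ102UKMNO1092testing' 'ABC999testKMNValid' | split_portable)
printf '%s\n' "$portable"
```

Running the three substitutions one after the other gives the same result as the alternation here, because the three keywords share no letters and therefore cannot overlap.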

There are several (similar but not identical) regular expression engines used in UNIX/Linux:

The most basic "regular expressions" are used by the shell, although they are usually called "file globs" and are, strictly speaking, a separate pattern language: for example, in the expression filename* the "*" is expanded to any string of any length.
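
A small sketch of the difference, with the made-up name "filename" as the string to test: the same characters mean one thing as a glob and another as a regexp.

```shell
name='filename'

# As a glob, "file*" means "file" followed by any string.
case $name in
  file*) glob=match ;;
  *)     glob=no-match ;;
esac

# As a regexp, "file*" means "fil" followed by zero or more "e",
# so the anchored pattern does not match "filename".
if printf '%s\n' "$name" | grep -q '^file*$'; then
  regex=match
else
  regex=no-match
fi

printf 'glob: %s, regex: %s\n' "$glob" "$regex"    # glob: match, regex: no-match
```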

Then there are Basic Regular Expressions or "BRE"s. The syntax of BREs is standardized by POSIX and is used in utilities like sed, grep (in its default mode, see below) and so on.

Notice that the GNU project extended this standard and developed its own variant of BREs, the GNU Basic Regular Expressions. The GNU variants of sed, grep and so on use these instead of plain POSIX BREs. One example of the difference between GNU BREs and POSIX BREs is the quantifier "\+", which means "one or more of the previous expression". (An unescaped "+" is an ordinary character in a BRE, GNU or not.) For instance, the regexp:

/Xa*Y/

will match "XaY", "XaaY" and so on, but also "XY". To exclude the latter and restrict the pattern to one or more "a" you would need to write

/Xaa*Y/         # POSIX, variant 1
/Xa\{1,\}Y/     # POSIX, variant 2
/Xa\+Y/         # GNU

Notice that the two POSIX variants are understood by every BRE engine, GNU included, while the GNU variant is only guaranteed to work with GNU tools.
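
These spellings can be compared with grep in its default (BRE) mode. This is a sketch with made-up variable names, and the third command assumes GNU grep. In a BRE the GNU one-or-more quantifier is written "\+", while an unescaped "+" only matches a literal plus sign:

```shell
sample=$(printf '%s\n' XY XaY XaaY)

# Both POSIX spellings select XaY and XaaY, but not XY.
posix1=$(printf '%s\n' "$sample" | grep 'Xaa*Y')
posix2=$(printf '%s\n' "$sample" | grep 'Xa\{1,\}Y')

# GNU extension: escaped plus in a BRE (assumes GNU grep).
gnu=$(printf '%s\n' "$sample" | grep 'Xa\+Y')

# An unescaped "+" is a literal character in a BRE.
literal=$(printf '%s\n' 'XaY' 'Xa+Y' | grep 'Xa+Y')

printf '%s\n' "$posix1" "$literal"
```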

Then there are Extended Regular Expressions or EREs. EREs are basically a superset of BREs but with a few quirks. For instance you do not escape grouping or numerical quantifiers:

/Xa\{1,\}Y/      # BRE
/Xa{1,}Y/        # ERE
/X\(abc\)*Y/     # BRE
/X(abc)*Y/       # ERE

There is a POSIX standard for these and they are used in utilities like awk, grep -E (the -E option switches the regexp engine from BRE to ERE), egrep (this is basically a grep with the -E option set and fixed) and so on.
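
The grouping example can be tried with grep in both modes. The test strings are made up for this sketch:

```shell
sample=$(printf '%s\n' XY XabcY XabcabcY XabY)

# BRE: the group needs escaped parentheses.
bre=$(printf '%s\n' "$sample" | grep 'X\(abc\)*Y')

# ERE: the same group is written with plain parentheses.
ere=$(printf '%s\n' "$sample" | grep -E 'X(abc)*Y')

printf '%s\n' "$bre"
```

Both commands should select XY, XabcY and XabcabcY and leave out XabY.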

Again, GNU has its own variant of ERE, called GNU ERE, which is used in the GNU versions of awk, egrep, etc., but also in GNU sed when called with the "-E" or the equivalent "-r" switch.
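
As a sketch of that last point, and assuming GNU sed, the command from the start of this post can be written with "-E", so the alternation no longer needs backslashes:

```shell
# ERE mode: "|" is the alternation operator without a backslash.
# "\n" in the replacement is still a GNU extension.
ere_result=$(printf '%s\n' 'ABC999testKMNValid' | sed -E 's/ABC|XYZ|KMN/\n&/g;s/^\n//')
printf '%s\n' "$ere_result"
```

This should print ABC999test and KMNValid on two separate lines.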

I hope this helps.

bakunin