How to extract every repeated string between two specific strings?

Hello guys,

I have a problem with an HP-UX shell script. I have one big text file that looks like this:

SOH
bla bla bla
bla bla bla
ETX                SOH
bla bla bla
ETX
SOH
bla bla bla
ETX

What I need to do is save the first SOH...ETX block into file1.txt, the second block into file2.txt, and so on. Please help me; I don't know how to do this repeatedly.

Welcome to the forums. Please show the expected output as well.

Hi,

The expected output is file1.txt containing the first tagged text, file2.txt containing the second tagged text, and so on. I'm trying to get the text between the two tags using sed, but I really don't know how to do it repeatedly.

file1.txt

SOH
bla bla bla
bla bla bla
ETX

file2.txt

SOH
bla bla bla
ETX

try:

awk '
length {
  print RS $0 > "file" ++n ".txt"
  close("file" n ".txt")
}' RS="SOH" infile

Hi,

Thanks for your reply. It's splitting the whole text into separate files, but not extracting just the text between the SOH and ETX tags.

How about... awk '/SOH/,/ETX/' file1.txt > file2.txt
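The range pattern above selects each SOH..ETX block, but it sends all of them to one output stream, so by itself it can't produce file1.txt, file2.txt, and so on. As a rough sketch (not from this thread, and assuming each tag sits on a line of its own), a small state-machine awk can open a new numbered file at every SOH instead; the sample data and the partN.txt names are made up for the demo:

```shell
# Demo input: two well-formed blocks, one tag per line.
cat > sample.txt <<'EOF'
SOH
bla bla bla
ETX
SOH
more text
ETX
EOF

awk '
/^SOH/ { f = "part" ++n ".txt" }   # start a new numbered output file
f      { print > f }               # copy lines while inside a block
/^ETX/ { close(f); f = "" }        # stop copying after the closing tag
' sample.txt
```

This writes part1.txt and part2.txt, each holding one complete SOH..ETX block. Note it would not handle a line containing both an ETX and the next SOH, as in the original post.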

Hi, I have already tried it without luck.

You haven't told us what OS or shell you're using. You have told us some suggestions aren't working, but you haven't supplied any details about how they failed. You have shown us sample input with 3 SOH ... ETX pairs, but the sample output you say you want from that input only has two output files. You haven't said what should happen to text before the 1st SOH, between one ETX and the next SOH, nor after the last ETX. Nonetheless, the following seems to do what you want (making several wild assumptions):

awk '
{	while(length)
		if(soh) {
			# We have already seen SOH...
			# Copy text until we find ETX.
			if(etx = index($0, "ETX")) {
				# ETX found...  print through ETX, close output
				# file, clear soh, and throw away the part of the
				# line we have already processed.
				printf("%s\n", substr($0, soh, etx - soh + 3)) > f
				close(f)
				soh = 0
				$0 = substr($0, etx + 3)
			} else {# ETX not found...  print rest of the line.
				printf("%s\n", substr($0, soh)) > f
				soh = 1
				next
			}
		} else {# Look for SOH...
			if(soh = index($0, "SOH")) {
				# SOH found... set output filename...
				f = "file" ++nof ".txt"
				continue
			} else {# SOH not found...
				next
			}
		}
}' file

If you want to use this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.


Hello,

Sorry for the bad description. The shell I use is /sbin/sh, the default shell of HP-UX; the OS is HP-UX. I have a big file that contains many tagged texts.

SOH
...
ETX
SOH
...
ETX
.
.
.
ETX

I need to split this one big file into separate text files; each tagged text must go into its own file. Thanks for your help. I'm gonna try it.

/sbin/sh is not the default shell of HP-UX unless HP has changed its strategy recently... the default shell is /usr/bin/sh, which is a POSIX-compatible shell, or /usr/bin/ksh.
As its directory shows, /sbin/sh is not intended to be used by users other than root: it is a true basic Bourne shell with nothing more, compiled statically so that root can work in maintenance mode with only / mounted...

Hi.

The standard HP-UX utility csplit is designed for this. For example, assuming a well-formatted file, this script:

#!/usr/bin/env bash

# @(#) s1       Demonstrate context splitting, csplit.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C csplit

# Remove debris from previous runs.
rm -f xx*

FILE=${1-data1}

pl " Input data file $FILE:"
cat $FILE

pl " Results:"
# -z only in GNU/Linux.
# csplit -k -z $FILE '/^SOH/' '{*}'
csplit -k $FILE '/^SOH/' '{*}'
ls -lgo xx*

EXAMPLE=xx02
pl " Content for example file $EXAMPLE:"
cat $EXAMPLE

exit 0

will produce:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: HP-UX, B.11.11, 9000/785
Distribution        : GenericSysName [HP Release B.11.11] (see /etc/issue)
GNU bash 4.2.37
csplit - ( /usr/bin/csplit Nov 14 2000 )

-----
 Input data file data1:
SOH
Stuff in first section.
ETX
SOH
Stuff in second section.
ETX
SOH
Stuff in last section.
ETX

-----
 Results:
0
32
33
31
-rw-r--r--   1       0 Oct  8 18:29 xx00
-rw-r--r--   1      32 Oct  8 18:29 xx01
-rw-r--r--   1      33 Oct  8 18:29 xx02
-rw-r--r--   1      31 Oct  8 18:29 xx03

-----
 Content for example file xx02:
SOH
Stuff in second section.
ETX

Best wishes ... cheers, drl


Hi drl,
Unfortunately, if you look at the sample data shown in the 1st post in this thread, the data:

SOH
bla bla bla
bla bla bla
ETX                SOH
bla bla bla
ETX
SOH
bla bla bla
ETX

is NOT what you call well-formatted. The spaces between the ETX and the SOH on the line containing both tags apparently are not supposed to appear in any of the output files.

Hi, Don.

Yes, I noticed that in the original post, but it was re-posted in #9, so that was the form I used.

I certainly agree that csplit as written in my script would not handle the original data, so I leave it up to sembii to decide which format is correct, which, in turn, would suggest the appropriate solution ... cheers, drl

How about this:

awk '
length {
  sub("ETX.*", "ETX")
  print RS $0 > "file" ++n ".txt"
  close ("file" n ".txt")
}' RS="SOH" infile

input:

SOH
bla bla bla
bla bla bla
ETX IGNORE THIS    SOH
bla bla bla
ETX
SOH
bla bla bla
ETX
----file1.txt---
SOH
bla bla bla
bla bla bla
ETX

----file2.txt---
SOH
bla bla bla
ETX

----file3.txt---
SOH
bla bla bla
ETX

Although this will work well with some versions of awk, the standards say that the behavior of awk is unspecified if the value of RS contains more than one character. On OS X (and several other systems), the above code will just use "S" as the record separator and ignore the "OH".
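One portable workaround (a sketch under the assumption that the tags are the literal strings SOH and ETX, not control characters) is to avoid the multi-character RS entirely: first break the input so every SOH starts its own line, then split with plain line-oriented logic. The infile and fileN.txt names here are just for the demo:

```shell
# Demo input, including an "ETX ... SOH" pair on one line.
cat > infile <<'EOF'
SOH
bla bla bla
ETX IGNORE THIS    SOH
bla bla bla
ETX
EOF

# Insert a newline before every SOH, then split line by line.
sed 's/SOH/\
SOH/g' infile |
awk '
/^SOH/ { f = "file" ++n ".txt" }    # new numbered file at each SOH
f {
  sub(/ETX.*/, "ETX")               # drop anything after ETX on the line
  print > f
  if (/ETX$/) { close(f); f = "" }  # block done
}'
```

Since both sed and the awk here only ever look at one-character record separators and literal text, this should behave the same on older awks that reject a multi-character RS.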


Hello guys,

Thanks for your valuable help. I did some investigation and tried all of your suggested solutions, and I finally solved my problem. Here is the solution that I used.

N='0'
ls *.[Tt][Xx][Tt] | while read line
do
cat $line > file$N
csplit -f $line file$N '/^SOH/' '{*}'
N=$((N+1))
done

Thanks again.

The following should do exactly the same thing and use fewer system resources:

N='0'
for line in *.[Tt][Xx][Tt]
do
cat $line > file$N
csplit -f $line file$N '/^SOH/' '{*}'
N=$((N+1))
done

In cases where *.[Tt][Xx][Tt] expands to a long list of files, the for loop should still work correctly even though the ls command could fail due to ARG_MAX limitations.

Hi Don Cragun,

Yes, you're right. In this environment, all the text files that we want to split are located in a separate directory, and their names include a date, like 201410201653.INP, so I have changed the list command as below.

cd /path/to/directory/
ls 20*.INP | while read line

Also, the output files of csplit are named awkwardly, e.g. 201410201653.INP01, ***02, ***03, etc., so I'm going to rename them in the script to something like 201410201653_01.INP.

Also, I have added the '-n 3' option to the csplit command, because with the default two-digit suffixes csplit runs out of output names after xx99.

csplit -f $line -n 3 file$N '/^SOH/' '{*}'
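With -n 3, csplit appends three-digit numeric suffixes after the prefix, so the outputs look like 201410201653.INP000. A possible rename step (just a sketch assuming a POSIX shell; the touch lines only fake csplit's output for the demo) moves csplit's numeric suffix in front of the .INP extension:

```shell
# Demo only: fake two csplit output files.
touch 201410201653.INP000 201410201653.INP001

# Rename 201410201653.INP000 -> 201410201653_000.INP, etc.
for f in *.INP[0-9][0-9][0-9]; do
  base=${f%.INP*}    # part before ".INP<nnn>", e.g. 201410201653
  num=${f##*.INP}    # the numeric suffix, e.g. 000
  mv "$f" "${base}_${num}.INP"
done
```

The `${f%pattern}` / `${f##pattern}` expansions are POSIX, so this should work in /usr/bin/sh on HP-UX as well as in ksh.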
