I have a problem with an HP-UX shell script. I have one big text file that contains something like this:
SOH
bla bla bla
bla bla bla
ETX SOH
bla bla bla
ETX
SOH
bla bla bla
ETX
What I need to do is save the first SOH ... ETX block into file1.txt, the second SOH ... ETX block into file2.txt, and so on. Please help me; I don't know how to do this repeatedly for every block.
The expected output is file1.txt containing the first tagged text, file2.txt containing the second tagged text, and so on. I'm trying to get the text between the two tags using sed, but I really don't know how to repeat it for each pair of tags.
You haven't told us what OS or shell you're using. You have told us some suggestions aren't working, but you haven't supplied any details about how they failed. You have shown us sample input with 3 SOH ... ETX pairs, but the sample output you say you want from that input only has two output files. You haven't said what should happen to text before the 1st SOH, between one ETX and the next SOH, nor after the last ETX. Nonetheless, the following seems to do what you want (making several wild assumptions):
awk '
{   while(length)
        if(soh) {
            # We have already seen SOH...
            # Copy text until we find ETX.
            if(etx = index($0, "ETX")) {
                # ETX found... print through ETX, close output
                # file, clear soh, and throw away the part of the
                # line we have already processed.
                printf("%s\n", substr($0, soh, etx - soh + 3)) > f
                close(f)
                soh = 0
                $0 = substr($0, etx + 3)
            } else {
                # ETX not found... print rest of the line.
                printf("%s\n", substr($0, soh)) > f
                soh = 1
                next
            }
        } else {
            # Look for SOH...
            if(soh = index($0, "SOH")) {
                # SOH found... set output filename...
                f = "file" ++nof ".txt"
                continue
            } else {
                # SOH not found...
                next
            }
        }
}' file
If you want to use this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
/sbin/sh is not the default shell on HP-UX, unless HP has recently changed its strategy. The default shell is /usr/bin/sh, which is a POSIX-compatible shell, or /usr/bin/ksh. As its directory suggests, /sbin/sh is not intended to be used by anyone other than root: it is a basic Bourne shell and nothing more, compiled statically so that root can work in maintenance mode with nothing but / mounted.
$ ./s1
Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: HP-UX, B.11.11, 9000/785
Distribution : GenericSysName [HP Release B.11.11] (see /etc/issue)
GNU bash 4.2.37
csplit - ( /usr/bin/csplit Nov 14 2000 )
-----
Input data file data1:
SOH
Stuff in first section.
ETX
SOH
Stuff in second section.
ETX
SOH
Stuff in last section.
ETX
-----
Results:
0
32
33
31
-rw-r--r-- 1 0 Oct 8 18:29 xx00
-rw-r--r-- 1 32 Oct 8 18:29 xx01
-rw-r--r-- 1 33 Oct 8 18:29 xx02
-rw-r--r-- 1 31 Oct 8 18:29 xx03
-----
Content for example file xx02:
SOH
Stuff in second section.
ETX
Hi drl,
Unfortunately, if you look at the sample data shown in the 1st post in this thread, the data:
SOH
bla bla bla
bla bla bla
ETX SOH
bla bla bla
ETX
SOH
bla bla bla
ETX
is NOT what you call well-formatted. The spaces between the ETX and the SOH on the line that contains both markers (highlighted in red in the original post) apparently are not supposed to appear in any of the output files.
Yes, I noticed that in the original post, but it was re-posted in #9, so that was the form I used.
I certainly agree that csplit as written in my script would not handle the original data, so I leave it up to sembii to decide which format is correct, which, in turn would suggest the appropriate solution ... cheers, drl
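For anyone following along: the script s1 itself is not shown above, so the following is only a guess at the kind of csplit call that produces such a listing for the well-formatted data. It assumes a csplit that accepts the '{*}' repeat count used elsewhere in this thread (POSIX itself only defines '{num}'); the directory name csplit_demo is made up for the demo.

```shell
# Work in a scratch directory (hypothetical name) so nothing else is touched.
mkdir -p csplit_demo && cd csplit_demo || exit 1

# Recreate the sample input from the listing above.
cat > data1 <<'EOF'
SOH
Stuff in first section.
ETX
SOH
Stuff in second section.
ETX
SOH
Stuff in last section.
ETX
EOF

# Start a new piece before every line that begins with SOH.
# Default output names are xx00, xx01, ...
# '{*}' means "repeat as often as possible" and is an extension.
csplit data1 '/^SOH/' '{*}'
```

csplit reports the byte count of each piece as it writes it, which is where the four numbers under "Results:" come from. The first piece is empty because the very first line already matches the pattern.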
Although this will work well with some versions of awk, the standards say that the behavior of awk is unspecified if the value of RS contains more than one character. On OS X (and several other systems), the above code will just use "S" as the record separator and ignore the "OH".
Thanks for your valuable help. I did some investigation and tried all of your suggested solutions, and I have finally solved my problem. Here is the solution that I used.
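One way to stay inside what the standard specifies is to leave RS alone and match the markers line by line. This is a minimal sketch, not a replacement for the longer script earlier in the thread: it assumes every SOH and ETX starts its own line (as in the re-posted data, not the original "ETX SOH" line). The directory name awk_demo and the output names file1.txt, file2.txt, ... are chosen for the demo.

```shell
# Work in a scratch directory (hypothetical name).
mkdir -p awk_demo && cd awk_demo || exit 1

# Well-formatted sample input: each marker begins a line.
cat > data1 <<'EOF'
SOH
Stuff in first section.
ETX
SOH
Stuff in second section.
ETX
EOF

awk '
/^SOH/  { f = sprintf("file%d.txt", ++n) }      # SOH: choose the next output name
f != "" { print > f }                           # inside a section: copy the line
/^ETX/  { if (f != "") { close(f); f = "" } }   # ETX: close the file, leave the section
' data1
```

Because only the default newline record separator is used, this should behave the same in any POSIX awk. Lines outside an SOH ... ETX pair are simply dropped.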
N='0'
ls *.[Tt][Xx][Tt] | while read line
do
    cat $line > file$N
    csplit -f $line file$N '/^SOH/' '{*}'
    N=$((N+1))
done
The following should do exactly the same thing and use fewer system resources:
N='0'
for line in *.[Tt][Xx][Tt]
do
    cat $line > file$N
    csplit -f $line file$N '/^SOH/' '{*}'
    N=$((N+1))
done
In cases where *.[Tt][Xx][Tt] expands to a long list of files, the for loop should still work correctly even though the ls command could fail due to ARG_MAX limitations.
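One small difference between the two loops is worth guarding against: if nothing matches, the for loop still runs once with the unexpanded pattern as the value of line. Below is a guarded sketch, assuming that case should simply be skipped. The scratch directory loop_demo and the file sample.txt are made up for the demo, the intermediate copy to file$N is left out, and '{*}' is taken from the posts above (POSIX only defines '{num}').

```shell
# Work in a scratch directory (hypothetical name) with one made-up input file.
mkdir -p loop_demo && cd loop_demo || exit 1
printf 'SOH\nfirst\nETX\nSOH\nsecond\nETX\n' > sample.txt

for line in *.[Tt][Xx][Tt]
do
    # When no file matches, $line holds the literal pattern: skip it.
    [ -f "$line" ] || continue
    # Quoting keeps names with spaces intact; -s silences the byte counts.
    csplit -s -f "$line" "$line" '/^SOH/' '{*}'
done
```

With this input the pieces come out as sample.txt00 (empty), sample.txt01, and sample.txt02.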
Yes, you're right. In this environment, all the text files that we want to split are located in a separate directory, and their names include the date, like 201410201653.INP, so I have changed the list command as shown below.
cd /path/to/directory/
*
ls 20*.INP | while read line
Also, the output files of csplit are named incorrectly, for example 201410201653.INP01, ***02, **03, etc., so I'm going to rename them in the script to names like 201410201653_01.INP.
I have also added the '-n 3' option to the csplit command, because csplit reached its maximum of 99 output files.
csplit -f $line -n 3 file$N '/^SOH/' '{*}'
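The renaming described above could be done with a loop along these lines. It is only a sketch under assumptions: the pieces are taken to carry the three-digit suffix that '-n 3' produces, the target form NAME_NNN.INP follows the 201410201653_01.INP example, and the directory rename_demo and the empty sample files are made up for the demo.

```shell
# Work in a scratch directory (hypothetical name) with made-up csplit-style pieces.
mkdir -p rename_demo && cd rename_demo || exit 1
: > 201410201653.INP001
: > 201410201653.INP002

for piece in 20*.INP[0-9][0-9][0-9]
do
    [ -e "$piece" ] || continue       # pattern matched nothing
    num=${piece##*.INP}               # trailing digits, e.g. 001
    base=${piece%.INP*}               # part before .INP, e.g. 201410201653
    mv "$piece" "${base}_${num}.INP"  # e.g. 201410201653_001.INP
done
```

Only POSIX parameter expansion is used here, so it should work in /usr/bin/sh as well as ksh.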