extract blocks of text from a file

cajunfries · May 10, 2009, 10:16pm

Hi,
This is part of a large text file I need to separate out.
I'd like some help to build a shell script that will extract the text between sets of dashed lines, write that to a new file using the whole or part of the first text string as the new file name, then move on to the next one and repeat.
The amount of text between the dashes is variable - might be just a couple of lines of text or many lines.
There's one line of space between the dashed line and the first line of text.
Doesn't matter to me if the new output file contains the dashes or not.
It would be nice to flag the ones with "No errors found" by appending that to the filename also, but not necessary.
Thanks!

Input file:

-----------------------------------------------------------------------

3D Survey MBST_BASIN/M93upd05_htti2_TTIvol2_Z (storage m93up5)
No errors found

-----------------------------------------------------------------------

3D Survey m93up5_ip/M93upd05_htti2_TTIvol2_Z (storage m93up5)
No errors found

-----------------------------------------------------------------------

3D Survey MARS_B/Mars-B (storage mars_b)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

mars_b/mars_b01.3dv

-----------------------------------------------------------------------

3D Survey mars_b_ip/Mars-B (storage mars_b)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

mars_b/mars_b01.3dv

-----------------------------------------------------------------------

3D Survey AUGER_123DI/szwauger (storage szwauger)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

szwauger/S_AUGER_123DI_30601.3dh
szwauger/S_AUGER_123DI_30701.3dh
szwauger/S_AUGER_123DI_30801.3dh
szwauger/S_AUGER_123DI_30901.3dh
szwauger/S_AUGER_123DI_31001.3dh
szwauger/S_AUGER_123DI_31101.3dh
szwauger/S_AUGER_123DI_31201.3dh
szwauger/S_AUGER_123DI_31301.3dh
szwauger/S_AUGER_123DI_31401.3dh
szwauger/S_AUGER_123DI_31501.3dh
szwauger/S_AUGER_123DI_31601.3dh

-----------------------------------------------------------------------

2D Project szwauger_1p

-----------------------------------------------------------------------

Desired output :

file 1, named "3D Survey MBST_BASIN"

3D Survey MBST_BASIN/M93upd05_htti2_TTIvol2_Z (storage m93up5)
No errors found

file 2, named "3D Survey m93up5_ip"

3D Survey m93up5_ip/M93upd05_htti2_TTIvol2_Z (storage m93up5)
No errors found

file 3, named "3D Survey MARS_B"

3D Survey MARS_B/Mars-B (storage mars_b)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

mars_b/mars_b01.3dv

and so on...

ghostdog74 · May 10, 2009, 11:04pm

If you have Python, here's an alternative:

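out = None  # currently open output file, if any
for line in open("file"):
    line = line.strip()
    if "---" in line:
        continue  # skip the dashed delimiter lines
    elif "3D Survey" in line:
        # first line of a block: the text before the slash becomes the filename
        filename = line.split("/")[0]
        if out:
            out.close()
        out = open(filename.replace(" ", "."), "w")
    if out:
        print(line, file=out)  # Python 3 form of: print >>o, line
if out:
    out.close()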

output:

# ls -1 3D*
3D.Survey.AUGER_123DI
3D.Survey.MARS_B
3D.Survey.MBST_BASIN
3D.Survey.m93up5_ip
3D.Survey.mars_b_ip

# more 3D.Survey.mars_b_ip
3D Survey mars_b_ip/Mars-B (storage mars_b)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

mars_b/mars_b01.3dv

# more 3D.Survey.MARS_B
3D Survey MARS_B/Mars-B (storage mars_b)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

mars_b/mars_b01.3dv

cajunfries · May 11, 2009, 11:03am

Thanks - although I don't know what Python is, I might be able to adapt your code in a shell script.

ghostdog74 · May 11, 2009, 11:18am

Python is a scripting/programming language (much like Perl). Anyway, that Python code is self-explanatory, so I don't think you will have much trouble "converting" it to shell.

cajunfries · May 11, 2009, 12:12pm

Hi,
Thanks again, but just briefly skimming over your code, I see one small problem which will prevent it from working like I need - the line "elif "3D Survey" in line:" cannot be so specific for that text string.

I'll need some way to capture any and all text between the dashed lines, then use whatever comes up in the first line of text (before the slash) as the output filename. The first text line won't always contain "3D Survey" - it might be anything. It's just an accident that my example made it look consistent (sorry).
It changes later on in the file...

basically I need something like this

open file for reading
if dashed lines, then skip one line
read next line - extract text up to slash and store as filename
read lines and write to file
if dashed line encountered, close file
repeat

I'll have another look at the code later on today and see what I can figure out.
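
The steps outlined above can be sketched roughly as follows in POSIX shell. This is only a sketch under a few assumptions: the function name split_blocks is made up for illustration, a delimiter is taken to be any line starting with at least ten dashes, and spaces in the filename are turned into dots (as in the Python output earlier in the thread). Output files are written to the current directory.

```shell
#!/bin/sh
# Illustrative sketch; split_blocks is a hypothetical helper name.
split_blocks() {
    # $1 = input file
    out=""       # current output file
    state=skip   # skip = before the first dashes, name = next text line names the file, copy = inside a block
    while IFS= read -r line; do
        case $line in
        ----------*)                 # dashed line: the current block is finished
            out=""
            state=name
            continue ;;
        esac
        case $state in
        name)
            [ -n "$line" ] || continue    # skip the blank line after the dashes
            # text before the first slash, spaces replaced by dots
            out=$(printf '%s\n' "${line%%/*}" | tr ' ' '.')
            printf '%s\n' "$line" > "$out"
            state=copy ;;
        copy)
            printf '%s\n' "$line" >> "$out" ;;
        esac
    done < "$1"
}
```

Calling split_blocks inputfile reads the file once from top to bottom. Nothing has to be closed explicitly, because each printf opens and closes the output file itself.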

ghostdog74 · May 11, 2009, 12:44pm

Another way, if your file is not too big, is to read everything into memory, then split on dashes plus newline. After splitting, the array will contain all the data you need. Iterate over the array to get the filenames, and write to the output files accordingly.

import re

# split the whole file on a run of dashes followed by \n
pat = re.compile(r"-+\n")
with open("file") as f:
    data = pat.split(f.read())
# remove extraneous data like blanks and newlines
data = [i.strip() for i in data if i.strip()]
for items in data:
    try:
        index_of_slash = items.index("/")  # position of the first "/"
    except ValueError:
        continue  # no slash in this block, so no filename can be built
    filename = items[:index_of_slash]  # text before the first slash
    with open(filename.replace(" ", "."), "w") as out:
        out.write(items)

output:

# ls -1 3D*
3D.Survey.AUGER_123DI
3D.Survey.MARS_B
3D.Survey.MBST_BASIN
3D.Survey.m93up5_ip
3D.Survey.mars_b_ip

# more 3D.Survey.AUGER_123DI
3D Survey AUGER_123DI/szwauger (storage szwauger)
Seismic files referenced in Oracle not present on disk
This is an ERROR. Files listed below will not open in SeisWorks:

szwauger/S_AUGER_123DI_30601.3dh
szwauger/S_AUGER_123DI_30701.3dh
szwauger/S_AUGER_123DI_30801.3dh
szwauger/S_AUGER_123DI_30901.3dh
szwauger/S_AUGER_123DI_31001.3dh
szwauger/S_AUGER_123DI_31101.3dh
szwauger/S_AUGER_123DI_31201.3dh
szwauger/S_AUGER_123DI_31301.3dh
szwauger/S_AUGER_123DI_31401.3dh
szwauger/S_AUGER_123DI_31501.3dh
szwauger/S_AUGER_123DI_31601.3dh

With the shell, you can use awk to get the same results (incomplete code):

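awk 'BEGIN{
 RS="---*\n\n"   # a multi-character (regex) RS works in gawk and mawk, not in every awk
 FS="/"
}{
 filename=$1      # text before the first slash
 sub(/\n.*/, "", filename)   # no slash on the first line: keep only that line
 if(filename !=""){
    print $0 >filename
    close(filename)   # avoid running out of open files on large inputs
 }
}' file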

cajunfries · May 11, 2009, 1:24pm

OK - thanks!
I'll sift thru all this and see what I can do. I imagine the awk might be the best solution. Don't know arrays at all.

cajunfries · May 18, 2009, 11:41am

Hello all,
I've been making some progress on splitting out this input file, but just can't figure it out completely.

I tried everything suggested (sincere thanks to all who sent suggestions), but couldn't get the array solution, the awk, or the Python examples to work.

Here's what I've written so far. It does what I need at first: when it encounters a dashed line in the input, it skips the next blank line, then reads the next line of text and extracts a portion of that for the output filename (using the 1st 20 char., compressing spaces, and substituting slashes if they exist). But then I need it to read whatever text comes next into the output file just created, down to the next dashed line it finds, then exit that loop and create the next output file as before.

First I tried another while loop (now commented out), but it just runs away and reads the whole file without stopping. I then tried just a second if statement I have inside the first, but this one reads only one more line of text and exits to the outer loop.

I know the way I do it might be clumsy, but it's all that I know and all that I've been able to find in the forums and on the internet. Can anyone help with this second loop? Thanks.

------------------------

#!/bin/ksh
file=$1
while read line; do

    #extract first 10 characters to see if it's a dashed line delimiter
    check=$(echo $line | awk '{ print substr( $0, 0, 10)}')
    if [[ "$check" == "----------" ]]; then

        #skip blank line
        read line
        read line

        #extract first 20 characters of third line for output filename
        path1=$(echo $line | awk '{ print substr( $0, 0, 20)}')

        #remove any spaces
        path2=$(echo $path1 | sed 's/ //g')

        #if one exists, replace slash w/underscore
        path=$(echo $path2 | sed 's/\//_/')

        #create output file
        echo $line > ${path}.stg_chk_out

        ############# THIS INNER LOOP NOT WORKING

        #continue to read until next delimiter
        ####### while read line; do
        check=$(echo $line | awk '{ print substr( $0, 0, 10)}')
        if [[ "$check" != "----------" ]]; then
            echo $line >> ${path}.stg_chk_out
            read line
            check=$(echo $line | awk '{ print substr( $0, 0, 10)}')
        fi
        ###### done < $file
    fi
done < $file
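
A likely reason the commented-out inner loop runs away: its own "done < $file" reopens the input file, so the inner "read" starts again at line 1 instead of continuing where the outer loop stopped. The "if" version stops after one line because an "if" runs its body only once. One possible fix, sketched below in POSIX shell, is to let both loops read from the single redirect on the outer loop. The function name split_nested and the ".no_errors" suffix are invented for illustration, and the sketch assumes that any non-blank line following a dashed line starts a new block. It keeps the filename rule from the script above (first 20 characters, spaces removed, slashes replaced) and also flags blocks containing "No errors found" in the filename, as asked in the first post.

```shell
#!/bin/sh
# Illustrative sketch; split_nested and the .no_errors suffix are hypothetical names.
split_nested() {
    # $1 = input file. Both loops share the stdin opened once at the bottom,
    # so the inner loop continues from the line where the outer loop stopped.
    while IFS= read -r line; do
        # outer loop: skip blank lines and delimiters until a block's first text line
        case $line in
        ''|----------*) continue ;;
        esac
        # first 20 characters, spaces removed, slashes replaced with underscores
        path=$(printf '%s\n' "$line" | cut -c1-20 | sed -e 's/ //g' -e 's/\//_/g')
        out="${path}.stg_chk_out"
        printf '%s\n' "$line" > "$out"
        # inner loop: copy lines until the next dashed delimiter
        while IFS= read -r line; do
            case $line in
            ----------*) break ;;
            esac
            printf '%s\n' "$line" >> "$out"
        done    # no "< file" here: that would restart reading from line 1
        # optional: flag clean blocks in the filename
        if grep -q 'No errors found' "$out"; then
            mv "$out" "${path}.no_errors.stg_chk_out"
        fi
    done < "$1"
}
```

Using printf '%s\n' "$line" with IFS= read -r instead of echo $line keeps leading spaces and backslashes in the text intact, and the quotes around "$out" keep filenames with unusual characters from being split.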