Lines with strange characters and sed...

luiscarvalheiro · October 28, 2008, 7:27pm

Dear All:

I have a bunch of files that I'd like to process with a shell script. The problem is that the files have strange characters in their headers, like this:

g8@L-000-MSG2__-ABCD________-FIRA_____-000001___-200806181330-__
e
Data from BLABLABLA, Instrument: BLABLA, Date: 2008/06/18 13:30Z
Row: 1078 Col: 1130 Lat: -22.267 Lon: 22.256 *** Something here ***

For my purposes, I only need the information from line 3 onwards (in this case). Sometimes this strange header occupies 2 lines, sometimes 3, and sometimes I don't know how many.

I made a very simple test, like

FILE=`find . -type f -name "FILENAME"`

for i in $FILE
do

FNOW=`echo $i`

#Cuts two first lines of the file
sed '1,2d' $FNOW > newfile
sed '/^$/d' -i newfile

HEADER=`head -1 newfile | cut -c1-4`
if [ "$HEADER" != "Data" ]
then
sed '1d' -i newfile
sed '/^$/d' -i newfile
fi

#A simple testing
HEADER2=`head -1 newfile | cut -c1-4`
echo ${HEADER2},${HEADER} >> test.txt

done

The problem is that sometimes I don't manage to cut all the "strange" headers to obtain "clean" files, as you can see in some lines of test.txt:

Data,@H
Data,
Data,Data
Data,@H

(etc)

So:
Is there any way to do what I want with sed? Maybe something like "delete all the first lines until you find the expression 'Data'"? Honestly, I don't know what else to try.

Thank you very much in advance

KenJackson · October 28, 2008, 10:03pm

I haven't tried this, but it may get you a step closer to what you want.

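for f in $(find . -type f -name 'SOMETHING*'); do   # assumes no whitespace in file names
    # Put the copy next to the original ("modified-$f" would give "modified-./dir/file")
    g="$(dirname "$f")/modified-$(basename "$f")"
    cp -iv "$f" "$g"
    # Drop the first line until it starts with 'Data' plus 20 printable characters
    while head -n1 "$g" | grep -Eqv '^Data[[:print:]]{20}' && test -s "$g"; do
        sed -i 1d "$g"
    done
done

It loops through all the files found, makes a copy named modified-<name> next to each one, and chops off the first line of the copy as long as that line doesn't start with 'Data' followed by 20 printable characters.

Note that grep -Eqv prints no output (-q) and returns true if the line does not (-v) match the regex. test -s returns true if the file exists and is larger than zero bytes, so the loop stops once the copy is empty.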
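
For reference, the "delete all the first lines until Data" idea from the original question can probably also be done in a single pass with a sed address range. This is only a sketch: the helper name strip_header is made up for illustration, and it assumes the first wanted line always begins with the literal text Data, which the thread suggests but does not guarantee.

```shell
# strip_header FILE: print FILE from the first line that starts with "Data"
# through the end of the file; everything before that line is dropped.
# (strip_header is a hypothetical helper name, not something from the thread.)
strip_header() {
    # LC_ALL=C makes sed treat the input as plain bytes instead of text in
    # the current locale's encoding, which is safer with odd header bytes.
    # -n suppresses default output; '/^Data/,$p' prints from the first
    # matching line to the last line.
    LC_ALL=C sed -n '/^Data/,$p' "$1"
}
```

It would be used as strip_header somefile > somefile.clean. If no line starts with Data, the output is empty, so it may be worth checking the size of the result before replacing anything.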

luiscarvalheiro · October 31, 2008, 8:42am

KenJackson:

Thank you very much for your suggestion.
With some modifications, your idea worked perfectly for me.

Best Regards,
Luis

joeyg · October 31, 2008, 9:17am

It is sometimes hard to test against a file with strange characters without having the file itself, but consider the following:

You may have seen that you can do

echo "hello" | tr '[:lower:]' '[:upper:]'
HELLO

but there are other character classes, such as
[:alnum:] for alphanumeric characters
[:print:] for printable characters
[:cntrl:] for control characters

Perhaps using one of the above might allow you to strip off the bad, or only carry forward the good characters.
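
As a rough sketch of the "only carry forward the good characters" variant, tr can delete everything outside a chosen set. The helper name keep_printable is invented for this example, and it assumes the useful data is plain ASCII text; note that tabs are not in [:print:], so they would be removed as well.

```shell
# keep_printable: read stdin and write it to stdout, deleting every byte
# that is neither a printable character nor a newline.
# (keep_printable is a hypothetical helper name.)
keep_printable() {
    # -c complements the set, -d deletes: together they delete everything
    # that is NOT in '[:print:]\n'. LC_ALL=C restricts [:print:] to
    # printable ASCII (space through tilde).
    LC_ALL=C tr -cd '[:print:]\n'
}
```

For example, keep_printable < somefile > somefile.clean. This only removes stray bytes inside lines; header lines made entirely of printable characters would still have to be cut separately.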

wempy · October 31, 2008, 9:36am

Also, maybe investigate the strings utility, which should at least help winnow out a lot of the crud first.