Help search and replace hex values only in specific files

verge · June 14, 2011, 6:56pm

perl -pi -e 's/\x00/\x0d\x0a/g' `grep -l $'[\x00]GS' filelist`

This isn't working: it's not pulling the files that contain the regex. Please help me rewrite this.

Ideally this would work on 9K of the 20K files in the directory. I've tried the loop below, but I don't know enough about awk to make it work.

for each del in `cat delimiters`
do
  perl -pi -e 's/\x$del/\x0d\x0a/g' `grep -l $'[\x$del]GS' filelist`
done
cat delimiters
15
1c
1e
1f

GOAL
Substitute the value at position 106 with a newline throughout the entire file (preferably only when the value at position 107 is G), and preferably only for files that have a value at position 107 and whose first line begins with ISA.

ISSUE

My grep statement looking for specific hex values isn't working.
I'd like to identify and perform the same action on files that share the same value at position 106.
Some characters, such as \ or - or other hex values, are troublesome in a perl one-liner or a grep pattern.
The file contains other non-ASCII or non-printable characters throughout that should not be substituted.
The value at position 106:
  changes from file to file
  is consistent within a particular file
  is non-word, usually non-ASCII or non-printable (i.e. hex value 15, 1c, 1e, 1f, 21, 27, 2a, 2e, 3c, 3d, 3e, 3f, 40, 5c, 5e, 60, 7d, 7e, b8, be, c4, 00)
  may not be at position 106 in other files, and should not be substituted in files that do not have this value at position 106

INPUT:  This is what the input file looks like.  The value at position 106 is Hex x00 

ISA`00`FTL DATA  `00`FTL DATA  `ZZ`BBBB MFG       `ZZ`FFF  MFG       `110612`1931`U`00401`000001527`0`P`> GS`FA`BBBB MFG`FFF  MFG`110612`1931`940`X`002040 ST`997`000001184 AK1`PO`214 AK2`830`000007903 AK5`A AK9`A`1`1`1 SE`6`000001184 GE`1`940 IEA`1`000001527-
 
or
 
The value at position 106 is Hex x27.  Hex x60 is throughout the file and shouldn't be substituted. 

ISA`00`FTL DATA  `00`FTL DATA  `ZZ`B715 MFG       `ZZ`FTL  MFG       `110612`1931`U`00401`000001527`0`P`>'GS`FA`B715 MFG`FTL  MFG`110612`1931`940`X`002040'ST`997`000001184'AK1`PO`214'AK2`830`000007903'AK5`A'AK9`A`1`1`1'SE`6`000001184'GE`1`940'IEA`1`000001527-
 
OUTPUT:  This is what the output should look like.

ISA`00`FTL DATA  `00`FTL DATA  `ZZ`BBBB MFG       `ZZ`FFF  MFG       `110612`1931`U`00401`000001527`0`P`>
GS`FA`BBBB MFG`FFF  MFG`110612`1931`940`X`002040
ST`997`000001184
AK1`PO`214
AK2`830`000007903
AK5`A
AK9`A`1`1`1
SE`6`000001184
GE`1`940
IEA`1`000001527-

mirni · June 14, 2011, 9:02pm

Sounds like a fun project

Let's break it up:

convert to hex
store character 106 and 107
in hex format replace all char106's to '0a' if char107 is 'G'
convert hex back to ascii

Here, try this:

#!/bin/bash

while read -r file ; do           #loop through all input files
    od -v -t x1 "$file" > hexfile #dump the content in hex 1-bytes; -v stops od collapsing repeated lines into '*'
    awk 'NR==7{print $11 " " $12; exit}' hexfile |  #get char 106 and 107
    while read -r c106 c107 ; do  #and store them in variables
      if [ "$c107" = 47 ] ; then  #char107 is 'G'; substitute
         sed -i "s/$c106/0a/g" hexfile #replace all in hexfile
      fi
    done #end while
    awk '{$1=""; gsub(/ /,"")}1' hexfile | perl -ne '$l=length($_)-1; print pack(H.$l,$_);' > "${file}.replaced"
done < filelist

'od -t x1' dumps the contents of the (ASCII) input as 1-byte hexadecimal values; check 'man od' for details, or try it out on the command line.
The second awk command removes the first column, which is just the offsets printed by 'od', and also strips all spaces. The perl command then determines the length of the hex string on each line and uses 'pack' to convert it back to ASCII. (This is basically 'pack(H32,$_)', except for the last line, which may be shorter, in which case pack(H32,$_) would pad it with \0. To avoid this mess at the end, the length of the string is determined.)
Output into .replaced so that you don't mess up your originals and can compare.

verge · June 16, 2011, 2:40pm

Mirni!!

This worked like a charm! It solved the hex issue and created the newlines really well.

I'm not familiar with awk ... I tried to change up the if statement but it wouldn't work

while read c106 c107 c108 ; do  #and store them in variables
      if [ $c107 = 47 && $c108 = 53 ] ; then #char107 is 'G' and #char108 is 'S' sub
         sed -i "s/$c106/0a/g" hexfile #replace all in hexfile
      fi
done

ideally I'd like to also add if c109 is non-alphanumeric [^[:alnum:]]

how would I do that?

mirni · June 16, 2011, 6:11pm

Glad it worked.
c109 is always alphanumeric, because it's a hex code of the character (the whole file 'hexfile' that awk filters is made of hex codes).

So you want to get char109 from the original file and test that for alnum:

while read -r file ; do
    c109=`awk 'NR==1{print substr($0,109,1)}' "$file"` #get char 109 from the first line of current file
    if [[ "$c109" = [^[:alnum:]] ]] ; then #only go on if c109 is not alnum
       od -v -t x1 "$file" > hexfile   #-v: do not collapse repeated lines into '*'
       awk 'NR==7{print $11 " " $12 " " $13; exit}' hexfile |
       while read -r c106 c107 c108; do
          if [ "$c107" = 47 ] && [ "$c108" = 53 ] ; then
             sed -i "s/$c106/0a/g" hexfile
          fi
       done #end while
       awk '{$1=""; gsub(/ /,"")}1' hexfile | perl -ne '$l=length($_)-1; print pack(H.$l,$_);' > "${file}.replaced"
    fi  #end c109 condition
done < filelist

This will only work correctly if the original file has no newline characters before position 109, since the awk command that grabs c109 operates only on the first line (NR==1).
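
One way around the first-line limitation is to read byte 109 straight from the file with od, so newlines earlier in the file don't matter. This is a sketch under assumptions: byte_is_non_alnum is a made-up helper name, od is assumed to accept -An, -j and -N, and the check covers ASCII letters and digits only (the [[:alnum:]] class can depend on the locale).

```shell
# Hypothetical helper: succeed (exit 0) when byte number $2 of file $1
# exists and is not an ASCII letter or digit.
byte_is_non_alnum() {
    hex=$(od -An -v -t x1 -j "$(($2 - 1))" -N 1 "$1" 2>/dev/null | tr -d ' \n')
    [ -n "$hex" ] || return 1          # file is shorter than $2 bytes
    case $hex in
        3[0-9])             return 1 ;;   # 0-9
        4[1-9a-f]|5[0-9a])  return 1 ;;   # A-Z
        6[1-9a-f]|7[0-9a])  return 1 ;;   # a-z
        *)                  return 0 ;;
    esac
}

# Demo on a made-up file with a newline before byte 109.
demo=$(mktemp)
printf 'a\n%106s`rest' '' > "$demo"    # byte 109 is a backtick (hex 60)

if byte_is_non_alnum "$demo" 109; then
    echo "byte 109 is not alphanumeric, so the file would be processed"
fi
```

In the loop above, the test on c109 could then become: if byte_is_non_alnum "$file" 109 ; then ... fi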

verge · June 16, 2011, 9:37pm

Thanks so much mirni ... I really appreciate it.

Now that I've seen your code, this solves more than one issue for me. If c109 is present then yes ... all the code needs to be executed.

This saves time because the field will be empty for files that don't require this code.

Thanks!

Brackets! But of course ...

---------- Post updated at 06:37 PM ---------- Previous update was at 06:28 PM ----------

Can you help me figure out the next part of my script? How can I use awk to print a new line for every occurrence of AK2 (plus the other variables)? I can show you the awk I've already started; it isn't quite right: it only gives me one line with the first AK2 ($2) and skips the others.

INPUT looks like this:
ISA~00~          ~00~          ~ZZ~RRRR           ~ZZ~FFF FIAC       ~110611~2215~U~00301~000002391~0~P~>
GS~FA~RRRR MFG~FFFXMFG~110611~2215~1847~X~002000
ST~997~1751
AK1~PO~970
AK2~830~000031588 #I want to print the other variables for every occurrence of AK2
AK5~A
AK2~830~000031589
AK5~A
AK2~830~000031590
AK5~A
AK2~830~000031607
AK5~A
AK9~A~186~186~186
SE~376~1751
GE~1~1847
IEA~1~000002391
 
I'd like the OUTPUT to look like this:
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031588,,,,,,A,,A,186,186,186,,
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031589,,,,,,A,,A,186,186,186,,
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031590,,,,,,A,,A,186,186,186,,
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031607,,,,,,A,,A,186,186,186,,

mirni · June 17, 2011, 4:23pm

I'm not sure I understand what you mean there... or what exactly is the logic of extraction, but let me try:

Your input seems to have fields separated by '~'. awk does the processing line-by-line, so you can tell it to print 3rd field of each line starting with 'AK2' like this:

awk -F~ '/^AK2/{print $3}' input

Now to put it together ad-hoc, something like this could be done by combining more pattern-action rules:

$ awk -F~ '
  /^ISA/{out=$7","$9","$14}
  /^GS/{out=out","$2","$3","$7}
  /^AK1/{out=out","$2","$3}
  /^AK2/{print out","$2","$3}
'  input
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031588
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031589
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031590
RRRR           ,FFF FIAC       ,000002391,FA,RRRR MFG,1847,PO,970,830,000031607

It works like this:
/regexPattern/{action to do when line matches regexPattern}
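
The same pattern-action style extends to values that only appear after the AK2 line. As a sketch, assuming each AK2 segment is followed by its own AK5 segment (as in the sample input), the row can be remembered at AK2 and printed at AK5. The exact column layout of the desired output, with its empty fields, is not spelled out in this thread, so this only appends the AK5 code. The ak_rows wrapper and the shortened sample file are made up for illustration.

```shell
# Hypothetical wrapper around the awk program; reads '~'-delimited segments.
ak_rows() {
    awk -F'~' '
      /^ISA/ { head = $7 "," $9 "," $14 }           # fields 7, 9 and 14 of ISA
      /^GS/  { head = head "," $2 "," $3 "," $7 }
      /^AK1/ { head = head "," $2 "," $3 }
      /^AK2/ { row = head "," $2 "," $3 }           # remember, do not print yet
      /^AK5/ { print row "," $2 }                   # AK5 closes the current AK2
    ' "$1"
}

# Made-up sample, shortened from the input in the question.
input=$(mktemp)
cat > "$input" <<'EOF'
ISA~00~ ~00~ ~ZZ~RRRR~ZZ~FFF FIAC~110611~2215~U~00301~000002391~0~P~>
GS~FA~RRRR MFG~FFFXMFG~110611~2215~1847~X~002000
ST~997~1751
AK1~PO~970
AK2~830~000031588
AK5~A
AK2~830~000031589
AK5~A
AK9~A~2~2~2
EOF

ak_rows "$input"
```

Values from AK9 arrive after all the AK2 lines, so including them would mean storing the rows in an array and printing them in an END block instead.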

verge · June 28, 2011, 5:22pm

Thanks for your time and solution mirni ... I really appreciate both!

I apologize for not explaining the logic more clearly, and yet you understood exactly what I meant.

Both of your solutions worked really well.

Thanks again

mirni · June 28, 2011, 5:33pm

Glad it worked
No apologies needed -- we all learn along the way...
Cheers