Search & Replace regex Perl one liner to AWK one liner

Thanks for giving your time and effort to answer questions and helping newbies like me understand awk.

I have a huge file (millions of lines) and perl takes quite a bit of time on it, so I'd like to convert these perl one-liners to awk.

Basically, I'd like every ISA that is sandwiched between non-word characters to start on its own line.

Then I'd like to remove the first non-word character in front of those "sandwiched" ISAs or, put another way, put the "sandwiched" ISAs at the beginning of the line.

perl -pi -e 's/[\W_]ISA[\W_]/\n$&/g' large_file 
perl -pi -e 's/^[\W_]ISA/ISA/g' large_file

How would I do this in awk? Thanks so much for help, I really do appreciate it. Please let me know if I can explain this more clearly or if you need data examples.

Thank you!!!!

Please post a few sample lines.

Hi and thanks Danmero!

Here are a few sample lines ... I only want the lines with red ISA on a new line, not the ones in purple ISA ... I know it's a bit messy ... I can explain the logic/syntax of the file, if you'd like

          ISA~00~          ~00~          ~ZZ~SEND  MFG       ~ZZ~RECV MFG       ~110616~2235~U~00200~000003972~0~P~\
GS~FA~SEND  MFG~RECV MFG~20110616~2235~4075~X~004010
ST~997~00001
AK1~SH~4075
AK2~856~000008260
AK5~AISATF
AK9~A~00001~00001~00001
SE~006~00001
GE~00001~4075
IEA~00001~000003972&ISA!00!SEND DATA  !00!SEND DATA  !ZZ!SEND  PDCPO     !ZZ!RECV            !110616!1540!U!00401!000009564!0!P!:
GS!FA!SEND  PDCPO!RECV!20110616!1540!9669!X!004010
ST!997!000021081
AK1!SH!12738
AK9!A!1!1!1
SE!4!000021081
GE!1!9669
IEA~1~00000956`ISA~00~SEND DATA  ~00~SEND DATA  ~ZZ~SEND  PDCPO     ~ZZ~RECV            ~110616~1540~U~00401~000009565~0~P~:>GS~FA~SEND  PDCPO~RECV~20110616~1540~9670~X~004010>ST~997~000021082>AK1~SH~12739>AK9~A~1~1~1>SE~4~000021082>GE~1~9670>IEA~1~000009565

---------- Post updated 07-06-11 at 11:04 AM ---------- Previous update was 07-05-11 at 05:38 PM ----------

Thought I'd add some details on the file.

ISA, GS, ST, AK1, AK2, AK5, AK9, SE, GE, IEA are line headers and generally follow the same order. ISA is the beginning of the record, IEA is the end of the record. There are tens of thousands of records in a given file.

The file also has non-word character field separators (i.e. ~ and !) and line separators (either a newline or a non-word character; later an awk script will change all [\W] to newlines).

Try:

awk '{gsub("[^a-zA-Z]ISA[^a-zA-Z]","\n&")}1' file

and

awk '{sub("^[^a-zA-Z]ISA","ISA")}1' file
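If you'd rather make a single pass over the file, the two steps can probably be folded into one awk call. This is only a sketch: split_isa is a made-up helper name, and [^a-zA-Z0-9] is used as an ASCII stand-in for perl's [\W_], so check it against your real data first.

```shell
# split_isa is a hypothetical helper name used only for this sketch.
split_isa() {
  awk '{
    # Step 1: put a newline in front of every delimiter+ISA+delimiter run.
    gsub(/[^a-zA-Z0-9]ISA[^a-zA-Z0-9]/, "\n&")
    # Step 2: walk the pieces and drop the delimiter that now leads each ISA.
    n = split($0, part, "\n")
    if (n == 0) { print; next }   # keep empty input lines as they are
    for (i = 1; i <= n; i++) {
      sub(/^[^a-zA-Z0-9]ISA/, "ISA", part[i])
      print part[i]
    }
  }' "$@"
}

# Small check modelled on the sample lines: AISATF must stay untouched.
printf 'IEA~1~0001&ISA!00!SEND\nAK5~AISATF\n' | split_isa
```

The split on "\n" is there because, after the gsub, the new lines still live inside one awk record, so a plain sub with ^ would only see the start of the original line.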

Thanks so much bartus11 ... I really appreciate your time ... I see where I went wrong initially, I didn't use quotes and used sub instead of gsub

awk -mr 99999999 '{gsub("[^a-zA-Z0-9]ISA[^a-zA-Z0-9]","\n&")}1' junemthlyob >> junemthlyob.a

One of the ISA lines has 3163417 characters and I'm getting this error, do you have any suggestions on how to overcome this?

awk: gsub() result
 ISA`00`FT too big
 input record number 30369, file /dev/fs/C/UNIX/SFU/USER/IBOBX12/JUNEOB2/junemthlyob
 source line number 1

What system are you using?

Hi Bartus11 ... I'm using Windows Services for UNIX (Interix), Korn shell

Do you have any way to test that code on a Linux machine? I ran a simple test with a file containing one line of 4 million random characters, and it ran successfully on Linux:

[root@rhel ~]# wc -c big2
4079997 big2
[root@rhel ~]# wc -l big2
1 big2
[root@rhel ~]# grep -o ".ISA." big2
|ISA|
[root@rhel ~]# awk -mr 99999999 '{gsub("[^a-zA-Z0-9]ISA[^a-zA-Z0-9]","\n&")}1' big2 > big3
[root@rhel ~]# wc -l big3
2 big3

As you can see big2 got cut into two lines as a result of that one-liner. I would really recommend trying Linux for that task.
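To repeat a test like this on whatever awk you end up with, something along these lines should do. It is a rough sketch: the temp files and the filler character are arbitrary choices, and the -mr option is left out on the assumption that the awk in use does not need it.

```shell
# Scratch files for the experiment (names come from mktemp, nothing special).
big=$(mktemp)
out=$(mktemp)

# Build a single line of 4,000,005 characters:
# 2,000,000 x's, then |ISA|, then another 2,000,000 x's.
awk 'BEGIN {
  chunk = sprintf("%1000s", ""); gsub(/ /, "x", chunk)
  for (i = 0; i < 2000; i++) printf "%s", chunk
  printf "|ISA|"
  for (i = 0; i < 2000; i++) printf "%s", chunk
  printf "\n"
}' > "$big"

# Same substitution as in the thread, run against the long record.
awk '{gsub(/[^a-zA-Z0-9]ISA[^a-zA-Z0-9]/, "\n&")}1' "$big" > "$out"

wc -l < "$big"   # line count going in
wc -l < "$out"   # one more line if the long record was handled
```

If the second count is not one higher than the first, or awk aborts with a "too big" style error, that awk build is likely hitting a record or buffer limit.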

Ok will do ... which flavor of Linux do you recommend ... Ubuntu? Fedora? something else?

---------- Post updated at 03:39 PM ---------- Previous update was at 03:37 PM ----------

PS: I don't have access to a unix box ... I'll have to install/run linux in windows

Any reasonably up-to-date distribution will do. For ease of installation and later usage I would recommend Ubuntu installed in VirtualBox :)