Replace char between chars - help needed

Hello,

I have a csv file with "^" as text delimiters and "|" as field delimiters. It's converted from a xls file. One record looks like this:

 ^Tablete Internet^|Archos|501838|^Tableta Internet ARCHOS 80 G9
...| ...
(more lines)
... "501|838"^|330.00|USD|sl|12|0|Link|^router wireless 150 Mbps^|^5 ron^|more|^other info^|

I have bolded the "|" which is delimiting the actual fields. I need to replace the "|" and newlines which can apear multiple times in the 4th field which is also delimited by the text delimiters "^".

The output must be a csv file with | as delimiters and without any new line in the 4th field so I can insert the fields into a mysql database.

My best result by now is:

awk -F '[\^\^]' '{gsub("\|",";",$2); $2; print}'

but that does not remove the new lines... and also changes some "|" needed in the last fields.

Any ideas?

I forgot to mention that i've also tried with sed and perl but with no luck.

why can't u just try to remove all the ^ symbols??

won't that looks like a file with just | being the delimiter? just a clarification!

Because I'll have too many fields then... the 4th field will be splitted in many...
I'll do that only after I have replaced the "|" from the 4th field...

Considering your 4th field to be

 ^Tableta Internet Archos ....^ 

I don't think this would cause any issues!!! Please correct me if I am wrong

My 4th field is:

|^Tableta Internet ARCHOS 80 G9
...| ...
(more lines)
... "501|838"^|

don't forget that this is only one record which splits on multiple lines...

And how do you make out the end/beginning of any record?

All records end with a new line. Each record has 12 fields separated by "|". If the field contains multiple words with spaces or new lines, then is enclosed with "^".
My problem is I'm not so experienced with regular expresions. I never neded to parse a field from a record wich splits multiple lines...

A quick link while searching through forum

remove newline characters between two delimiters

hope this helps you!! :slight_smile:

Thank you Pikk45! One problem solved!

Using

awk -F'\^\|' '{while(NF<4 && getline p)$0=$0 p}1'

on the file I can remove max 4 new lines on the 4th field.
The delimiter used is ^| which is at the end of the 4th field.

Now I must replace the "|" between the "^"...

---------- Post updated at 12:15 PM ---------- Previous update was at 09:47 AM ----------

I have come to the:

awk ' {for (i=1;i<=NF;i++) if (i!=NF) {printf (5<i && i<(NF-1))?$i"-":$i"|"} print $NF} ' FS='|' OFS='^'

but this replaces all "|" till the end... how can I tell awk to stop replacing at first "^|"?

---------- Post updated at 04:19 PM ---------- Previous update was at 12:15 PM ----------

Hello,

I finally found the answer. It's not so pretty, but it works and is extremely fast (a few seconds) on a almost 2Mb file:

cat csv_in_file | awk -F'\^\|' '{while(NF<4 && getline p)$0=$0 p}1' \
| sed 's/%/_procent_/g'\
| awk '{for (i=1;i<=NF;i++) if (i!=NF) {printf (3<i && i<(NF-1))?$i"!":$i"|"} print $NF}' FS='|' OFS='^' \
| perl -npe 's/^(.*)!(.*)!(.*)!(.*)!(.*)!(.*)!(.*)!(.*)!(.*)!(.*)!(.*)$/$1|$2|$3|$4|$5|$6|$7|$8|$9|$10|$11/' \
| sed 's/_procent_/%/g' > csv_out_file

I'll try to explain what every command does:

  • first awk sets the field delimiter to "^|" and deletes at most 3 new lines (I've discovered I have at most 4 lines of data in the 4th record = 3 new lines)
  • the sed replaces the "%" with "_precent_" because I found that it breaks somehow the next awk command...
  • the second awk command replaces all the "|" symbols beginning with the 4th field with "!" (I'm sure there will never be "!" in my last fields)
  • the perl command replaces the last "!" with "|"
  • the last sed replaces back the "%"

The result is a csv file with pipe as a delimiter wich I can use...

Maybe there is a more elegant solution but this works and is damn fast compared to the windows excel.
Hope this will help others too.

Can you post few lines of your csv file so that it we can analyze and try this in a single line?

This looks like a long long command, but yes it is indeed faster! :slight_smile:

I've selected 3 lines from the file:

^TV Tuner^|LogiLink|VG0011|^Stick USB pt. captura audio-video, Logilink ""VG0011""^|18.50|EUR|s|24|0|Link||scutit|||
^Tablete Internet^|GoClever|^TAB I71^|^Tableta GoClever TAB I71, InfoTMIC iMAPx220 1GHz (ARM11, GPU GC600), Android 2.3, 256MB DDR2, 4GB MLC NAND flash, 7" LCD TFT 800x480 Backlight - LED. 16:9 16bit culori.
Resistiv single-touch, Wi-Fi (802.11 b/g), AVI, 3GP, MP4, RM, RMVB, FLV, MKV, MOV, VOB, DAT, WMV, AVI,
 3GP, MP4, RM, RMVB, FLV, MKV, MOV, VOB, DAT, WMV, MP3|WAV|OGG|FLAC|APE|AAC, Conectori: 1x mini jack 3.5 mm,
 2 porturi USB 2.0, 4000 mAh lithium-ion polymer 3.7V^|89.00|USD|w|24|0|Link||^5 ron^|||
^Notebook - accesorii^|DeepCool|N17|^Stand notebook DeepCool 14"" - metal, fan, USB, dimensiuni 330X250X25mm, dimensiuni Fan 140X140X15mm, Fan Speed 100010%RPM, Max.
 Air Flow 47.35CFM, zgomot 21dB(A), Hydro Bearing, ''N17''^|12.20|USD|s|12|0|Link|^Promo achizitii saptamanale^|scutit|||

The source xls is UTF-8, if this matters.