How to remove the delimiter from the column value within a file?

dJHa · September 27, 2016, 7:18am

Hello All,

we have some 10 files wherein we are using the ASCII NULL as separator which is nothing but '^@' and we need to change it to pipe delimited file before loading to database. Most of the data seems to be fine but there are instances where this separator tends to appear in the middle of column value like email address "asbc^@gmail.com

Below is the example of data, where second rows seems to have '^' in email address.

deptno dname location email
1^@Sales^@NY^@abc@gmail.com
2^@marketing^@WA^@xyz^@gmail.com
3^@Inventory^@NJ^@pqr@gmail.com

we have manage to convert the data into pipe separated file but the problem is how to you tackle this extra '^

cat -v raw_file.dat | sed 's/\^@/|/g' > parsed_raw_file.dat

Any help would be great.
Thanks
Djha

RudiC · September 27, 2016, 7:35am

I'm afraid this problem is not that easy to solve as it seems to come from a data/data entry error in the input file. We need to unambiguously identify the address field. Will records always consist of the same number of fields? Will the e-mail-addresses always be in the records' last field? Will they be the only fields containing a dot, or a common domain or set of domains?

Depending on what OS and sed version you use (which you fail to mention in your post), this might work for the conversion (without solving your problem yet):

sed 's/\x00/|/g' file
deptno dname location email
1|Sales|NY|abc@gmail.com
2|marketing|WA|xyz|gmail.com
3|Inventory|NJ|pqr@gmail.com

Don't you need the header line use the same separators?

dJHa · September 27, 2016, 9:27am

Hi Rudic,

Thanks for your comments. we have got 10 files and all the files will have different number of columns and this ^ issue can appear in any column within the file. I know I am making things a bit complicated but thats how our requirements are.

Thanks
Dharmendra

Tha

RudiC · September 27, 2016, 9:30am

Unless you give people something to work upon, I'm afraid we're out of options.

gandolf989 · September 27, 2016, 9:50am

You might be able to do a three pass solution where you change the ^@ in the email addresses first to something temporary, then change everything else, then go back and change the temporary characters to what you want. Are all of the email addresses .com? or are there any .edu or .org. That might help to simplify the problem.

RudiC · September 27, 2016, 9:59am

Why not just replace ^@ with @ ? The crucial thing is to identify the address fields.

rbatte1 · September 27, 2016, 11:54am

Is the data shown the actual data where the separator is two characters ^ followed by @ , or a screen scrape of the data where it is a single character being displayed as ^@ ?

Could you do this for us:-

head -1 raw_file.dat ; head -1 raw_file.dat | od -x

That should give us the actual text line and what it appears to be, otherwise we're left guessing what you have and we may spend ages going the wrong way.

Kind regards,
Robin

Don_Cragun · September 27, 2016, 4:22pm

It is the cat -v that makes ^@ ambiguous. If you want to change null bytes in a file to pipe symbols, use tr ; not cat and sed :

tr '\0' '|' < raw_file.dat > parsed_raw_file.dat

But, of course, if your data files contain other null bytes (such as in numeric values stored as type int or float instead of in strings), you will need something that actually parses the format of your .dat files and only changes the null bytes that are field delimiters in your data to pipe symbols.

blastit.fr · September 27, 2016, 4:52pm

Hi,
The tr command is THE solution is this case : it can translate easily any character like NUL , which is \0 for unix.
See :
( od -c is just here to show where the NUL is )

$ od -c n.tmp
0000000   a   a   a   a   \  \0   b   b   b  \n
0000012

$ tr '\0' '|' < n.tmp > n.txt

$ od -c n.txt
0000000   a   a   a   a   \   |   b   b   b  \n
0000012

blastit.fr · October 4, 2016, 5:55pm

Hi,
You can just use tr to replace null chars with anything you want , e.g. |
See:

$ od -c 1.tmp
0000000   w   e       h   a   v   e       n   u   l   l      \0  \0  \0
0000020       c   h   a   r   s  \n
0000027

$ tr \\000 \| < 1.tmp
we have null ||| chars