we have some 10 files wherein we are using the ASCII NULL as separator which is nothing but '^@' and we need to change it to pipe delimited file before loading to database. Most of the data seems to be fine but there are instances where this separator tends to appear in the middle of column value like email address "asbc^@gmail.com
Below is the example of data, where second rows seems to have '^' in email address.
I'm afraid this problem is not that easy to solve as it seems to come from a data/data entry error in the input file. We need to unambiguously identify the address field. Will records always consist of the same number of fields? Will the e-mail-addresses always be in the records' last field? Will they be the only fields containing a dot, or a common domain or set of domains?
Depending on what OS and sed version you use (which you fail to mention in your post), this might work for the conversion (without solving your problem yet):
sed 's/\x00/|/g' file
deptno dname location email
1|Sales|NY|abc@gmail.com
2|marketing|WA|xyz|gmail.com
3|Inventory|NJ|pqr@gmail.com
Don't you need the header line use the same separators?
Thanks for your comments. we have got 10 files and all the files will have different number of columns and this ^ issue can appear in any column within the file. I know I am making things a bit complicated but thats how our requirements are.
You might be able to do a three pass solution where you change the ^@ in the email addresses first to something temporary, then change everything else, then go back and change the temporary characters to what you want. Are all of the email addresses .com? or are there any .edu or .org. That might help to simplify the problem.
Is the data shown the actual data where the separator is two characters ^ followed by @ , or a screen scrape of the data where it is a single character being displayed as ^@ ?
Could you do this for us:-
head -1 raw_file.dat ; head -1 raw_file.dat | od -x
That should give us the actual text line and what it appears to be, otherwise we're left guessing what you have and we may spend ages going the wrong way.
It is the cat -v that makes ^@ ambiguous. If you want to change null bytes in a file to pipe symbols, use tr ; not cat and sed :
tr '\0' '|' < raw_file.dat > parsed_raw_file.dat
But, of course, if your data files contain other null bytes (such as in numeric values stored as type int or float instead of in strings), you will need something that actually parses the format of your .dat files and only changes the null bytes that are field delimiters in your data to pipe symbols.
Hi,
The tr command is THE solution is this case : it can translate easily any character like NUL , which is \0 for unix.
See :
( od -c is just here to show where the NUL is )
$ od -c n.tmp
0000000 a a a a \ \0 b b b \n
0000012
$ tr '\0' '|' < n.tmp > n.txt
$ od -c n.txt
0000000 a a a a \ | b b b \n
0000012