Need to strip control-A characters from a column in a file

Hi All,

I currently have a flat file with 32 columns. The field delimiter is Ctrl-A (\x01). The file has been extracted from an Oracle table by a DataStage job. However, in the 6th field, the data contains additional Ctrl-A characters which came in as part of the table data.

I need some help removing these Ctrl-A characters from just this 6th field alone.

I tried using sed to replace the first 5 delimiters and the last 24 delimiters with another delimiter, such as |, and then used tr to strip off the remaining Ctrl-A characters. But it is taking too long. Any help is appreciated.

Can we see a sample of the file?
What OS are you using?
Which shell is preferred?

If it is taking too long then how big is this _flat_file_?

Here is a sample. I am using ',' as the field delimiter instead of Ctrl-A.

1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0

In the above example, the values "IND,IA" and "CH,INA" are coming from the table.

The files are in .gz format and the sizes are around 12 GB each.

You could produce your own test file like this:

$ printf "%s\x01" {1..31} > infile
$ printf "32\n" >> infile
$ printf "%s\x01" {1..5} 6{A..E} {7..31} >> infile
$ printf "32\n" >> infile

Setting aside that it was the 6th column in the OP: you never mentioned that (in this sample) the third column might or might not contain this delimiter.
Is this a random event, or is it always exactly one extra delimiter per field, as in your example?
Can we see your attempt, please?

These rows are random. I used sed to convert the initial delimiters to another value and then tried to strip the additional characters from the required column.

 
echo "1,A,USA,0" > test_input.dat
echo "2,B,GERMANY,0" >> test_input.dat
echo "3,C,IND,IA,0" >> test_input.dat
echo "4,D,CH,INA,0" >> test_input.dat
 
sed -i 's/,/|/' test_input.dat                 ## for the first delimiter ##
sed -i 's/,/|/' test_input.dat                 ## for the 2nd delimiter ##
rev test_input.dat > test_input.dat_rev        ## reverse the record, since there may be multiple additional delimiters in the problematic column ##
sed -i 's/,/|/' test_input.dat_rev             ## for the last delimiter in the original record ##
sed -i 's/,//g' test_input.dat_rev             ## remove the additional delimiters in the required column ##
rev test_input.dat_rev > test_input.dat        ## reverse the file back to its original form ##
sed -i 's/|/,/g' test_input.dat                ## replace the new delimiters with the original delimiters ##
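Each `sed -i` pass above rewrites the whole file on disk, which is where the time goes. Under the same assumptions (the 4-field comma sample, the problem in field 3, and `rev` available), the same steps can be chained in a single pipeline so the data is read and written only once:

```shell
# One pass: replace the first two delimiters, reverse, fix the last
# delimiter, strip the extras, reverse back, and restore the delimiters.
sed 's/,/|/;s/,/|/' test_input.dat |
rev |
sed 's/,/|/;s/,//g' |
rev |
sed 's/|/,/g' > test_input.fixed
```

On the sample above this turns 3,C,IND,IA,0 into 3,C,INDIA,0 while leaving the well-formed 4-field lines untouched.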
 

try something like this:

awk -F$(printf '\x01') '
NF>32{
   E=NF-32
   for(i=7;i<7+E;i++) $6=$6$i
   for(i=7;i<=32;i++) $i=$(i+E)
   NF=32
} 1' OFS=$(printf '\x01') infile

Using builtins, hardcoded for 5 (4 + 1) fields. Not sure how long this will take on such a huge single file, though...
Your version would need IFS=$'\001' and would be hardcoded for 33 (32 + 1) fields; escaped line breaks would also be needed for your version.
HW I/O will be a huge hit...
OSX 10.7.5, default bash terminal.
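One cheap thing worth trying at this scale, assuming a GNU or BSD userland: force the C locale so awk and sed match bytes instead of doing locale-aware character handling, which often speeds up single-byte-delimiter work noticeably. A sketch using the comma sample (the awk body mirrors Chubler_XL's approach, hardcoded for 4 fields):

```shell
# Run the field-merging awk byte-wise under the C locale (comma-sample sketch)
printf '3,C,IND,IA,0\n' | LC_ALL=C awk -F, '
NF > 4 {
    E = NF - 4                               # number of extra delimiters
    for (i = 4; i < 4 + E; i++) $3 = $3 $i   # glue the split pieces back onto field 3
    for (i = 4; i <= 4; i++)    $i = $(i + E) # shift the trailing fields left
    NF = 4                                   # truncate to the desired field count
} 1' OFS=,
```

The same `LC_ALL=C` prefix can go in front of the real 32-field invocation unchanged.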

#!/bin/bash
# Generate the test data: the 4-line sample repeated 27 times.
for ((n = 0; n < 27; n++))
do
	echo '1,A,USA,0
2,B,GERMANY,0
3,C,IND,IA,0
4,D,CH,INA,0'
done > /tmp/flatfile
# > /tmp/newflatfile
# IFS is not saved/restored for this test; using the hex value of ",".
IFS=$'\x2C'
while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then

		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}\n" # >> /tmp/newflatfile
	else
		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}\n" # >> /tmp/newflatfile
	fi
done < /tmp/flatfile
# cat /tmp/newflatfile

Results:-

Last login: Wed May 20 08:01:31 on ttys000
AMIGA:barrywalker~> cd Desktop/Code/Shell
AMIGA:barrywalker~/Desktop/Code/Shell> chmod 755 ff.sh
AMIGA:barrywalker~/Desktop/Code/Shell> ./ff.sh
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0
[... the same four lines repeated 26 more times ...]
AMIGA:barrywalker~/Desktop/Code/Shell> _

I don't understand your logic here. If you have a file with 32 fields (or columns), then there should be 31 delimiters separating those fields. But, in the last paragraph of your description you talk about saving the 1st 5 delimiters and the last 24 delimiters. The 5 + 24 delimiters that you are saving would work if your input file had 30 fields; not 32 fields.

Not having any real sample data means that we can only guess at which number in the above is wrong.

It looks like wisecracker's code will work as long as there is no more than one added delimiter in the problem field (and that is all you demonstrated in your example), but I think you're saying that there can be zero or more unwanted delimiters in the problem field. (And, as he said, his script is easy for a 4 field test file, but gets awkward when you extend that logic to 30 or 32 fields.)

I tried using Chubler_XL's code (with 32 changed to 4 globally, and the printf "\x01" calls changed to printf "," globally) with a test file I set up similar to your sample using commas, and didn't get the results I was expecting.

Perhaps this alternative approach will help:

#!/bin/ksh
# Define real field delimiters and number of delimiters that should appear
BadField=6		# Field that may contain delimiters in data
Delim=$(printf '\x01')	# Delimiter character
Nfields=32		# # of desired fields 
Unused='|'		# A character that never appears in the data

# Fake values for sample input...  Remove these lines when processing real data.
BadField=3
Delim=","
Nfields=4

# awk script to clean up field # BadField...
awk -v B="$BadField" -v D="$Delim" -v N="$Nfields" -v U="$Unused" '
BEGIN {	DERE = "[" D "]"
	UERE = "[" U "]"
}
{	n = gsub(DERE, U) # Get delim count and change them to unused chars.
	for(i = 1; i < B; i++)
		sub(UERE, D)	# Change one initial unused char back to delim
	for(i = n - N + 1; i > 0; i--)
		sub(UERE, "")	# Delete one unused (extra delim) from field B
	gsub(UERE, D)		# Change remaining unused chars back to delim
}
1				# Print updated lines
' file				# Specify input file

and a sample input file named file containing:

1,A,USA,0
2,B,GERMANY,0
3,C,IN,DIA,1
4,D,CHI,NA,1
5,E,A,B,C,D,E,F,G,6

it produces the output:

1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,1
4,D,CHINA,1
5,E,ABCDEFG,6

which seems to be what you are trying to do with your sample.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk.

Although written and tested using a Korn shell, this should work with any shell that recognizes basic POSIX shell command substitution and parameter expansion syntax (e.g., ash, bash, dash, ksh, zsh, and many others; but not csh and its derivatives, and not an original Bourne shell).

If this does do what you want with the sample data, remove the code shown in red (the three lines under the "Fake values" comment), verify that the remaining settings for BadField and Nfields are correct, and it should work for your files with Ctrl-A as the field delimiter. Obviously, you need to unzip your input files and re-zip the output produced.
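Since the inputs are gzipped, there is no need to materialize the 12 GB uncompressed files on disk at all: the awk fix can sit in a pipeline between gzip -dc and gzip. A sketch, assuming the awk program above has been saved as fixfield.awk (a hypothetical filename):

```shell
# Stream: decompress -> clean up field 6 -> recompress, with no temp file
gzip -dc infile.gz | awk -f fixfield.awk | gzip -c > outfile.gz
```

This keeps only one pass of disk I/O per file, which matters far more than the CPU cost of the awk script itself.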


Seems to work OK for me:

$ cat > infile2 <<EOF
> 1,A,USA,0
> 2,B,GERMANY,0
> 3,C,IND,IA,0
> 4,D,CH,INA,0
> EOF

$ awk -F$(printf ',') '
> NF>4{
>    E=NF-4
>    for(i=4;i<4+E;i++) $3=$3$i
>    for(i=4;i<=4;i++) $i=$(i+E)
>    NF=4
> } 1' OFS=$(printf ',') infile2
1,A,USA,0
2,B,GERMANY,0
3,C,INDIA,0
4,D,CHINA,0

Might be easier if I replace the hard-coded field numbers with variables:

  • STRIP=field number that contains extra FS chars and needs to be joined
  • FLDS=Required number of fields
awk -F$(printf ',') '
BEGIN{ FLDS=4; STRIP=3 }
NF>FLDS{
   E=NF-FLDS
   for(i=STRIP+1;i<STRIP+1+E;i++) $(STRIP)=$(STRIP)$i
   for(i=STRIP+1;i<=FLDS;i++) $i=$(i+E)
   NF=FLDS
} 1' OFS=$(printf ',') infile

Hi Chubler_XL,
Yes, sorry. I didn't get nearly enough sleep the last couple of nights. I missed changing the 7s.

  • Don

Hi Don.

I have only just noticed that the OP might have multiple extra delimiters in the same field, but I was working from the OP's post #3.

The problem is not the coding but the time factor to check such a huge file when the extra coding is needed to test for this/these new requirement(s).

Maybe builtins are not the way to go, but I will rewrite using them when I get home from work tonight...

Thanks for your comments.

Bazza...

I completely agree that the submitter could have made it more clear whether only one "additional" field separator could be present in the line (since the examples only showed one), but the original post sounded to me like zero or more field separators could appear as part of the data in field 6 (which for some unknown reason was field 3 in the examples).

One thing you could do to speed up your script and make it more reliable would be to change:

while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then

		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}\n" # >> /tmp/newflatfile
	else
		printf "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}\n" # >> /tmp/newflatfile
	fi
done < /tmp/flatfile

to:

while read line
do
	line=($line)
	if [ ${#line[@]} -eq 5 ]
	then	printf '%s\n' "${line[0]}$IFS${line[1]}$IFS${line[2]}${line[3]}$IFS${line[4]}"
	else	printf '%s\n' "${line[0]}$IFS${line[1]}$IFS${line[2]}$IFS${line[3]}" 
	fi
done < /tmp/flatfile # >>/tmp/newflatfile

Adding the format strings to the printf statements protects your script in case an input line contains any percent-sign or backslash characters. And, moving the redirection to the end of the loop (assuming that the # characters commenting out the redirections will be removed at some point), instead of on each printf in the loop, should speed things up. The open() and close() calls are fast when compared to the fork() and exec() needed to invoke an external utility, but doing millions of them to write a multi-GB file when only one of each is needed will make a significant difference in your script's running time.
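A quick illustration of the first point, with a value that happens to contain a backslash sequence (any POSIX printf behaves this way):

```shell
line='a\nb'             # data containing a literal backslash-n
printf "$line\n"        # data used AS the format: the \n inside it is expanded, printing two lines
printf '%s\n' "$line"   # data passed as an argument to %s: printed literally as a\nb
```

The same hazard applies to any % in the data, so the '%s\n' form is the safe habit whenever the text being printed is not a constant.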

Note also that the submitter hasn't given any indication of what OS or shell are being used. The awk utility is universally available on UNIX-like systems. But, since array handling is not required in the shell by the standards, I tend to avoid using shell arrays in suggestions until I've determined that the submitter is using a shell that supports arrays. (This is just a personal preference. There is no reason why you should avoid arrays in code you suggest as long as you specify what shell you're using, as you always do.)
