How to remove alphabets/special characters/space in the 5th field of a tab delimited file?

Srithar · May 7, 2014, 4:53pm

Thank you for looking at this post.

We have a tab-delimited file that contains a lot of funny characters. I tried using awk, but it did not work.
The 5th field (ID) of that file is supposed to contain only an integer, but we are getting corrupted data as shown below.
I want to remove the entire corrupted value in the affected rows and replace it with an empty value.

I am not sure what these symbols are or what command can replace these junk characters.

Your suggestions are appreciated.

Example:

record 1 - 
"14"    "50603" "1012"  "123"      "12ռ4�Z{>�}ŪiY2���3�'���>N�C�7>S"      "19-Mar-2014 14:58:26" 
record 2 - 
"14"    "50603" "1012"  "37164455"      "ռ4�Z>S"     "19-Mar-2014 14:58:26"

Output Should be like:

"14"    "50603" "1012"  "123"      ""     "19-Mar-2014 14:58:26"  
"14"    "50603" "1012"  "37164455"      ""      "19-Mar-2014 14:58:26"

Don_Cragun · May 7, 2014, 5:52pm

You didn't say anything about removing numeric characters. Why isn't the output for the 5th field in record 1 "124237"? Why isn't the output for the 5th field in record 2 "4"?
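
To make the two readings of the request concrete, here is a minimal sketch that contrasts them. The function names keep_digits and blank_if_dirty are hypothetical, and the sample record uses plain ASCII junk as a stand-in for the binary garbage in the real file. It assumes every line has at least five tab-separated fields.

```shell
# keep_digits: strip everything except digits from field 5, keeping the quotes.
keep_digits() {
    awk 'BEGIN { FS = OFS = "\t" }
    { f = $5; gsub(/[^0-9]/, "", f); $5 = "\"" f "\""; print }' "$@"
}

# blank_if_dirty: empty field 5 unless it is a quoted run of digits.
blank_if_dirty() {
    awk 'BEGIN { FS = OFS = "\t" }
    { if ($5 !~ /^"[0-9]*"$/) $5 = "\"\""; print }' "$@"
}

# sample: one record with ASCII stand-in junk in field 5 (hypothetical data).
sample() {
    printf '"14"\t"50603"\t"1012"\t"123"\t"12x4#Z{>}iY2~3~7>S"\t"19-Mar-2014 14:58:26"\n'
}

sample | keep_digits      # field 5 becomes "124237"
sample | blank_if_dirty   # field 5 becomes ""
```

Which of the two is wanted depends on whether the digits that survive inside a corrupted field can be trusted.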

rdrtx1 · May 7, 2014, 7:25pm

try:

awk -F"\t" -v OFS="\t" 'NF>4{
  f=$5;
  sub("^\"", "", f);
  sub("\"$", "", f);
  f = (f ~ /^[0-9]*$/) ? f : "";
  $5="\"" f "\"";
  print;
}' infile

RudiC · May 8, 2014, 6:10am

For your request to remove field 5, this might suffice:

awk '{$5="\"\""}1' FS="\t" OFS="\t" file
"14"    "50603"    "1012"    "123"    ""    "19-Mar-2014 14:58:26"
"14"    "50603"    "1012"    "37164455"    ""    "19-Mar-2014 14:58:26"

But, as Don Cragun says, it might be worthwhile to consider repair instead of remove, and to try to track down the cause of the unwanted behaviour.
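
As a starting point for tracking down the cause, the following sketch lists the rows whose 5th field is not a quoted integer. The function name report_dirty is hypothetical. It assumes tab-delimited input, and it relies on cat -v, a common extension that is available on most systems but is not required by POSIX. A header row, if present, is reported too, because its 5th field is not numeric.

```shell
# report_dirty: print the line number and 5th field of every row whose
# 5th field is not a double-quoted run of digits.
report_dirty() {
    # LC_ALL=C makes awk treat the input as raw bytes, so invalid
    # multibyte sequences in the junk cannot confuse the regex match.
    LC_ALL=C awk 'BEGIN { FS = "\t" }
    NF >= 5 && $5 !~ /^"[0-9]*"$/ {
        printf "line %d: field 5 = %s\n", NR, $5
    }' "$@" | cat -v    # show non-printing bytes as ^X and M-x
}

# Usage: report_dirty file
```

The line numbers make it easier to compare the damaged rows with whatever system produced the file.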

Srithar · May 8, 2014, 12:45pm

Hi DON,

We need to empty the entire field if it contains any funny characters. We don't need to keep the numeric characters in a field that contains funny characters, so the 5th field in the output should be "" in both cases.

Thanks !

Don_Cragun · May 8, 2014, 2:22pm

You could try something like:

awk '
BEGIN {	FS = OFS = "\t"
}
/^"/ {	if($5 !~ /^"[0-9]*"$/) $5 = "\"\""
	print
}' file

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk.
If file contains:

record 1 - 
"14"	"50603"	"1012"	"123"	"12ռ4�Z{>�}ŪiY2���3�'���>N�C�7>S"	"19-Mar-2014 14:58:26"	
record 2 - 
"14"	"50603"	"1012"	"37164455"	"ռ4�Z>S"	"19-Mar-2014 14:58:26"
record 3 - 
"1"	"2"	"3"	"4"	"5"	"08-May-2014 11:14:59"

this will produce:

"14"	"50603"	"1012"	"123"	""	"19-Mar-2014 14:58:26"	
"14"	"50603"	"1012"	"37164455"	""	"19-Mar-2014 14:58:26"
"1"	"2"	"3"	"4"	"5"	"08-May-2014 11:14:59"

Srithar · May 8, 2014, 8:02pm

Hi DON,

Thanks for your reply. It helped me a lot, but it does not work for one scenario. If a corrupted record is split across 3 lines (as in the example below), the script empties the 5th field but also drops the fields that follow it. For example:

Input:

Record 1 :
"14"	"50603"	"1012"	"2131609"	"18��#��nz�S^l�����
a��`��Z�/�*��������ˮ7d_�gˉ�RB�nx����R�
9gd,�P�X�O"	"02-May-2014 04:11:54"

Expected Output:

Record 1 :
"14"	"50603"	"1012"	"2131609"	""	"02-May-2014 04:11:54"

Actual Output: (the 6th field is cut from the row)

Record 1 :
"14"	"50603"	"1012"	"2131609"	""

Don_Cragun · May 8, 2014, 10:40pm

Your original problem did not say anything about multi-line records.
Your original problem clearly showed that the lines starting with Record in your input file were not supposed to be copied to your output file, but now you say that those lines should be copied to the output.
The script I gave you works perfectly for any input that you described in your original problem statement.

Before we make another attempt to help you, you need to clearly describe your input file format and what you want to appear in the output. Start by answering the following questions and then add any other information we need to know to help you get code that will do what you want:

Are input lines starting with Record supposed to be copied to the output?
Can <newline> characters appear in any field other than field 5?
Should the output ever have more than one line of output per input (multi-line) record?
Can <tab> characters ever appear in any field?
Can double-quote characters ( " ) ever appear in any field other than as the 1st and last characters in the field?
Do all input fields have double-quote characters as the 1st and last character in every field?
Are the <newline> characters in your input file data in that field, or is there a fixed input line length that adds <newline> characters to enforce the input line length maximum?
Can <newline> characters ever appear in an input record other than as the last character in a record and in character positions that are integral multiples of 80?
Is there a maximum number of characters in the output file format?
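
Some of these questions can be checked against the data itself. The sketch below flags physical lines that contain an odd number of double quotes, which is what the first and last line of a record split across lines look like when every field is quoted. The function name check_quotes is hypothetical, and the check assumes that double quotes never occur inside a field.

```shell
# check_quotes: report physical lines with an odd number of double quotes.
check_quotes() {
    awk '{
        line = $0
        n = gsub(/"/, "", line)   # gsub returns how many quotes it removed
        if (n % 2)
            printf "line %d: odd number of double quotes (%d)\n", NR, n
    }' "$@"
}

# Usage: check_quotes file
```

Middle lines of a split record contain no quotes at all, so they are not reported. They lie between a reported opening line and a reported closing line.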

Srithar · May 12, 2014, 2:35pm

Hi Don,

Please find the answers for your queries below:

Are input lines starting with Record supposed to be copied to the output?
Can <newline> characters appear in any field other than field 5?
I included those lines only for illustration. The first line of the file is always the header (field names). The expected output should look like this:

"ID"	"ID2"	"NUMBER"	"ID4"	"ID5"	"DATE1"
"14"	"503"	"1012"	"314580"	"173124"	"02-May-2014 06:16:53"
"14"	"503"	"1032"	"247100"	"143773"	"02-May-2014 06:17:17"
"15"	"503"	"1012"	"247210"	"142773"	"02-May-2014 06:17:34"
"14"	"503"	"1062"	"122430"	"17828"	"02-May-2014 06:18:11"
"14"	"503"	"1012"	"-1"	""	"02-May-2014 06:18:11"
"15"	"503"	"1027"	"-1"	""	"02-May-2014 06:18:52"

Should the output ever have more than one line of output per input (multi-line) record?
No
Can <tab> characters ever appear in any field?
No
Can double-quote characters ( " ) ever appear in any field other than as the 1st and last characters in the field?
No
Do all input fields have double-quote characters as the 1st and last character in every field?
Yes
Are the <newline> characters in your input file data in that field, or is there a fixed input line length that adds <newline> characters to enforce the input line length maximum?
No
Can <newline> characters ever appear in an input record other than as the last character in a record and in character positions that are integral multiples of 80?
No
Is there a maximum number of characters in the output file format?
No

Don_Cragun · May 12, 2014, 4:02pm

Please show us a sample input file that should be transformed into that expected output. Nothing you have shown us so far matches the data in your expected output in message #9 in this thread.

Srithar · May 13, 2014, 1:55pm

In the file below, the 5th field contains the funny characters and is split across multiple lines.

Input File:

"ID1"	"ID2"	"ID3"	"RD"	"NUM"	"DATE"	
"14"	"50603"	"1012"	"213093"	"18��#��nz�S^l�����
a��`��Z�/�*��������ˮ7d_�gˉ�RB�nx����R�
9gd,�P�X�O"	"02-May-2014 04:11:54"
"15"	"50603"	"1012"	"213093"	"180778699"	"02-May-2014 04:12:48"
"14"	"50603"	"1012"	"139793"	"16M�E��~�,
J:/E��	I��VԽ�ɬ����[��?�]GޱCM�7d_�B��t�a"	"02-May-2014 04:13:07"
"14"	"50603"	"1012"	"372886"	""	"02-May-2014 04:13:11"
"14"	"50603"	"1012"	"480831"	"235345"	"02-May-2014 03:04:03"
"14"	"50603"	"1012"	"183007"	"15RM�N���>w����"	"02-May-2014 03:03:53"

Expected Output File:

"ID1"	"ID2"	"ID3"	"RD"	"NUM"	"DATE"	
"14"	"50603"	"1012"	"213093"	""	"02-May-2014 04:11:54"
"15"	"50603"	"1012"	"213093"	"180778699"	"02-May-2014 04:12:48"
"14"	"50603"	"1012"	"139793"	""	"02-May-2014 04:13:07"
"14"	"50603"	"1012"	"372886"	""	"02-May-2014 04:13:11"
"14"	"50603"	"1012"	"480831"	"235345"	"02-May-2014 03:04:03"
"14"	"50603"	"1012"	"183007"	""	"02-May-2014 03:03:53"

Don_Cragun · May 14, 2014, 5:08am

You said that there would never be embedded tab characters in a field in your input file, but there is a tab in the middle of field 5 in the two line record on lines 6 and 7 in your latest sample input file.
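
Embedded tabs of this kind can be located with a short check. This is a sketch that assumes every real field separator is a tab sitting directly between two double quotes; the function name find_embedded_tabs is hypothetical.

```shell
# find_embedded_tabs: report lines containing a tab that has a character
# other than a double quote immediately before or after it.
find_embedded_tabs() {
    awk '/[^"]\t|\t[^"]/ {
        printf "line %d: tab not enclosed by double quotes\n", NR
    }' "$@"
}

# Usage: find_embedded_tabs file
```

A trailing tab directly after a closing quote, like the one on the header line of the sample, is not reported.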

As long as there aren't any embedded tab characters immediately before or after a double quote character, the following seems to do what you want. (However, it is strange that your input file has a trailing tab character on the first line in your sample input file.)

awk '
BEGIN {	FS = OFS = "\t"
}
{	# Accumulate lines until we have a line with six fields.
#	printf("Line %d, NF %d: %s\n", NR, NF, $0)
	while(gsub(/\"\t\"/, "&") < 5) {
		rc = (getline nl)
		if(rc != 1) {
			printf("Unexpected EOF: line %d, NF %d: %s\n", NR, NF, $0)
			exit 1
		}
		$0 = $0 nl
#		printf("Line %d added, NF %d, %s\n", NR, NF, $0)
	}
	# Convert embedded tabs...
	if(gsub(/[^"]\t|\t[^"]/, "<tab>")) {
#		printf("embedded tabs replaced: %s\n", $0)
	}
	if(NR > 1 && $5 !~ /^"[0-9]*"$/) $5 = "\"\""
	print
}' file2
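
To sanity-check the result of a script like this, something along the following lines could be run on its output. It is only a sketch: the function name verify_output is hypothetical, and it assumes six fields per record, a header on the first line, and at most one trailing tab on any line.

```shell
# verify_output: exit non-zero if any line does not have six fields or,
# after the header, a 5th field that is not a quoted run of digits.
verify_output() {
    awk 'BEGIN { FS = "\t" }
    {
        n = NF
        if (NF > 0 && $NF == "") n--    # ignore a single trailing tab
        if (n != 6) {
            printf "line %d: %d fields\n", NR, n
            bad = 1
        } else if (NR > 1 && $5 !~ /^"[0-9]*"$/) {
            printf "line %d: field 5 is %s\n", NR, $5
            bad = 1
        }
    }
    END { exit bad + 0 }' "$@"
}

# Usage: verify_output cleaned_file && echo "layout looks consistent"
```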

Srithar · May 14, 2014, 7:07pm

THANKS a lot DON!! The code works perfectly and gives the expected result.