We have a tab-delimited file that contains a lot of strange characters. I tried using awk, but it did not work.
The 5th field of the file, ID, is supposed to contain only an integer, but we are getting corrupted data as shown below.
I want to remove the entire corrupted value from that field in the affected rows and replace it with an empty value.
I am not sure what these symbols are or what command can replace these junk characters.
Your suggestions are appreciated.
Example:
record 1 -
"14" "50603" "1012" "123" "12ռ4�Z{>�}ŪiY2���3�'���>N�C�7>S" "19-Mar-2014 14:58:26"
record 2 -
"14" "50603" "1012" "37164455" "ռ4�Z>S" "19-Mar-2014 14:58:26"
You didn't say anything about removing numeric characters. Why isn't the output for the 5th field in record 1 "124237"? Why isn't the output for the 5th field in record 2 "4"?
We need to empty the entire field if it contains any strange characters. We don't need to keep the numeric characters in a field that also contains strange characters, so the output for the 5th field should be " " in both cases.
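For anyone who wants to experiment with that rule, here is a minimal sketch in Python. It is not the awk script discussed in this thread. It assumes one record per line, tab-separated fields that are each wrapped in double quotes, and an empty quoted string ("") as the replacement value. The function name blank_bad_id is made up for this example. The line is handled as bytes because the corrupted data may not be valid text in any particular encoding.

```python
import re

# A clean ID field is a double-quoted, possibly empty, run of ASCII digits.
CLEAN_ID = re.compile(rb'"[0-9]*"')


def blank_bad_id(line: bytes, field: int = 5) -> bytes:
    """Return the line (without its line terminator) with the given
    1-based field replaced by "" when it is not a quoted integer.

    Assumption: the replacement is an empty quoted string; change the
    literal below if a different placeholder is wanted.
    """
    fields = line.rstrip(b"\r\n").split(b"\t")
    if len(fields) >= field and not CLEAN_ID.fullmatch(fields[field - 1]):
        fields[field - 1] = b'""'
    return b"\t".join(fields)
```

Open the file in binary mode and call the function on data lines only. A header line, if the file has one, would have to be copied through unchanged, because a field name is not a quoted integer either.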
Thanks for your reply. It helped me a lot, but it does not work in one scenario. If a corrupted record is split across 3 lines (as in the example), the script removes all of the data in the 5th field and also in the fields that follow it. Let's see some examples:
Your original problem did not say anything about multi-line records.
Your original problem clearly showed that the lines starting with Record in your input file were not supposed to be copied to your output file, but now you say that those lines should be copied to the output.
The script I gave you works perfectly for any input that you described in your original problem statement.
Before we make another attempt to help you, you need to clearly describe your input file format and what you want to appear in the output. Start by answering the following questions and then add any other information we need to know to help you get code that will do what you want:
Are input lines starting with Record supposed to be copied to the output?
Can <newline> characters appear in any field other than field 5?
Should the output ever have more than one line of output per input (multi-line) record?
Can <tab> characters ever appear in any field?
Can double-quote characters ( " ) ever appear in any field other than as the 1st and last characters in the field?
Do all input fields have double-quote characters as the 1st and last character in every field?
Are the <newline> characters in your input file data in that field, or is there a fixed input line length that adds <newline> characters to enforce the input line length maximum?
Can <newline> characters ever appear in an input record other than as the last character in a record and in character positions that are integral multiples of 80?
Is there a maximum number of characters in the output file format?
Are input lines starting with Record supposed to be copied to the output?
Can <newline> characters appear in any field other than field 5?
I only included those lines to make the example easier to understand. The first line of the file is always the HEADER (field names). The expected output should look like this:
"ID" "ID2" "NUMBER" "ID4" "ID5" "DATE1"
"14" "503" "1012" "314580" "173124" "02-May-2014 06:16:53"
"14" "503" "1032" "247100" "143773" "02-May-2014 06:17:17"
"15" "503" "1012" "247210" "142773" "02-May-2014 06:17:34"
"14" "503" "1062" "122430" "17828" "02-May-2014 06:18:11"
"14" "503" "1012" "-1" "" "02-May-2014 06:18:11"
"15" "503" "1027" "-1" "" "02-May-2014 06:18:52"
Should the output ever have more than one line of output per input (multi-line) record?
No
Can <tab> characters ever appear in any field?
No
Can double-quote characters ( " ) ever appear in any field other than as the 1st and last characters in the field?
No
Do all input fields have double-quote characters as the 1st and last character in every field?
Yes
Are the <newline> characters in your input file data in that field, or is there a fixed input line length that adds <newline> characters to enforce the input line length maximum?
No
Can <newline> characters ever appear in an input record other than as the last character in a record and in character positions that are integral multiples of 80?
No
Is there a maximum number of characters in the output file format?
No
Please show us a sample input file that should be transformed into that expected output. Nothing you have shown us so far matches the data in your expected output in message #9 in this thread.
You said that there would never be embedded tab characters in a field in your input file, but there is a tab in the middle of field 5 in the two line record on lines 6 and 7 in your latest sample input file.
As long as there aren't any embedded tab characters immediately before or after a double quote character, the following seems to do what you want. (However, it is strange that your input file has a trailing tab character on the first line in your sample input file.)
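To illustrate the idea for multi-line records, here is a rough sketch in Python. It is not the script originally posted in this thread, only one way the same condition could be put to work. It assumes that the first line is a header to be copied unchanged, that every record has 6 double-quoted fields, that the three-byte sequence quote, tab, quote occurs only between fields, and that a record is complete once it contains 5 such boundaries and ends with a double quote. The name clean_records and its parameters are invented for this example.

```python
import re

SEP = b'"\t"'                      # boundary between two quoted fields
DIGITS = re.compile(rb"[0-9]*")    # an ID with its quotes already split off


def clean_records(lines, nfields=6, id_field=5):
    """Yield one cleaned record (no line terminator) per logical record.

    lines is an iterable of byte strings, e.g. a file opened with "rb".
    id_field must be an interior field (not the first or the last one),
    because splitting on SEP leaves the outer quotes on those two.
    """
    lines = iter(lines)
    header = next(lines, None)
    if header is None:
        return
    yield header.rstrip(b"\r\n")           # header is passed through as-is
    buf = b""
    for raw in lines:
        buf += raw.rstrip(b"\r\n")         # join continuation lines
        if buf.count(SEP) < nfields - 1 or not buf.endswith(b'"'):
            continue                        # record is not complete yet
        fields = buf.split(SEP)
        if not DIGITS.fullmatch(fields[id_field - 1]):
            fields[id_field - 1] = b""      # becomes "" once rejoined
        yield SEP.join(fields)
        buf = b""
    if buf:
        yield buf                           # incomplete trailing record
```

To use it, open the input in binary mode, iterate over clean_records(f), and write each result followed by a newline. Embedded tabs inside field 5 are tolerated as long as they are not next to a double quote. A record that never completes will swallow the records after it, so input that breaks the assumptions above needs a closer look first.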