Hi,
Is there any way to find the junk characters in a file? Consider a file with data as given below:
123|abc^M|Doctor^C #record 1
234|def|Med #record 2
345|dfg^C|Wrong^V #record 3
The junk characters are the control characters (^M, ^C, ^V), and this is a pipe-delimited file.
Is there any way to find the records with junk characters and move them to one file, and the good records to a separate file?
Thanks in advance
Hi,
That doesn't seem to work. These are control characters, which are generally not visible in vi.
If you know which special characters (like #, $, %) are permitted in those lines, then you could use this:
awk '!/[^a-zA-Z|#$%]/' file
Add to the bracket expression any characters other than a-z and A-Z that are permitted to be part of a field's string.
Another way is:
awk '!/[[:cntrl:]]/' file
But it might not work with all AWK implementations.
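For AWK implementations that do not understand [[:cntrl:]], one possible fallback is to spell the control range out with octal escapes. This is only a sketch: sample.txt and good_data are made-up names, the range 001-037 plus 177 (octal) is an assumption about what counts as junk (note that it includes TAB), and LC_ALL=C is set so the range is interpreted byte by byte.

```shell
# Hypothetical sample: the second record carries a Ctrl-Z byte (octal 032).
printf '1234|good|boy|6\n2061|john|ziet\032|7\n' > sample.txt

# Keep only records that contain no byte in the octal ranges 001-037 or 177.
LC_ALL=C awk '!/[\001-\037\177]/' sample.txt > good_data
cat good_data
```

Dropping the "!" selects the records that do contain such a byte.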
methyl
June 25, 2010, 7:23am
5
Further to bartus11: two passes to achieve the required result.
awk '!/[[:cntrl:]]/' filename > good_data
awk '/[[:cntrl:]]/' filename > bad_data
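The two passes can probably be folded into one, since AWK can redirect print to different files from the same script. A minimal sketch, assuming an AWK that supports [[:cntrl:]]; sample.txt is a made-up input name, while good_data and bad_data follow the commands above:

```shell
# Hypothetical sample: the second record carries a Ctrl-Z byte (octal 032).
printf '1234|good|boy|6\n2061|john|ziet\032|7\n' > sample.txt

# One pass: records with a control character go to bad_data,
# all other records go to good_data.
LC_ALL=C awk '/[[:cntrl:]]/ { print > "bad_data"; next }
              { print > "good_data" }' sample.txt
```

Note that an output file is only created if at least one record is written to it.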
Hi,
@methyl: It is not working.
The sample data is this:
15|Jsus
15|Susan
The first record has a special character in it, which is actually an invisible junk character (ASCII value \032).
The above command doesn't catch that as a bad record.
Did you try that?
awk '!/[^a-zA-Z|#$%]/' file
Yes, I have tried that. It is still not working.
The ASCII value is \032. It is shown as ^Z in vi.
Ashwin, did you try the second method suggested by bartus11? It should work.
Guru.
Hey, I tried that.
This is the result. It is printing the record with the junk characters as well.
$ awk '/[^a-zA-Z|#$%]/' junk1.txt
15|Susan|123
john|zint^Z|123
^Z is not a printable character, so it would not show up as visible text in AWK's output. Maybe there are just two normal characters: "^" and "Z"? Try
cat junk1.txt
and see if "^Z" appears there. If it does, then it is not Ctrl+Z, just a regular "^" and "Z". Also, you can post the output of
xxd junk1.txt
so we can see exactly what your test data looks like.
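If xxd is not installed, od -c and cat -v give a similar view of the raw bytes. A small sketch under assumed names (sample.txt, dump.txt and visible.txt are made up for illustration):

```shell
# Hypothetical record with a real Ctrl-Z byte (octal 032) after "ziet".
printf '2061|john|ziet\032|7\n' > sample.txt

# od -c prints non-printable bytes as three-digit octal values.
od -c sample.txt > dump.txt
cat dump.txt

# cat -v renders control bytes in caret notation, so 032 appears as ^Z.
cat -v sample.txt > visible.txt
cat visible.txt
```

Seeing 032 in the od output means the file holds a single control byte, not the two ordinary characters "^" and "Z".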
ashwin3086:
Hi,
@methyl : It is not working.
The sample data is this:
15|Jsus
15|Susan
The first record has a special character in it which is actually a junk character ( ascii value is \032) which is invisible.
The above command doesnt catch that as a bad record.
Odd. It should match [:cntrl:].
$ printf '\032' | awk '/[[:cntrl:]]/ {print "MATCHED SUCCESSFULLY"}'
MATCHED SUCCESSFULLY
Regards,
Alister
---------- Post updated at 10:05 AM ---------- Previous update was at 10:01 AM ----------
A POSIX-compliant sed alternative, which only needs to read the file once:
sed -n '/^[^[:cntrl:]]*$/{p;d;}; w junk' data > nojunk
Hi,
This is the output I get if I try xxd and "od -c". As you can see, there is a special character (032) after "ziet" in the first record. The second record is a good one.
$ od -c junk1.txt
0000000 2 0 6 1 | j o h n | z i e t 032 |
0000020 7 \n 1 2 3 4 | a s b c | b c f g
0000040 | 1 0 \n
0000044
$ xxd junk1.txt
0000000: 3230 3631 7c6a 6f68 6e7c 7a69 6574 1a7c 2061|john|ziet.|
0000010: 370a 3132 3334 7c61 7362 637c 6263 6667 7.1234|asbc|bcfg
0000020: 7c31 300a |10.
You copied the command incorrectly. It is not
awk '/[^a-zA-Z|#$%]/' junk1.txt
but
awk '!/[^a-zA-Z|#$%]/' junk1.txt
Hi,
I tried the command given below:
awk '!/[^a-zA-Z|#$%]/' junk1.txt
The output of xxd is given below:
$ xxd junk1.txt
0000000: 3230 3631 7c6a 6f68 6e7c 7a69 6574 1a7c 2061|john|ziet.|
0000010: 370a 3132 3334 7c61 7362 637c 6263 6667 7.1234|asbc|bcfg
0000020: 7c31 300a |10.
It is not able to capture the record which has the junk character in it. When I run the above command on this file, it gives 0 records, but it is supposed to give the record with the junk character.
---------- Post updated at 07:51 AM ---------- Previous update was at 12:26 AM ----------
Any help on this??
Hi,
I have a file with the data given below:
$ cat junk1.txt
2061|john|ziet|7
1234|good|boy|6
There is an invisible control character after ziet. It is seen below as 032.
$ od -c junk1.txt
0000000 2 0 6 1 | j o h n | z i e t 032 |
0000020 7 \n 1 2 3 4 | g o o d | b o y |
0000040 6 \n
0000042
How can I pick records that have such characters in them?
I have tried these commands, but they are not working.
awk '!/[^a-zA-Z]/' junk1.txt
awk '!/[[:cntrl:]]/' junk1.txt
Can anyone help with this?
Those commands pick records that do NOT contain special characters. This:
awk '/[^a-zA-Z|]/' junk1.txt
should show records WITH those characters.
@bartus
I am getting the following output if I run that:
$ awk '/[^a-zA-Z|]/' junk1.txt
2061| john|ziet|7
1234|good|boy|6
It is giving both the records.
Ah, there are numbers too ;), and spaces... If a space is a good character for you, then:
awk '/[^a-zA-Z0-9 |]/' junk1.txt
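Rather than discovering missing whitelist characters one run at a time, it may help to first list every byte the whitelist would reject. A rough sketch, assuming the same allowed set as the command above; sample.txt and leftover.txt are made-up names:

```shell
# Hypothetical sample: the first record carries a Ctrl-Z byte (octal 032).
printf '2061|john|ziet\032|7\n1234|good|boy|6\n' > sample.txt

# Delete every allowed character (and the newlines); whatever is left
# are the bytes that the whitelist pattern would flag as junk.
LC_ALL=C tr -d 'a-zA-Z0-9 |\n' < sample.txt | od -An -c > leftover.txt
cat leftover.txt
```

If leftover.txt shows ordinary characters such as digits or punctuation, they are candidates for adding to the bracket expression.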
Hi,
This command is not capturing the ^@ character (NUL). Is there any way I can capture that and send that record to a bad file?
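Some AWK implementations treat NUL as the end of a string, which may be why the pattern never sees it. One possible workaround, offered only as a sketch, is to translate NUL to another control byte with tr before AWK reads the data. The file names are made up, and the choice of \001 as a stand-in is an assumption: the bad file will then contain \001 where the NUL was.

```shell
# Hypothetical sample: the first record carries a NUL byte (octal 000).
printf '2061|john|zi\000et|7\n1234|good|boy|6\n' > sample.txt

# Replace NUL with \001 so AWK sees an ordinary (but still disallowed) byte,
# then split records with the same whitelist as above.
tr '\000' '\001' < sample.txt |
LC_ALL=C awk '/[^a-zA-Z0-9 |]/ { print > "bad_data"; next }
              { print > "good_data" }'
```

This only helps if \001 never occurs as legitimate data in the file.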