Finding junk characters

Hi,

Is there any way to find the junk characters in a file? Consider a file with the data given below:

123|abc^M|Doctor^C #record 1
234|def|Med             #record 2
345|dfg^C|Wrong^V  #record 3

The junk characters are the control characters shown as ^M, ^C and ^V, and this is a pipe-delimited file.
Is there any way to find the records with junk characters and move them to one file, and the good records to a separate file?

Thanks in advance

grep -v "\^" input_file

Hi,

That doesn't seem to work. These are control characters, which are generally not visible in vi.

If you know which special characters (like #, $, %) are permitted in those lines, then you could use this:

awk '!/[^a-zA-Z|#$%]/' file

Add to the bracket expression any characters other than "a-z" that are permitted to be part of a field's string.

Another way is:

awk '!/[[:cntrl:]]/' file

But it might not work with all AWK implementations.

Further to bartus11's reply: two passes achieve the required result.

awk '!/[[:cntrl:]]/' filename > good_data
awk '/[[:cntrl:]]/' filename > bad_data
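
If reading the file twice is a concern, the same split can probably be done in one awk pass. This is only a sketch: the sample data is rebuilt with printf from the first post, `workdir` is a scratch directory made up for the example, and `LC_ALL=C` is my own addition so that the character class is evaluated byte by byte. It still relies on the awk in use supporting `[[:cntrl:]]`.

```shell
# Rebuild the sample from the first post in a scratch directory:
# record 1 has a CR (\r, shown as ^M), record 3 has an ETX (\003, shown as ^C).
workdir=$(mktemp -d)
printf '123|abc\r|Doctor\n234|def|Med\n345|dfg\003|Wrong\n' > "$workdir/input_file"

# Single pass: a line with any control character goes to bad_data,
# every other line goes to good_data.
LC_ALL=C awk -v good="$workdir/good_data" -v bad="$workdir/bad_data" '
    /[[:cntrl:]]/ { print > bad; next }
                  { print > good }
' "$workdir/input_file"

wc -l < "$workdir/good_data"   # count of clean records
wc -l < "$workdir/bad_data"    # count of records with control characters
```

Note that a tab is also a control character, so tab-containing records would land in bad_data with this pattern.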

Hi,

@methyl: It is not working.
The sample data is this:
15|Jsus
15|Susan

The first record contains an invisible junk character (ASCII value \032).

The above command doesn't catch it as a bad record.

Did you try that?

awk '!/[^a-zA-Z|#$%]/' file

Yes, I have tried that. It is still not working.
The ASCII value is \032. It is shown as ^Z in vi.
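
One way to make such a byte visible without opening an editor is sed's `l` command, which POSIX describes as writing non-printable bytes as three-digit octal escapes and ending each line with `$`. A small sketch follows; the sample file is recreated with printf, and the position of the \032 byte inside the first record is an assumption, since the byte did not survive in the post.

```shell
# Recreate a two-record sample; \032 is the octal escape for Ctrl-Z (hex 1a).
# Where exactly the byte sits in the first record is assumed for illustration.
sample=$(mktemp)
printf '15|J\032sus\n15|Susan\n' > "$sample"

# "l" writes each line in an unambiguous form, so hidden bytes show up
# as backslash escapes instead of being swallowed by the terminal.
visible=$(LC_ALL=C sed -n l "$sample")
printf '%s\n' "$visible"
```

Any line whose output contains a backslash escape such as \032 holds a byte that the terminal would not display.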

Ashiwn, did you try the second method suggested by bartus11? It should work.

Guru.

Hey, I tried that.
This is the result. It prints the record with the junk characters as well.

$ awk '/[^a-zA-Z|#$%]/' junk1.txt
15|Susan|123
john|zint^Z|123

^Z is not a printable character, so it would not normally show up as "^Z" in AWK's output. Maybe there are just two normal characters: "^" and "Z"? Try

cat junk1.txt

and see if "^Z" appears there. If it does, then it is not CTRL+Z, just a regular "^" and "Z". You can also post the output of

xxd junk1.txt

so we can see exactly what your test data looks like.

Odd. It should match [:cntrl:].

$ printf '\032' | awk '/[[:cntrl:]]/ {print "MATCHED SUCCESSFULLY"}'
MATCHED SUCCESSFULLY

Regards,
Alister

---------- Post updated at 10:05 AM ---------- Previous update was at 10:01 AM ----------

A POSIX-compliant sed alternative which only needs to read the file once:

sed -n '/^[^[:cntrl:]]*$/{p;d;}; w junk' data > nojunk

Hi,

This is the output I get from xxd and "od -c". As you can see, there is a special character (032) after "ziet" in the first record. The second record is a good one.

$ od -c junk1.txt
0000000   2   0   6   1   |   j   o   h   n   |   z   i   e   t 032   |
0000020   7  \n   1   2   3   4   |   a   s   b   c   |   b   c   f   g
0000040   |   1   0  \n
0000044

$ xxd junk1.txt
0000000: 3230 3631 7c6a 6f68 6e7c 7a69 6574 1a7c  2061|john|ziet.|
0000010: 370a 3132 3334 7c61 7362 637c 6263 6667  7.1234|asbc|bcfg
0000020: 7c31 300a                                |10.

You copied the code incorrectly. It is not

awk '/[^a-zA-Z|#$%]/' junk1.txt

but

awk '!/[^a-zA-Z|#$%]/' junk1.txt

Hi,

I tried the command given below:

---> awk '!/[^a-zA-Z|#$%]/' junk1.txt
The output of xxd is given below:

$ xxd junk1.txt
0000000: 3230 3631 7c6a 6f68 6e7c 7a69 6574 1a7c  2061|john|ziet.|
0000010: 370a 3132 3334 7c61 7362 637c 6263 6667  7.1234|asbc|bcfg
0000020: 7c31 300a                                |10.

It is not able to capture the record that has the junk character in it. When I run the above command on this file, it returns 0 records, but it is supposed to return the record with the junk character.

---------- Post updated at 07:51 AM ---------- Previous update was at 12:26 AM ----------

Any help on this?

Hi,

I have a file with the data given below:

$ cat junk1.txt
2061|john|ziet|7
1234|good|boy|6

There is an invisible control character after "ziet". It is shown below as 032.

$ od -c junk1.txt
0000000   2   0   6   1   |   j   o   h   n   |   z   i   e   t 032   |
0000020   7  \n   1   2   3   4   |   g   o   o   d   |   b   o   y   |
0000040   6  \n
0000042

How can I pick the records that have such characters in them?

I have tried these commands, but they are not working.

 awk '!/[^a-zA-Z]/' junk1.txt
 awk '!/[[:cntrl:]]/' junk1.txt

Can anyone help with this?

Those commands should pick records that do NOT contain special characters. This:

awk '/[^a-zA-Z|]/' junk1.txt

should show records WITH those characters.

@bartus

I am getting the following output if I run that:

$ awk '/[^a-zA-Z|]/' junk1.txt
2061| john|ziet|7
1234|good|boy|6

It returns both records.

Ah, there are numbers too ;), and spaces... If a space is a good character for you, then:

awk '/[^a-zA-Z0-9 |]/' junk1.txt
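
To spell out what went wrong earlier: a bracket expression that starts with `^` matches any character NOT listed, so every character that legitimately occurs in the data (here digits, letters, space and the pipe) has to be in the list. A rough sketch of both directions of the test follows, using a copy of junk1.txt rebuilt with printf in a scratch directory; `tmp` and the three variable names are made up for the example, and `LC_ALL=C` is my addition so the ranges are compared as plain bytes.

```shell
# Rebuild junk1.txt: the first record carries \032 (Ctrl-Z) after "ziet".
tmp=$(mktemp -d)
printf '2061|john|ziet\032|7\n1234|good|boy|6\n' > "$tmp/junk1.txt"

# Records containing at least one character outside the allowed list.
bad_records=$(LC_ALL=C awk '/[^a-zA-Z0-9 |]/' "$tmp/junk1.txt")

# Records made up only of allowed characters (same pattern, negated with !).
good_records=$(LC_ALL=C awk '!/[^a-zA-Z0-9 |]/' "$tmp/junk1.txt")

# For comparison: with 0-9 left out of the list, the digits themselves count
# as junk, so no record at all passes the "good" test.
too_strict=$(LC_ALL=C awk '!/[^a-zA-Z |]/' "$tmp/junk1.txt" | wc -l)

printf 'good: %s\n' "$good_records"
printf 'records passing the too-strict test: %s\n' "$too_strict"
```

That last count is why the earlier `!/[^a-zA-Z|#$%]/` attempt returned nothing on this file.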

Hi,
This command does not capture the ^@ character (NUL). Is there any way I can capture it and send that record to a bad file?
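
NUL is awkward because awk implementations differ in how they treat it in input, so the result depends on which awk is installed. One possible workaround, offered only as a sketch, is to let tr rewrite each NUL as another control character before awk sees the data. The choice of SOH (\001) as the stand-in, the `dir` scratch directory and the sample record are all assumptions made for the example. The side effect is that the bad file contains \001 where the NUL used to be; good records are written unchanged because they contain no control characters to begin with.

```shell
# Sample: the first record has a NUL byte (\000, shown as ^@ in vi) after "ziet".
dir=$(mktemp -d)
printf '2061|john|ziet\000|7\n1234|good|boy|6\n' > "$dir/junk1.txt"

# tr replaces every NUL with SOH (\001), so awk never has to handle a NUL.
# The control-character test then catches the stand-in like any other junk byte.
tr '\000' '\001' < "$dir/junk1.txt" |
LC_ALL=C awk -v good="$dir/good_data" -v bad="$dir/bad_data" '
    /[[:cntrl:]]/ { print > bad; next }
                  { print > good }
'

wc -l < "$dir/good_data"   # count of clean records
wc -l < "$dir/bad_data"    # count of records that had NUL or another control byte
```

If the original bytes must be preserved in the bad file, this approach is not enough on its own and the line numbers of the bad records would have to be used to pull them from the untouched input.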