Find special characters in an XML file

Hi, I have an XML file that contains a lot of special characters. I need to find them and write a distinct list of them to a text file. The set of special characters is not fixed; it can be anything at any given time.

Can anyone help me find and list them?

I'm using a ksh script.

Thanks & Regards,
Krishanu Saha

The description seems pretty vague. How about a sample input file, with code tags, and the expected output?

OK, let me explain again. I have an XML file. I need to find every character in it other than A-Z, a-z, 0-9 and the following symbols:

_ (underscore) , (comma) () (parentheses) & (ampersand) ; (semicolon) {} (curly braces) % (percent) + (plus) < (less than) > (greater than) / (forward slash) : (colon) = (equals) . (dot) ' ' (space) " (double quote) - (hyphen) \ (backslash) $ (dollar) and * (asterisk).

Any character or symbol not in the list above should be treated as invalid, and I need to find all of them.

Does this awk serve your purpose?

awk '{gsub(/[a-zA-Z0-9_,()&;{}%+<>/:=. "\-\\$*]/,x)}NF' file.xml

Thanks, but it's not working.

I tried:

awk '{gsub(/[a-zA-Z0-9_,()&;{}%+<>/:=. "-\$*]/,x)}NF' jhfnfull.xml > pqr.txt

but got the following error messages:

awk: syntax error near line 1
awk: illegal statement near line 1
awk: syntax error near line 1
awk: illegal statement near line 1

Modified code:

awk '{gsub(/[a-zA-Z0-9_,()&;{}%+<>\/:=. "\-\\$*]/,x)}NF' file.xml

Note: use nawk instead if you are on SunOS or Solaris.
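
The command above prints whatever is left of each line once the allowed characters are stripped. The original question also asked for a distinct list in a text file, so here is a minimal sketch of one way to get that. The sample file, the temp directory and the name special_chars.txt are made up for illustration, and the hyphen is moved to the end of the bracket instead of being escaped. It assumes an awk that has gsub (nawk on Solaris), and an awk that works byte by byte may split multi-byte characters.

```shell
dir=$(mktemp -d)

# Hypothetical sample standing in for file.xml.
printf '%s\n' '<name>Tom # Jerry</name>' '<note>a~b|c#d?</note>' > "$dir/sample.xml"

awk '{
    # Drop every allowed character; only the unexpected ones remain.
    gsub(/[a-zA-Z0-9_,()&;{}%+<>\/:=. "\\$*-]/, "")
    # Record each leftover character once.
    n = length($0)
    for (i = 1; i <= n; i++) seen[substr($0, i, 1)] = 1
}
END { for (c in seen) print c }' "$dir/sample.xml" | LC_ALL=C sort > "$dir/special_chars.txt"

cat "$dir/special_chars.txt"
```

For the sample input, special_chars.txt ends up holding #, ?, | and ~, one per line.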

Thank you, this is working. Let me do some testing.

---------- Post updated at 08:30 PM ---------- Previous update was at 04:36 PM ----------

I need some more help. Now I need to keep A-Z, a-z, 0-9 and the following symbols in the XML file, and remove every other character that is not listed here.

_ (underscore) , (comma) () (parentheses) & (ampersand) ; (semicolon) {} (curly braces) [] (square brackets) % (percent) + (plus) < (less than) > (greater than) / (forward slash) : (colon) = (equals) . (dot) ' ' (space) " (double quote) ' (single quote) - (hyphen) \ (backslash) $ (dollar) @ (at sign) and * (asterisk).

Can anyone please help me with this?

Regards,
Krishanu

$ cat temp.txt
AZaz09 _ , () & ; {} [] % + ` ~
< > / : = . " - \ $ @ * |?
$ sed 's/[][^A-Za-z0-9_,()&;{}%+<>/:= ."\$@*-]//g' temp.txt
`~
|?
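
In the sed above the ^ is not the first character inside the bracket, so the set is not negated. The command deletes the listed characters and shows what would be left over. A rough sketch of the opposite, which keeps the listed characters and deletes everything else, could look like the following. It is an illustration only: the sample string is made up, | is used as the s delimiter so that the / inside the bracket cannot be mistaken for the end of the pattern, and the single quote is added with the '\'' shell idiom.

```shell
# Hypothetical one-line sample; only | and ? fall outside the allowed set.
cleaned=$(printf '%s\n' "a'b[c]|d?e\\f" |
    sed 's|[^][A-Za-z0-9_,()&;{}%+<>/:=. "'\''\$@*-]||g')

printf '%s\n' "$cleaned"
```

The ] has to come first after the ^, and the - last, for both to be taken literally inside the bracket.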

A modified version of Yoda's command:

awk '{gsub(/[^a-zA-Z0-9_,()&;{}\[\]%+<>\/:=. "\-\\$@*'\'']/,x)}NF' file.xml
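
The '\'' near the end of that bracket is a shell quoting trick, not awk syntax. It closes the single-quoted string, adds one backslash-escaped quote, and reopens the string. A small sketch with made-up input and a deliberately short character set shows the effect:

```shell
# '\'' closes the quoted string, adds one escaped quote, then reopens the string.
printf '%s\n' 'it'\''s'

# The same trick inside an awk program: awk itself only ever sees a plain quote.
kept=$(printf '%s\n' "it's a|b" | awk '{ gsub(/[^a-z '\'']/, ""); print }')
printf '%s\n' "$kept"
```

Here the | is removed and the single quote survives, because the quote is part of the negated set that awk receives.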

--ahamed

Hello friend, I need your valuable help again. As per your suggestion, I've used the following command in my ksh script to remove all characters not listed in the command:

nawk '{gsub(/[^a-zA-Z0-9_,()&;{}\[\]%+#<>\/:=. "\-\\$@*~?!\`\^]/,x)}NF' file.xml

This works fine without any issue. With this command the single quote is also removed, since it is not in the list of symbols.

But now the client wants to keep ' (single quote) in the XML file, and I cannot work out how to add the single quote to this command.

Can you please help me handle this situation? I do not want to lose this command, as it has almost resolved my issue. Please help.

Getting single quotes into awk is annoying on the command line. You could put the program into a file:

{gsub(/[^a-zA-Z0-9_,()&;{}\[\]%+#<>\/:=. "\-\\$@*~?!\`\^\']/,x)}NF

Then run it like nawk -f stripchars.awk inputfile > outputfile
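
If the awk file should be created from within the ksh script itself, a here-document with a quoted delimiter is one option, since the shell then copies the lines literally and the single quote needs no escaping at all. This is only a sketch: the temp directory, the sample input and the file names other than stripchars.awk are invented, the quote, backtick and ^ are written without backslashes, and the hyphen is moved to the end of the bracket. On Solaris, nawk would replace awk.

```shell
dir=$(mktemp -d)

# Quoted delimiter: the shell copies these lines literally, so ' needs no escaping.
cat > "$dir/stripchars.awk" <<'EOF'
# Delete every character outside the allowed set; NF prints non-empty lines.
{ gsub(/[^a-zA-Z0-9_,()&;{}\[\]%+#<>\/:=. "\\$@*~?!`^'-]/, "") } NF
EOF

# Hypothetical input: only the | is outside the allowed set.
printf '%s\n' "<note>it's a|b</note>" > "$dir/input.xml"

awk -f "$dir/stripchars.awk" "$dir/input.xml" > "$dir/output.xml"
cat "$dir/output.xml"
```

With the sample line the | is dropped and the single quote is kept.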