Need help deleting special characters that exist only at the end of each record in a UNIX file

rakeshp · January 21, 2019, 12:51pm

Hi,

I have a file in UNIX with 15 columns. It contains special characters (#, $, ^M, @, *, %, etc.) at the end of each record. I want to remove these special characters. I used the following:

Sed -e 's[^a-zA-Z|0-9]/ /g;s/  */ /g'

But it removes special characters everywhere in the file (beginning, middle, and end). I want to remove only the special characters that appear at the end of each record. I do not want to delete the special characters that are present at the beginning or in the middle of a record. The UNIX operating system is AIX. Please help.

Thanks
Rakesh

nezabudka · January 21, 2019, 1:06pm

Hi, rakeshp
Try this. It will only delete the last character:

sed 's/[^\w]$//'

Or this, which will delete a run of trailing characters:

sed 's/[^\w]\+$//'

Don_Cragun · January 21, 2019, 1:46pm

rakeshp:

Hi,

I have a file in UNIX with 15 columns. It contains special characters (#, $, ^M, @, *, %, etc.) at the end of each record. I want to remove these special characters. I used the following:
Sed -e ‘s[^a-zA-Z|0-9]/ /g;s/  */ /g'
But it removes special characters everywhere in the file (beginning, middle, and end). I want to remove only the special characters that appear at the end of each record. I do not want to delete the special characters that are present at the beginning or in the middle of a record. The UNIX operating system is AIX. Please help.

Thanks
Rakesh

Hi Rakesh,
Please note that UNIX shells and utilities are very picky about spelling, capitalization, and quoting. The utility named sed will not be found if you spell it with a leading capital letter (as in Sed as shown in your sample code above) and the single-quote characters surrounding the last argument to sed must be the single-quote character that is found as the unshifted character on the key that has the double-quote character as the shifted character on a US keyboard. The pretty-printed characters in the Sed command you showed us above will cause a syntax error and a mismatched quotes error.

Also note that ^M is two characters (one that is special in some, but not all, circumstances and one that is seldom special). Some utilities will print control characters as a two-character sequence (with ^M representing the <carriage-return> character), but you can't search for control characters using these two-character sequences in a regular expression.
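To illustrate the point about ^M, here is a minimal sketch (the sample record and the variable names are made up for illustration, and a POSIX shell is assumed). od -c shows a real <carriage-return> as \r, and a literal <carriage-return> byte produced by printf can be placed in the sed expression:

```shell
# Hypothetical one-line record that ends with a real <carriage-return>.
record=$(printf 'aa|bb|cc\r')

# od -c prints the control character as \r, so there is no guessing
# about whether the data holds one character or the two characters ^ and M.
printf '%s\n' "$record" | od -c

# Put a literal <carriage-return> into a variable and use it in the BRE;
# typing the two characters ^ and M into the script would not match it.
cr=$(printf '\r')
cleaned=$(printf '%s\n' "$record" | sed "s/${cr}\$//")
printf '%s\n' "$cleaned"
```

If od -c shows the two separate characters ^ and M instead of \r, the data really does contain a circumflex followed by a capital M, and a different expression is needed.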

Also note that your specification is ambiguous. Are you trying to get rid of a single special character at the end of a line or are you trying to get rid of all adjacent special characters at the end of a line? Please give us a more precise definition of what you're trying to do and show us a small, representative sample input file (in CODE tags) that shows us what the input files you will be processing look like and show us a corresponding sample output file (in CODE tags) that shows us the output you hope to produce. Producing a list of special characters ending with "etc." leaves us guessing at what you mean by special. Is your definition of special anything that is not an alphanumeric character in your current locale? If it is, please confirm that this is what you mean. If not, please provide a definitive definition of the term special character that you are using.

Since you're converting all non-alphanumeric characters to <space>s, you seem to be indicating that <space> is to be treated as a special character. Does that mean that a single <space> or all <space>s at the end of a line should be deleted?

Your code seems to be trying to change all occurrences of two or more adjacent <space> characters to a single <space> in your output. You didn't say anything about doing that in the description of what you're trying to do. Is this a requirement or just a side effect of the way you're converting non-alphanumeric characters to <space> as an intermediate step in your algorithm?

Hi nezabudka,
Note that rakeshp is using AIX. The GNU sed \w escape sequence to match a "word" character in a BRE is not supported by AIX (and is not allowed by the POSIX standards for any version of sed).

rakeshp · January 22, 2019, 2:05pm

Hi nezabudka,

Thank you for the reply. The code sed 's/[^\w]$//' is working well so far in AIX. Can we use the same code in Linux as well?

Thanks
Rakesh

--- Post updated at 02:05 PM ---

Hi Don Cragun,

Thank you for the reply and the corrections to the code. I am trying to get rid of a single special character or all adjacent special characters at the end of a line. I am not trying to convert any non-alphanumeric characters to <space>. I am only trying to get rid of the special characters, single or adjacent, that appear at the end of each record.

Thanks
Rakesh

RudiC · January 22, 2019, 2:13pm

You weren't too far off in your own attempt. Try, after correcting the already commented on syntax errors and anchoring the regex at the line end,

sed 's/[^a-zA-Z|0-9]$/ /; s/  */ /g' file

Don_Cragun · January 22, 2019, 2:26pm

I'm surprised that \w is working for you in AIX, but I'm glad that nezabudka's suggestion is working for you. (Even if \w is recognized as special in a BRE, I would not have expected it to treat the vertical bar symbol ( | ) as a regular character. Note that the BRE in the substitute command you were using (after inserting the missing slash character / ):

s/[^a-zA-Z|0-9]/ /g

would match any character that was not lowercase alphabetic, was not uppercase alphabetic, was not numeric, and was not a vertical bar character and change each of the matched characters to a <space>. So, I assumed you want the vertical bar character to be treated as a normal character. Is \w really treating the vertical bar character the way you want it to?)

Something that should work with any standard sed (assuming that your definition of a special character is anything that is not an alphanumeric character and is not a vertical bar symbol) would be:

sed 's/[^[:alnum:]|]*$//'
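One possible explanation for why [^\w] appears to work, offered as a sketch and not as a statement about any particular AIX release: POSIX says a backslash loses its special meaning inside a bracket expression, so [^\w] reads as "any character other than a backslash or the letter w". On that reading, s/[^\w]$// deletes the last character of almost every line, including ordinary letters and digits. The sample records below are made up for illustration:

```shell
# Three made-up records: one ends in a special character, one in a digit,
# and one in special characters that follow a vertical bar.
strip_w=$(printf 'abc#\nabc9\nab|#$\n' | sed 's/[^\w]$//')
strip_alnum=$(printf 'abc#\nabc9\nab|#$\n' | sed 's/[^[:alnum:]|]*$//')

printf '%s\n' "$strip_w"      # result of the [^\w]$ substitution
printf '%s\n' "$strip_alnum"  # result of the [^[:alnum:]|]*$ substitution
```

Comparing the two results on a line that ends in a digit is a quick way to tell whether a given sed treats \w inside brackets as a word-character class or as two literal characters.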

Don_Cragun · January 22, 2019, 3:18pm

rudic:

You weren't too far off in your own attempt. Try, after correcting the already commented on syntax errors and anchoring the regex at the line end,
sed 's[^a-zA-Z|0-9]$/ /g; s/  */ /g' file

Hi RudiC,
I'm afraid that still won't work:

There is a <slash> character missing after the s in the substitute command.
That BRE will only match one "special" character at the end of a line.
And, it still coalesces sequences of multiple <space>s not found at the end of a line to a single <space>.

Point #1 above leads to a syntax error. The other two points don't meet the (still poorly specified) requirements for this problem.

Staying closer to what was given originally, we could try:

sed 's/[^a-zA-Z|0-9]*$/ /g; s/  *$//' file

but that just uses two substitutions to do what I think I did in post #6 with a single substitution.

You're usually better with sed than I am. Am I missing something here?

~ Don

RudiC · January 22, 2019, 4:25pm

Thanks for pointing out the remaining syntax error - didn't test the command before posting. Corrected in my above post. For your other questions - I don't know what exactly the requestor is after, as the spec is unclear, so I just corrected (partly, haha) the obvious, taking for granted the rest of the original regex. Let's wait for a statement of the requestor, or, even better, a data sample.

wisecracker · January 22, 2019, 4:29pm

You have specified pure ASCII characters but have not said whether your file(s) contain, for example, UTF-8 characters, so...
...to expand on everyone's posts so far: sed can be caught out by some Unicode characters.
If you encounter Unicode characters in any part of your file, this might work using Don's simplest version.
Longhand OSX 10.14.1, default bash terminal; BSD sed version unknown.
Note there are 6 UNICODE characters in the test string.

Last login: Tue Jan 22 21:16:39 on ttys000
AMIGA:amiga~> echo "abc\��/o*&&^%&^?*s123HGFi(*&*��*&�" | iconv -c -f utf-8 -t ascii | sed 's/[^[:alnum:]|]*$//'
abc\/o*&&^%&^?*s123HGFi
AMIGA:amiga~> _

rakeshp · January 29, 2019, 2:13am

Hi Don Cragun/RudiC/nezabudka/All,

Now I have a different requirement. I have 20 columns in total in each record. I need to remove the ^M characters at the end of each record and the special characters in a specific column, say the 15th column, of each record.

I have tried this

sed 's/[^\w]$//g' | awk 'BEGIN{FS="|";OFS="|"}{gsub("[^[:alnum:]]","",$15);print }'.

But this command is not working for me. Please let me know how we can write the command for this

Thanks
Rakesh

nezabudka · January 29, 2019, 2:43am

Bad idea to use sed and awk together. Try this

awk 'BEGIN {FS="|"; OFS="|"} {sub("[^[:alnum:]]$", ""); gsub("[^[:alnum:]]", "", $15)} 1'

Don_Cragun · January 29, 2019, 4:59am

I'm disappointed that you ignored my request for clarity in the description of your problem again and still won't show us a sample of what you're trying to do.

If we assume that when you say you need to "remove ^M characters at the end of each record", you really mean that you want to remove a single <carriage-return> character at the end of the record and that a <carriage-return> character exists at the end of every record you will process, then the first awk sub() call nezabudka suggested may meet that requirement. However, it might be safer to change:

sub("[^[:alnum:]]$", "");

in her suggestion to:

sub(/\r$/, "");

which will only remove a <carriage-return> character at the end of a record if there is one and won't take the chance of removing some other character if there is no <carriage-return>. If you mean that you want to remove all adjacent <circumflex> and <capital-latin-M> characters from the end of every record no matter how many of those characters appear at the end of each record, you need a completely different substitution:

sub(/[M^]*$/, "");
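The difference between the two substitutions can be checked with a short sketch like the following (the records and variable names are made up for illustration; it assumes an awk that accepts \r in a regular expression, as POSIX awk does):

```shell
# Made-up records: the first ends with a real <carriage-return>,
# the second with the two printable characters ^ and M.
with_cr=$(printf 'aa|bb \r')
with_caret_m='aa|bb ^M'

# Removes one trailing <carriage-return> and nothing else.
out_cr=$(printf '%s\n' "$with_cr" | awk '{sub(/\r$/, "")} 1')

# Removes every trailing ^ or M character, however many there are.
out_caret_m=$(printf '%s\n' "$with_caret_m" | awk '{sub(/[M^]*$/, "")} 1')

# Brackets make the surviving trailing <space> visible.
printf '[%s]\n[%s]\n' "$out_cr" "$out_caret_m"
```

Note that the second form would also strip a legitimate capital M at the end of a record, which is one more reason to find out what the data really contains first.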

If we assume that by "special characters in specific column like say 15th column" you mean characters in the 15th field that are not alphanumeric in the current locale, then the awk gsub() call nezabudka suggested will do what you want. But, of course, we really don't have any way to guess what characters you consider special since you haven't given us any definition for what you want that term to mean. And, we don't know if you really want to change the 15th field or some other field that contains data that is in some unspecified way similar to the data in the 15th field. If you mean that you want to specify an arbitrary field number to be processed by your awk or sed script, that of course would only be successfully met by using $15 only about 5 percent of the time in records containing twenty fields.

If nezabudka didn't correctly guess at what your requirements are, then PLEASE give us a clear English definition of what you are trying to do, show us a small representative sample input file, and show us the corresponding sample output file that contains the output you want to produce from that sample input file.

rakeshp · January 30, 2019, 3:04am

Thanks a lot, nezabudka. This code is working fine as of now. I will get back to you if I need any more clarifications!

--- Post updated 01-30-19 at 03:04 AM ---

nezabudka/Don Cragun,

This code is working fine, but it is removing the space at the end of the record before the ^M, and I do not want that space to be removed. Please let me know if we can modify this code.

sample data

input

aa|bb|##$$abch^^$$|xy ^M

output

aa|bb|##$$abch^^$$|xy ^M

Thanks
Rakesh

RudiC · January 30, 2019, 3:17am

Sure that output is what you're after (with the ^M )?

Difficult to believe. The code removes ONE single non-alphanumeric character at EOL. ^M if it is there, space if it is not.
Include "space" in the set to be excluded, or try

awk 'BEGIN {FS="|"; OFS="|"} {sub("[[:cntrl:]]$", ""); gsub("[^[:alnum:]]", "", $15)} 1' file
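A minimal sketch of the behaviour described here, using a made-up record that has no trailing <carriage-return> (it assumes an awk that supports POSIX character classes such as [:alnum:] and [:cntrl:]):

```shell
# Made-up record without a trailing <carriage-return>; it ends in a <space>.
line='aa|bb|xy '

# One trailing non-alphanumeric character is removed, here the <space>.
out_nonalnum=$(printf '%s\n' "$line" | awk '{sub("[^[:alnum:]]$", "")} 1')

# Only a trailing control character is removed, so the <space> survives.
out_cntrl=$(printf '%s\n' "$line" | awk '{sub("[[:cntrl:]]$", "")} 1')

# Brackets make a trailing <space> visible.
printf '[%s]\n[%s]\n' "$out_nonalnum" "$out_cntrl"
```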

nezabudka · January 30, 2019, 4:19am

If we assume that the fifteenth field is also the last one:

awk 'BEGIN {FS="|"; OFS="|"} {gsub("[^a-zA-Z0-9 ]", "", $NF)} 1' file

--- Post updated at 09:19 ---

awk 'BEGIN {FS="|"; OFS="|"} {sub("[\r]$", ""); gsub("[^a-zA-Z_0-9 ]", "", $15)} 1' file

gsub("[^[:alnum:][:space:]]"

Don_Cragun · January 30, 2019, 4:27am

Of course it is being removed. You refuse to define what you believe are special characters and the code that has been supplied assumes that all non-alphanumeric characters except (in some cases) the <vertical-bar> character are special (and that includes <space>). And you refuse to tell us whether ^M represents the two characters ^ and M or represents a single <carriage-return> character.

If you keep telling us that our code isn't working without answering our questions, we'll continue to make bad guesses about what you really mean and we'll all continue to be frustrated.

What command (including the utility name and the options you gave it) did you use to display the sample input and output you showed us in post #13 in this thread?

How do you expect a line containing three field delimiters to have twenty fields? The data you showed us in post #13 can't possibly be related to the problem you're trying to solve in this thread.

Please show us some representative sample input data and then show us the output you are hoping to get from that sample input.

Please answer our questions and help us help you! If you continue to refuse to answer our questions, it is obvious that we won't be able to guess at what you're really trying to do and we are all just wasting our time trying to help you.

rakeshp · January 30, 2019, 7:12am

Hi Don,

Except for space, all other non-numeric characters are treated as special characters. ^M represents two characters.

Thanks
Rakesh

--- Post updated at 07:10 AM ---

The actual data contains only 16 fields; I have shown only sample data here because I cannot show the actual data.

--- Post updated at 07:12 AM ---

I understand that I should show actual sample data, but I cannot do that here.

Don_Cragun · January 30, 2019, 9:09am

I never asked you to show us actual data; I only asked you to show us representative sample data. You could easily have taken a couple of examples of real data and replaced every occurrence of a numeric character with a "9" and replaced every occurrence of a lowercase alphabetic character with an "x" and every occurrence of an uppercase alphabetic character with an "X" (except for things like <circumflex><capital-latin-M> where you should have left the "M" as it appears in your real data).

Why you would need to tell us that there are 20 fields and that you want to remove special characters from field 15 when there are only 15 fields in your actual data makes absolutely no sense to me. Why you would explicitly want us to remove the characters "^" and "M" from the 15th field (which is also the last field in your actual data) and then remove any characters that are not numeric or <space> characters from the same field makes no sense to me. Since "^" and "M" are both non-<space> and non-<digit> characters, removing what you are calling "special" characters from the 15th field will remove "^" and "M" from the end of the record without adding any special code to just remove those two characters.

If the above accurately describes your real data and what you are trying to do to it (i.e., remove all characters that are not <space> and are not numeric from the last field of each record where each record contains 15 fields), then all you need is something like:

awk 'BEGIN {FS="|"; OFS="|"} {gsub("[^ [:digit:]]", "", $15)} 1'

If you want to remove all characters that are not <space> and are not numeric from the 15th field in each record where each record contains 15 or more fields, and you also want to remove the two-character sequence ^M from the end of each input record, then you need something more like:

awk 'BEGIN {FS="|"; OFS="|"} {sub(/\^M$/, ""); gsub("[^ [:digit:]]", "", $15)} 1' file
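As a rough check of this two-step approach, the sketch below runs it on one made-up 16-field record. Writing the literal circumflex as \^ in an ERE constant is a choice made for this sketch, and an awk with POSIX character-class support is assumed:

```shell
# Made-up record with 16 fields: field 15 mixes letters, a <space>,
# digits, and a special character; the record ends with the two
# printable characters ^ and M.
rec='1|2|3|4|5|6|7|8|9|10|11|12|13|14|ab 12#|X ^M'

out=$(printf '%s\n' "$rec" | awk '
BEGIN { FS = "|"; OFS = "|" }
{
    sub(/\^M$/, "")                   # drop a literal ^M at the end of the record
    gsub(/[^ [:digit:]]/, "", $15)    # keep only <space> and digits in field 15
}
1')

# Brackets make the surviving trailing <space> visible.
printf '[%s]\n' "$out"
```

Because sub() runs on the whole record first, awk re-splits the fields before gsub() touches $15, so the order of the two calls does not affect which field is the fifteenth.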

rakeshp · February 6, 2019, 1:31pm

Hi Don Cragun/RudiC/nezabudka/All,

As you suggested,Please see the sample data below replaced from actual data

XXXX99999999|x99999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999|9999999  999999|||X ^M
XXXX99999999|X99999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999|||X ^M
XXXX99999999|X99999999999|9999999999999|X|X|99999999|99999999|X|99999999|99999999|99999999||||X ^M
XXXX99999999|X99999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999||||X ^M
XXXX99999999|X9999999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999||||X ^M

Thanks
Rakesh Puli

Don_Cragun · February 6, 2019, 3:56pm

Thank you. That is a good start.

We now know that your input looks like what you described in post #1 in this thread and that it does not look like what you described in post #10 (where there were 20 fields and two of those fields were to be changed) and it does not look like what you described in post #13 (where there were 4 fields and absolutely nothing was supposed to be changed) and it does not look like what you described in post #17 update 1 (where there were 16 fields).

In addition to the number of fields changing, your description of what characters are to be considered "special" seems to vary from post to post.

If what you want to do is:

remove all adjacent special characters at the end of each record,
where a character is considered special if it is not a numeric character and it is not a <space> character,
the data you want to process is located in a file named file , and
you want the results of the above conversion to be written to standard output

then the following command should do what you want:

sed 's/[^ [:digit:]]*$//' file

This will not be sufficient if any other field is to be modified and it will not be sufficient if any other character is not to be treated as special.

Since you still have not shown us what output you hope to produce from the above sample input, we have no way of knowing which of your many different statements about what changes are to be made is the correct set of requirements. The above code produces the output:

XXXX99999999|x99999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999|9999999  999999|||X 
XXXX99999999|X99999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999|||X 
XXXX99999999|X99999999999|9999999999999|X|X|99999999|99999999|X|99999999|99999999|99999999||||X 
XXXX99999999|X99999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999||||X 
XXXX99999999|X9999999999999|999999999|X|X|99999999|99999999|X|99999999|99999999|99999999||||X

Note that the above output includes a single <space> character at the end of each line.
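A trailing <space> is easy to miss on a terminal. One way to make it visible, sketched here on a shortened, made-up record in the same shape as the sample data, is the sed l command, which marks the end of each line with a $:

```shell
# Shortened, made-up record ending in <space>, ^ and M.
out=$(printf '%s\n' '99|X|||X ^M' | sed 's/[^ [:digit:]]*$//')

# The l command writes each line in an unambiguous form and ends it
# with $, so a <space> just before the $ is the trailing <space>.
printf '%s\n' "$out" | sed -n 'l'
```

Piping the result through od -c would serve the same purpose.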