Deleting consecutive equal values in a file

Hello everyone,

I have a requirement as shown below. I need to delete consecutive equal values from the source file and write the result to an output file.

Source:

a,b,c,d,e,e,f,g

Target:

a,b,c,d,f,g

The repeating value "e" should be deleted from the file completely. How can I achieve this with a Unix shell script? Thanks in advance.

Note: The position of the repeating string is not always the same. It may change.

Hello Vamsikrishna928,

The following may help.

awk -F"," '{for(i=1;i<=NF;i++){if($i == V){$i="\b"} if(i==NF){ORS="";print $i"\n"} else {print $i;V=$i}}}' ORS="," Input_file

Output will be as follows.

a,b,c,d,e,f,g

Thanks,
R. Singh

Thanks for the reply..

But the output should completely eliminate the matching string "e". The output file should look like this:

a,b,c,d,f,g

Hello Vamsikrishna928,

The following may help. It is not especially elegant, but it produces the required output.

awk -F"," '{for(i=1;i<=NF;i++){if($i == V){$i="\b\b\b"} if(i==NF){ORS="";print $i"\n"} else {print $i;V=$i}}}' ORS="," Input_file

Output will be as follows.

a,b,c,d,f,g

Thanks,
R. Singh
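
A note on the backspace approach: "\b" only moves the terminal cursor back, so the duplicate is overwritten on screen while the bytes stay in the output. A quick way to check is to dump the bytes with od -c. The sketch below feeds the sample line through a pipe instead of Input_file and passes ORS with -v; the function name backspace_demo is made up for the illustration, and a single-line input is assumed.

```shell
# backspace_demo: hypothetical wrapper around the awk command above.
# ORS is passed with -v here; the awk program itself is unchanged.
backspace_demo() {
  printf 'a,b,c,d,e,e,f,g\n' |
  awk -F"," -v ORS="," '{for(i=1;i<=NF;i++){if($i == V){$i="\b\b\b"} if(i==NF){ORS="";print $i"\n"} else {print $i;V=$i}}}'
}

# od -c prints each backspace byte as \b, so they become visible.
backspace_demo | od -c
```

If \b entries show up in the dump, a file written this way still carries the extra bytes even though a terminal displays a,b,c,d,f,g.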

Try

awk -F, 'NF{s="";for(i=1; i<=NF; i++){ if($i == $(i+1)){i+=1; continue } s = s (length(s)?OFS:"") $i }print s}' OFS=',' infile
$ echo 'a,b,c,d,e,e,f,g' | awk -F, 'NF{s="";for(i=1; i<=NF; i++){ if($i == $(i+1)){i+=1; continue } s = s (length(s)?OFS:"") $i }print s}' OFS=','
a,b,c,d,f,g

Try:

sed 's/\([^,]\{1,\}\),\1,\{0,1\}//g' file

Or GNU -r / BSD -E:

sed -E 's/([^,]+),\1,?//g' file

Or perl

perl -pe 's/(.+?),\1,?//g' file

Not to rain on anyone's parade, but does consecutive == 2 occurrences?

perl -pe 's/(.+?)(,\1)+,?//g'

(stolen, er, adapted from Scrutinizer)

And I missed the boundary errors:

echo 'a,b,c,d,e,e,e,ee,f,g,g,gg' | perl -pe 's/(.+?)(,\1)+,?//g'
a,b,c,d,e,f,g

So let's go with:

echo 'a,b,c,d,e,e,e,ee,f,g,g,gg' | perl -pe 's/$/,/; s/([^,]+)(,\1)+,//g; s/,$//;'
a,b,c,d,ee,f,gg

Thanks everyone for your response.

I tried with

sed 's/\([^,]\{1,\}\),\1,\{0,1\}//g' file

But the output comes out in such a way that it deletes all the similar strings in a row, not only the exact duplicates.

For example,
Source:

BS0000, BS0000 solution, CS0000, CS0000, CS0000 InterCompany

With the above code, the target comes out as:

solution,Intercompany

The requirement is to get the output as:

BS0000, BS0000 solution, CS0000 InterCompany

Only strings that are exactly equal should be eliminated (as with 'CS0000' in the above case).

Thanks!
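
The requirement as stated here (drop every run of two or more exactly equal adjacent fields, where a field separator is a comma plus optional spaces) could also be sketched field by field in awk instead of with a regex. This is only an illustration: the function name drop_runs is made up, and it assumes the output may be re-joined with one fixed separator given as the first argument (default ",").

```shell
# drop_runs: hypothetical helper. Reads lines on stdin, splits on a comma
# followed by any number of spaces, and removes every run of two or more
# identical adjacent fields. $1 is the separator used to re-join the line.
drop_runs() {
  awk -F', *' -v out="${1:-,}" '{
    s = ""; n = 0
    for (i = 1; i <= NF; i = j) {
      # advance j to the first field that differs from $i;
      # appending "" forces a string comparison, not a numeric one
      for (j = i + 1; j <= NF && ($j "") == ($i ""); j++);
      if (j - i == 1)                 # run of length 1: keep the field
        s = s (n++ ? out : "") $i
    }
    print s
  }'
}

printf '%s\n' 'BS0000, BS0000 solution, CS0000, CS0000, CS0000 InterCompany' | drop_runs ', '
printf '%s\n' 'a,b,c,d,e,e,f,g' | drop_runs
```

Because whole fields are compared, "BS0000" and "BS0000 solution" are treated as different values and both are kept.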

Does:

perl -pe 's/$/,/; s/([^,]+)(,\1)+,//g; s/,$//;'

not satisfy your requirements? If not, what is missing?

Add an extra delimiter, do a simple global substitution, remove the extra delimiter.

sed 's/$/,/;s/\([^,]*,\)\1//g;s/,$//' file

Same idea as the previous post.

And to handle empty fields and the case where the number of consecutive fields is odd:

sed 's/$/,/; s/\([^,][^,]*,\)\1\{1,\}//g; s/,$//'

None of the solutions handles the following input correctly:

A,A,B,B,B,preB,B,Bpost,,,,B,C,D,D

It looks like the \1 back-reference in an RE is not precise enough here, because nothing ties it to a field boundary. Maybe there is a solution using a Perl RE with look-ahead?
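
To make the imprecision concrete: in the sed commands earlier in the thread the group is tied to a delimiter on its right side only, so it may begin in the middle of a field. A small illustration with the last of those sed commands (wrapped in a function called sed_attempt purely for this example):

```shell
# sed_attempt: the "extra trailing delimiter" sed command from this thread.
sed_attempt() {
  sed 's/$/,/; s/\([^,][^,]*,\)\1\{1,\}//g; s/,$//'
}

# The group matches "B," at the tail of "preB," and the back-reference then
# matches the standalone "B,", so both are deleted together.
printf '%s\n' 'preB,B,Bpost' | sed_attempt   # prints: preBpost
```

No field is repeated in that input, yet two pieces of it disappear.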
Here is an awk solution - without RE:

awk 'BEGIN {FS=RS; RS=","} $1==buf {c++; next} c==1 {printf sep"%s",buf; sep=RS} {buf=$1; c=1} END {if (c==1) printf sep"%s",buf; if (NR>0) printf FS}' file 

Output:

preB,B,Bpost,B,C

Hi.

A quickly-cobbled-together solution using standard utilities on MadeInGermany's data:

#!/usr/bin/env bash

# @(#) s1	Demonstrate omit sequential repeated strings on lines.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C tr uniq sed

FILE=${1-data1}

pl " Input data file $FILE:"
cat "$FILE"

pl " Expected results:"
cat expected-output.txt

pl " Results:"
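# Quote the variables and use read -r so that lines containing spaces
# or backslashes are passed through unchanged.
while IFS= read -r line
do
  printf "%s\n" "$line" |
  tr ',' '\n' |
  uniq -u |
  tr '\n' ',' |
  sed 's/,$//'
  printf "\n"
done < "$FILE"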

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
tr (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10
sed GNU sed version 4.1.5

-----
 Input data file data1:
A,A,B,B,B,preB,B,Bpost,,,,B,C,D,D

-----
 Expected results:
preB,B,Bpost,B,C

-----
 Results:
preB,B,Bpost,B,C

Best wishes ... cheers, drl

I have to admit that my regex-fu is not up to this task, at this time.
Vamsikrishna928, thank you for the challenge.

All, if there is a regex solution, please post.

If the data list always has a comma or comma-space separator, then this might work.
Note that it modifies the original data file.

#!/bin/bash
# File data1 contains:
#   BS0000, BS0000 solution, CS0000, CS0000, CS0000 InterCompany
#   a,b,c,d,e,e,f,g
cp ./data1 /tmp/data
sed -i 's/,\s*/\n/g' /tmp/data
while read -r x
do
    # grep -E replaces the obsolescent egrep; the count is quoted for the test
    if [ "$(grep -Ec "^$x$" /tmp/data)" -gt 1 ]; then
        sed -i "s/\b$x,\s*//g" ./data1
    fi
done < /tmp/data
rm /tmp/data
cat ./data1
### eof #

output
--------
BS0000, BS0000 solution, CS0000 InterCompany
a,b,c,d,f,g