Deleting consecutive equal values in a file

vamsikrishna928 · November 17, 2014, 11:34pm

Hello everyone,

I have the requirement shown below. I need to delete consecutive identical values from the source file and write the result to an output file.

Source:

a,b,c,d,e,e,f,g

Target:

a,b,c,d,f,g

The repeating value "e" should be deleted from the file completely. How can I achieve this with a Unix script? Thanks in advance.

Note: The position of the repeating string is not fixed; it may change.

RavinderSingh13 · November 17, 2014, 11:47pm

Hello Vamsikrishna928,

The following may help.

awk -F"," '{for(i=1;i<=NF;i++){if($i == V){$i="\b"} if(i==NF){ORS="";print $i"\n"} else {print $i;V=$i}}}' ORS="," Input_file

Output will be as follows.

a,b,c,d,e,f,g

Thanks,
R. Singh

vamsikrishna928 · November 17, 2014, 11:59pm

Thanks for the reply.

But the output should completely eliminate the matching string "e". The output file should look like this:

a,b,c,d,f,g

RavinderSingh13 · November 18, 2014, 12:12am

Hello Vamsikrishna928,

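The following is not very elegant, but it displays the required output on a terminal. It works by printing backspace characters, so the value "e" is only hidden on screen; it and the backspace bytes are still present if the output is redirected to a file.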

awk -F"," '{for(i=1;i<=NF;i++){if($i == V){$i="\b\b\b"} if(i==NF){ORS="";print $i"\n"} else {print $i;V=$i}}}' ORS="," Input_file

Output will be as follows.

a,b,c,d,f,g

Thanks,
R. Singh

Akshay_Hegde · November 18, 2014, 12:24am

Try

awk -F, 'NF{s="";for(i=1; i<=NF; i++){ if($i == $(i+1)){i+=1; continue } s = s (length(s)?OFS:"") $i }print s}' OFS=',' infile

$ echo 'a,b,c,d,e,e,f,g' | awk -F, 'NF{s="";for(i=1; i<=NF; i++){ if($i == $(i+1)){i+=1; continue } s = s (length(s)?OFS:"") $i }print s}' OFS=','
a,b,c,d,f,g

Scrutinizer · November 18, 2014, 1:11am

Try:

sed 's/\([^,]\{1,\}\),\1,\{0,1\}//g' file

Or with extended regular expressions (GNU sed -r / BSD sed -E):

sed -E 's/([^,]+),\1,?//g' file

Or Perl:

perl -pe 's/(.+?),\1,?//g' file

derekludwig · November 18, 2014, 5:38am

Not to rain on anyone's parade, but does "consecutive" mean exactly 2 occurrences?

perl -pe 's/(.+?)(,\1)+,?//g'

(stolen, er, adapted from Scrutinizer)

---------- Post updated at 05:38 AM ---------- Previous update was at 05:21 AM ----------

And I missed the boundary errors:

echo 'a,b,c,d,e,e,e,ee,f,g,g,gg' | perl -pe 's/(.+?)(,\1)+,?//g'
a,b,c,d,e,f,g

So let's go with:

echo 'a,b,c,d,e,e,e,ee,f,g,g,gg' | perl -pe 's/$/,/; s/([^,]+)(,\1)+,//g; s/,$//;'
a,b,c,d,ee,f,gg

vamsikrishna928 · November 18, 2014, 9:05pm

Thanks, everyone, for your responses.

I tried:

sed 's/\([^,]\{1,\}\),\1,\{0,1\}//g' file

However, it deletes all similar strings in the row, not only the ones that are exactly equal.

For example,
Source:

BS0000, BS0000 solution, CS0000, CS0000, CS0000 InterCompany

With the above code, the target comes out as:

solution,Intercompany

The requirement is to get the output as:

BS0000, BS0000 solution, CS0000 InterCompany

Only strings that are exactly equal should be eliminated (such as 'CS0000' in the above case).

Thanks!

derekludwig · November 18, 2014, 11:26pm

Does:

perl -pe 's/$/,/; s/([^,]+)(,\1)+,//g; s/,$//;'

not satisfy your requirements? If not, what is missing?

MadeInGermany · November 19, 2014, 4:32am

Add an extra delimiter, do a simple global substitution, remove the extra delimiter.

sed 's/$/,/;s/\([^,]*,\)\1//g;s/,$//' file

Same idea as the previous post.

derekludwig · November 19, 2014, 5:52am

And to handle empty fields, and the case where the number of consecutive equal fields is odd:

sed 's/$/,/; s/\([^,][^,]*,\)\1\{1,\}//g; s/,$//'
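To make the odd-count point concrete, here is a small sketch that wraps the two sed commands from this and the previous post in shell functions and runs them on a run of three. The function names drop_pairs and drop_runs are made up for this illustration, and the sketch assumes a sed that accepts an interval after a back-reference in a basic regular expression.

```shell
#!/bin/sh
# Hypothetical wrappers around the two sed commands quoted in the thread.

# Pair-only version: deletes a field together with one identical neighbour.
drop_pairs() {
  sed 's/$/,/; s/\([^,]*,\)\1//g; s/,$//'
}

# Run version: deletes a non-empty field together with all identical
# neighbours that follow it.
drop_runs() {
  sed 's/$/,/; s/\([^,][^,]*,\)\1\{1,\}//g; s/,$//'
}

printf 'a,e,e,e,f\n' | drop_pairs   # prints a,e,f  (one "e" is left over)
printf 'a,e,e,e,f\n' | drop_runs    # prints a,f
```

Neither function anchors the match to the start of a field, so both can still be fooled by a field that ends with the repeated value.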

MadeInGermany · November 19, 2014, 7:00pm

None of the solutions so far handles the following line correctly:

A,A,B,B,B,preB,B,Bpost,,,,B,C,D,D

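It looks like the \1 back-reference in an RE is not precise enough here: nothing anchors the match to the start of a field, so it can also match the tail of a longer field (the "B" at the end of "preB"). Maybe there is a solution in Perl RE and look-ahead?
Here is an awk solution, without RE: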

awk 'BEGIN {FS=RS; RS=","} $1==buf {c++; next} c==1 {printf sep"%s",buf; sep=RS} {buf=$1; c=1} END {if (c==1) printf sep"%s",buf; if (NR>0) printf FS}' file

Output:

preB,B,Bpost,B,C
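The one-liner above reads the file as a single stream of comma-terminated records. As a rough sketch of the same "compare whole fields, no regex back-reference" idea applied line by line, something like the following could be used. The function name drop_repeats and the blank-trimming behaviour are assumptions added for this illustration; nobody in the thread posted this code.

```shell
#!/bin/sh
# drop_repeats (hypothetical name): for each input line, delete every run
# of two or more adjacent comma-separated fields that are exactly equal.
# Assumption: blanks around a field are ignored when comparing, so that
# "CS0000, CS0000" counts as a repeat.
drop_repeats() {
  awk -F',' '
    # Strip leading and trailing blanks; the "" forces a string result
    # so that fields are never compared as numbers.
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s "" }
    {
      out = ""; kept = 0
      for (i = 1; i <= NF; i = j) {
        # Advance j to one past the last field equal to field i.
        j = i + 1
        while (j <= NF && trim($j) == trim($i)) j++
        # A run of length one is kept as written; longer runs are dropped.
        if (j - i == 1)
          out = out (kept++ ? "," : "") $i
      }
      sub(/^[ \t]+/, "", out)
      print out
    }'
}

printf 'A,A,B,B,B,preB,B,Bpost,,,,B,C,D,D\n' | drop_repeats
# prints preB,B,Bpost,B,C
printf 'BS0000, BS0000 solution, CS0000, CS0000, CS0000 InterCompany\n' | drop_repeats
# prints BS0000, BS0000 solution, CS0000 InterCompany
```

Because whole fields are compared, "preB" and "B" are never confused, and runs of any length (including runs of empty fields) are removed.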

drl · November 19, 2014, 7:37pm

Hi.

A quickly-cobbled-together solution using standard utilities on MadeInGermany's data:

#!/usr/bin/env bash

# @(#) s1	Demonstrate omit sequential repeated strings on lines.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C tr uniq sed

FILE=${1-data1}

pl " Input data file $FILE:"
cat "$FILE"

pl " Expected results:"
cat expected-output.txt

pl " Results:"
while IFS= read -r line
do
  printf "%s\n" "$line" |
  tr ',' '\n' |
  uniq -u |
  tr '\n' ',' |
  sed 's/,$//'
  printf "\n"
done < "$FILE"

exit 0

producing:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
Distribution        : Debian 5.0.8 (lenny, workstation) 
bash GNU bash 3.2.39
tr (GNU coreutils) 6.10
uniq (GNU coreutils) 6.10
sed GNU sed version 4.1.5

-----
 Input data file data1:
A,A,B,B,B,preB,B,Bpost,,,,B,C,D,D

-----
 Expected results:
preB,B,Bpost,B,C

-----
 Results:
preB,B,Bpost,B,C

Best wishes ... cheers, drl
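The key step in the script above is uniq -u, which prints only those lines that are not repeated next to each other, so once each field is on its own line, every run of two or more disappears. A more compact sketch of the same pipeline, using paste to rejoin the fields instead of tr plus sed, might look like this. The function name drop_runs_uniq is made up for this illustration.

```shell
#!/bin/sh
# drop_runs_uniq (hypothetical name): same idea as the script above,
# reading lines from stdin instead of a named file.
drop_runs_uniq() {
  while IFS= read -r line; do
    printf '%s\n' "$line" |
      tr ',' '\n' |     # one field per line
      uniq -u |         # keep only fields with no identical neighbour
      paste -sd, -      # join the surviving fields with commas again
  done
}

printf 'A,A,B,B,B,preB,B,Bpost,,,,B,C,D,D\n' | drop_runs_uniq
# prints preB,B,Bpost,B,C
```

This variant assumes a plain comma separator with no surrounding blanks, since uniq compares the lines exactly as written.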

derekludwig · November 20, 2014, 8:06pm

I have to admit that my regex-fu is not up to this task, at this time.
Vamsikrishna928, thank you for the challenge.

All, if there is a regex solution, please post.

ongoto · November 20, 2014, 10:01pm

If the data always uses a comma or comma-space separator, then this might work.
Note that it modifies the original data file.

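#!/bin/bash
# File data1 contains:
#   BS0000, BS0000 solution, CS0000, CS0000, CS0000 InterCompany
#   a,b,c,d,e,e,f,g
# Caveat: a value is treated as repeated if it occurs more than once
# anywhere in data1, whether or not the occurrences are adjacent.
cp ./data1 /tmp/data
sed -i 's/,\s*/\n/g' /tmp/data    # one field per line (GNU sed)
while read -r x
do
    if [ "$(grep -Ec "^$x$" /tmp/data)" -gt 1 ]; then
        sed -i "s/\b$x,\s*//g" ./data1
    fi
done < /tmp/data
rm /tmp/data
cat ./data1
### eof #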

output
--------
BS0000, BS0000 solution, CS0000 InterCompany
a,b,c,d,f,g