Remove duplicate occurrences of text pattern

Hi folks!

I have a file which contains 1000 lines. On each line I have multiple occurrences (26 to be exact) of the pattern folder#/folder#.

Here # stands for the line number in the file:

some text here folder1/folder1 some text here folder1/folder1 some text here folder1/folder1 some text here 
some text here folder2/folder2 some text here folder2/folder2 some text here folder2/folder2 some text here 
some text here folder3/folder3 some text here folder3/folder3 some text here folder3/folder3 some text here 
.... up to line 1000

I'm trying to remove the duplicate occurrences so I end up with the following:

some text here folder1/ some text here folder1/ some text here folder1/ some text here 
some text here folder2/ some text here folder2/ some text here folder2/ some text here 
some text here folder3/ some text here folder3/ some text here folder3/ some text here 
.... up to line 1000

Thanks so much for any help!

Hello martinsmith,

Could you please try the following and let me know if it helps.
1st: If your complete data looks exactly like what you have shown, meaning each line carries its own line number in the Line label and Input_file has no more than 4 fields per line, then the following may help.

awk '{Line="Line" NR":";Folder="folder" NR"/";print Line OFS Folder OFS Folder OFS Folder}' Input_file

2nd: If your data may differ, for example different line numbers or a varying number of columns (assuming the columns that hold the duplicated string consist of only 2 parts separated by /), then the following may help.

awk '{for(i=2;i<=NF;i++){split($i, A,"/");if(A[1]==A[2]){Q=Q?Q OFS A[1] "/":A[1] "/"} else {Q=Q?Q OFS $i:$i};}print $1 OFS Q;Q=""}'  Input_file

The output will be as follows in both cases.

Line1: folder1/ folder1/ folder1/
Line2: folder2/ folder2/ folder2/
Line3: folder3/ folder3/ folder3/
Line4: folder4/ folder4/ folder4/
Line5: folder5/ folder5/ folder5/

Thanks,
R. Singh


Hi R. Singh,

Thanks very much. Your solution does work on the sample as I first posted it. Unfortunately I did not describe my issue clearly enough, so it did not solve my actual problem. I have updated the question with more detail.

So basically each line holds a whole bunch of different text, and within that text there are 26 occurrences of folder#/folder# at various places. I just need the duplicate part removed.

Thanks so much!

Hello martinsmith,

My 2nd command from post #2 does work for this; here it is again.
Let's say we have the following Input_file:

some text here folder1/folder1 some text here folder1/folder1 some text here folder1/folder1 some text here 
some text here folder2/folder2 some text here folder2/folder2 some text here folder2/folder2 some text here 
some text here folder3/folder3 some text here folder3/folder3 some text here folder3/folder3 some text here

When I run the code as follows:

awk '{for(i=2;i<=NF;i++){split($i, A,"/");if(A[1]==A[2]){Q=Q?Q OFS A[1] "/":A[1] "/"} else {Q=Q?Q OFS $i:$i};}print $1 OFS Q;Q=""}'  Input_file

Output will be as follows.

some text here folder1/ some text here folder1/ some text here folder1/ some text here
some text here folder2/ some text here folder2/ some text here folder2/ some text here
some text here folder3/ some text here folder3/ some text here folder3/ some text here
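
For anyone reading along, here is a spelled-out sketch of the same idea as the one-liner above: split every space-separated field on /, and shorten it only when both halves are equal. It is not identical to the original command. The function name dedupe_fields and the sample data are made up for illustration, the first field is checked too, and a field is only shortened when it has exactly two parts. Like the one-liner, it rebuilds the line, so runs of spaces or tabs collapse to single spaces.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real 1000-line file.
sample=$(mktemp)
printf '%s\n' \
  'some text here folder1/folder1 some text here folder1/folder1 some text here' \
  'some text here folder2/folder2 some text here folder2/folder2 some text here' > "$sample"

# dedupe_fields is an illustrative name, not something from the thread.
# It reads the files given as arguments, or standard input if there are none.
dedupe_fields() {
  awk '{
    out = ""
    for (i = 1; i <= NF; i++) {
      n = split($i, part, "/")
      # Shorten only fields of the exact form X/X: two non-empty halves
      # that are equal when compared as strings.
      if (n == 2 && part[1] != "" && (part[1] "") == (part[2] ""))
        field = part[1] "/"
      else
        field = $i
      # Rebuild the line with OFS (a single space by default).
      out = (i == 1) ? field : out OFS field
    }
    print out
  }' "$@"
}

dedupe_fields "$sample"
```

If the original spacing of the line matters, a substitution-based approach (sed or perl) is probably the safer choice, since it leaves everything outside the match untouched.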

Thanks,
R. Singh

1 Like

Please, try:

 perl -pe 's:(folder\d+)/\1:$1/:g' martinsmith.file
some text here folder1/ some text here folder1/ some text here folder1/ some text here
some text here folder2/ some text here folder2/ some text here folder2/ some text here
some text here folder3/ some text here folder3/ some text here folder3/ some text here

Hi R. Singh,

Yes the 2nd one worked perfectly. I overlooked it!

awk '{for(i=2;i<=NF;i++){split($i, A,"/");if(A[1]==A[2]){Q=Q?Q OFS A[1] "/":A[1] "/"} else {Q=Q?Q OFS $i:$i};}print $1 OFS Q;Q=""}'  Input_file

Thanks so much for your help. It's much appreciated, will save me a lot of work :smiley:

Cheers

You could also try:

sed 'sX \([^ ]*\)/\1X \1/Xg' file
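
A few notes on the sed command above, with a runnable sketch. The X right after s is just the delimiter, chosen so the / in the pattern needs no escaping. The pattern requires a space before the repeated word, so it handles any word/word pair (not only folder#), but it would skip a pair sitting at the very start of a line. The second command below is one possible way to cover that case with an extra expression; the sample line and variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample line: one pair at the start of the line, two after a space.
sample=$(mktemp)
printf '%s\n' 'folder1/folder1 some text folder1/folder1 other x/x' > "$sample"

# Command as posted, with | instead of X as the delimiter (same behavior).
as_posted=$(sed 's| \([^ ]*\)/\1| \1/|g' "$sample")

# Variant: a first expression anchored with ^ handles a pair at the line start.
with_start=$(sed -e 's|^\([^ ]*\)/\1|\1/|' -e 's| \([^ ]*\)/\1| \1/|g' "$sample")

printf '%s\n' "$as_posted" "$with_start"
```

Since every line in the question starts with other text, the command as posted should be enough for that file.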

[quote=aia;302964465]
Please, try:

 perl -pe 's:(folder\d+)/\1:$1/:g' martinsmith.file
[/quote]

This too worked just as well. Thank you Aia :b: