Duplicate rows in a text file

Notes: I am using Cygwin and Notepad++ only to check this, and my OS is XP.

#!/bin/bash
typeset -i totalvalue=(wc -w /cygdrive/c/cygwinfiles/database.txt)
typeset -i totallines=(wc -l /cygdrive/c/cygwinfiles/database.txt)
typeset -i columnlines=`expr $totalvalue / $totallines`
awk -F' ' -v columnlines=$columnlines '{ if($1==$columnlines) {print $0} }' /cygdrive/c/cygwinfiles/database.txt

This is my first script, so please bear with me. I just need your help. In the script:

totalvalue is the number of values in the data
totallines is the number of lines in the data
These two are needed to compute the number of columns.
(It is a pretty basic script, since I don't know much yet.)

If I have a data file that looks like:

aaa bbb ccc aaa
ccc eee ggg hhh
eee bbb eee eee

it should return the rows that contain duplicate values, so the output is

aaa bbb ccc aaa <two aaa's>
eee bbb eee eee <two eee's>

Any help would be appreciated.

Errors are returned, and I think they come from the variables, which do not seem to be recognized as integers.
The error message I get is: ")division by 0 (error token is "/c/cygwinfiles/database.txt)

---------- Post updated at 06:17 PM ---------- Previous update was at 06:15 PM ----------

The 2nd returned row contains three eee's (sorry about that).
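
A guess at the cause, not a confirmed diagnosis: in bash, `name=(words)` is an array assignment, while capturing a command's output needs `$(...)` or backticks. The stray `)` at the start of the message may also point to Windows (CRLF) line endings in the script file. A minimal sketch of the column count with command substitution, assuming a small temporary sample file (hypothetical) in place of database.txt:

```shell
#!/bin/bash
# Hypothetical sample data standing in for database.txt.
datafile=$(mktemp)
printf '%s\n' 'aaa bbb ccc aaa' 'ccc eee ggg hhh' 'eee bbb eee eee' > "$datafile"

# $(...) runs the command and captures its output. Reading from stdin
# (< file) keeps wc from printing the file name after the number.
totalvalue=$(wc -w < "$datafile")
totallines=$(wc -l < "$datafile")

# Arithmetic expansion replaces expr; the test guards against an empty file.
if [ "$totallines" -gt 0 ]; then
    columnlines=$(( totalvalue / totallines ))
    echo "columns per line: $columnlines"
fi
rm -f "$datafile"
```

This only works when every line has the same number of values. Inside awk, the built-in NF gives the field count of each line directly.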

awk '{for (i=1;i<=NF;i++) {if ($i in a) {print;break} else {a[$i]}};delete a}' infile
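
For readers who want to study the one-liner, here is the same logic spread over several lines with comments. The array name `seen` and the temporary sample file are my own choices for illustration; the behavior is meant to match the command above.

```shell
# Hypothetical sample file; the thread calls it "infile".
infile=$(mktemp)
printf '%s\n' 'aaa bbb ccc aaa' 'ccc eee ggg hhh' 'eee bbb eee eee' > "$infile"

dupes=$(awk '{
    for (i = 1; i <= NF; i++) {   # NF is the number of fields on this line
        if ($i in seen) {         # value already met earlier on the same line
            print                 # print the whole line once
            break                 # and stop scanning it
        }
        seen[$i]                  # referencing the element creates the key
    }
    delete seen                   # forget the values before the next line
}' "$infile")

printf '%s\n' "$dupes"
rm -f "$infile"
```

With the sample data from the first post, this prints the first and third rows.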

Try this awk file. You can run it with:

awk -f awkfile inputfile
{
   # Count how often each of the four fields occurs on this line.
   for(i=1;i<=4;i++)a[$i]++
   # Any count above 1 pushes the sum past 4, so the line has a duplicate.
   if(a[$1]+a[$2]+a[$3]+a[$4] > 4)
      printf "%s <",$0;
   for(i=1;i<=4;i++){
      if(a[$i]>2){
         printf "%d %s's ",a[$i],$i
         break;
      }else if(a[$i] == 2 && $i != save){
         printf "%d %s's ",a[$i],$i
         save=$i
      }
   }
   if(a[$1]+a[$2]+a[$3]+a[$4] > 4)
      printf ">\n"
   # Reset the per-line state before the next line is read.
   delete a
   save=""
}
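
The script above assumes exactly four columns. A sketch of the same idea for any number of fields, using NF instead of a fixed 4. The names `count`, `note` and `sep`, the temporary sample file, and the `<value xN>` annotation format are all my own assumptions, not taken from the thread.

```shell
# Hypothetical sample file with the data from the first post.
infile=$(mktemp)
printf '%s\n' 'aaa bbb ccc aaa' 'ccc eee ggg hhh' 'eee bbb eee eee' > "$infile"

report=$(awk '{
    for (i = 1; i <= NF; i++) count[$i]++   # tally each value on this line
    note = ""; sep = ""
    for (i = 1; i <= NF; i++) {
        if (count[$i] > 1) {                # value occurs more than once
            note = note sep $i " x" count[$i]
            sep = ", "
            count[$i] = 0                   # report each value only once
        }
    }
    if (note != "") print $0 " <" note ">"
    delete count                            # reset before the next line
}' "$infile")

printf '%s\n' "$report"
rm -f "$infile"
```

Zeroing the count after reporting a value is one simple way to avoid listing the same value twice on a line.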

awk '{for (i=1;i<=NF;i++) {if ($i in a) {print;break} else {a[$i]}};delete a}' infile

Whoa! It worked like a charm! What I'm going to do now is educate myself about this code. Thank you very much, rdcwayx!

@homeboy
I'm also thankful for your help. I just want to know why I can't run bash scripts properly in Cygwin. I'll try this at once and look for a program to run it on XP.

Thank you very much, guys.

---------- Post updated at 10:40 PM ---------- Previous update was at 08:01 PM ----------

Now I have an odd case.
If the data were

As1d Pooa1 982ah
ghqyqt1 ss92 a82ss
Bg1ja Bg1ja 13ss

how can I get an output of

Bg1ja Bg1ja 13ss

meaning that line has a duplicate?

This assumes all alphanumeric characters are used instead of lowercase letters only.
Would I use [A-Za-z0-9]? How would I inject it into the code?

I don't really understand. With my code, I still get the line:

Bg1ja Bg1ja 13ss

Are you asking for a case-insensitive match?

awk '{for (i=1;i<=NF;i++) {if (tolower($i) in a) {print;break} else {a[tolower($i)]}};delete a}' infile
awk '{for (i=1;i<=NF;i++) {if (tolower($i) in a) {print;break} else {a[tolower($i)]}};delete a}' infile

This is perfect!

This will help me a lot with my database learning in Unix.

So it actually compares the values in lower case but prints the line itself. Thank you again, rdcwayx!
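
To make that difference visible, the two commands from this thread can be run side by side. This is a small sketch, assuming a temporary sample file in which the third line is changed to differ only in letter case (my own variation on the data above). Since awk compares whole field strings, no character class such as [A-Za-z0-9] is needed.

```shell
# Hypothetical sample: the third line repeats a value with different case.
infile=$(mktemp)
printf '%s\n' 'As1d Pooa1 982ah' 'ghqyqt1 ss92 a82ss' 'Bg1ja bG1JA 13ss' > "$infile"

# Exact comparison: "Bg1ja" and "bG1JA" are different keys, so nothing prints.
exact=$(awk '{for (i=1;i<=NF;i++) {if ($i in a) {print;break} else {a[$i]}};delete a}' "$infile")

# Lower-cased keys: both fields become "bg1ja", and print still emits the original line.
folded=$(awk '{for (i=1;i<=NF;i++) {if (tolower($i) in a) {print;break} else {a[tolower($i)]}};delete a}' "$infile")

printf 'exact:  [%s]\n' "$exact"
printf 'folded: [%s]\n' "$folded"
rm -f "$infile"
```

Only the array keys are lower-cased; `$0` is never modified, which is why the printed line keeps its original capitalization.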