Remove lines if some data is the same: Centos5 / bash

oly_r · July 24, 2012, 9:33am

OK, what I have is 8 separate files, split by how many IPs are associated with each domain. I want to eliminate duplicate IPs within each individual file (i.e. I don't care if the same IP appears in both the 1 IP file and the 4 IP file). Can someone help me go through the files and remove any line that contains an IP already present on a previous line of that same file?

In the 2 IP file below, the What and Where lines should be removed, since each shares an IP with the Who line. In the 4 IP file, once the Spock line is removed (it shares an IP with Kirk), the file is OK, because the McCoy line no longer duplicates 10.100.200.200. Hopefully this is understandable (it makes sense to me).

I have been able to get rid of domains that have all the same IPs associated by using sort -k? -k? -u to ignore the first field, but I can't figure out how to take a single IP from one line and test it against the other lines.

I'm doing this on a CentOS 5.8 box in a bash script (whole lot more processing going on all around this portion).

EXAMPLES: Colon separated lines in each file
domain.com:ip:ip:

1 IP file

Any.com:192.168.10.100:
Where.edu:192.168.10.200:

2 IP file

Who.com: 192.168.10.300:192.168.10.200:
What.gov:10.0.0.150:192.168.10.300:
Where.biz:192.168.10.200:10.10.0.10:
When.tv:192.168.10.10:192.168.10.11:

4 IP file

Kirk.ufp:10.0.100.100:10.0.200.100:10.0.200.200:10.0.100.200:
Spock.vsa:10.100.100.100:10.100.100.200:10.100.200.200:10.0.100.100
Mccoy.ama:10.100.200.200:192.168.200.200:192.168.100.200:192.168.100.201

Corona688 · July 24, 2012, 1:13pm

What output do you want from this input?

oly_r · July 24, 2012, 1:28pm

Each file should end up with only one line containing any given IP. It sounds confusing even to me, and I'm the one asking for help. If an IP is already associated with a domain, I don't want it listed again.

1 IP file

Any.com:192.168.10.100:
Where.edu:192.168.10.200:

2 IP file

Who.com: 192.168.10.300:192.168.10.200:
When.tv:192.168.10.10:192.168.10.11:

4 IP file

Kirk.ufp:10.0.100.100:10.0.200.100:10.0.200.200:10.0.100.200:
Mccoy.ama:10.100.200.200:192.168.200.200:192.168.100.200:192.168.100.201
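
For reference, the per-file filtering asked for here (drop a line when any of its IPs already appeared on an earlier line that was kept) can be sketched in a single awk pass. This is only a sketch under assumptions: the function name dedupe_ips is made up for illustration, the input is assumed to follow the domain:ip:ip: format shown in the examples, and stray spaces or a missing trailing colon are tolerated.

```shell
# dedupe_ips FILE (hypothetical helper): print only the lines whose IPs
# have not appeared on an earlier *kept* line of the same file.
dedupe_ips() {
    awk -F: '
    {
        # Field 1 is the domain; fields 2..NF are IPs (may be empty or padded).
        for (i = 2; i <= NF; i++) {
            ip = $i
            gsub(/[ \t\r]/, "", ip)
            if (ip != "" && (ip in seen)) next   # shares an IP: drop the line
        }
        # Remember IPs from kept lines only, so a line that clashed solely
        # with a dropped line still survives.
        for (i = 2; i <= NF; i++) {
            ip = $i
            gsub(/[ \t\r]/, "", ip)
            if (ip != "") seen[ip] = 1
        }
        print
    }' "$1"
}
```

Run against the 2 IP sample, this should keep the Who and When lines. Because only kept lines are remembered, the McCoy line in the 4 IP sample survives once Spock is dropped. Usage would be something like dedupe_ips somefile > somefile.clean, with both file names being placeholders.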

oly_r · July 25, 2012, 1:50pm

Thanks anyway y'all. I decided to go another direction.

What I did was cat all the individual files together twice: once with the complete lines included, and once with just the IPs, sorted and unique. Then I loop over the IPs, checking each one against my output file to make sure it hasn't already been logged. If it hasn't, I grep the matching domain line from the full listing file and tail the last entry (the line with the most IPs associated with the domain). Lather, rinse, repeat.

Then I go through the output file, count the colons in each line, and write the line to the respective file based on the number of IPs associated with the domain.

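# Unique IPs only: split on colons, then drop blank lines and domain names.
cat ?_servers_ips | tr ':' '\n' | grep -Ev '^[[:space:]]*$|[[:alpha:]]' | sort -u > master_iplist
# Full listing; the glob sorts 1..8, so lines with the most IPs come last.
cat ?_servers_ips > all_working_ips
: > working_output

for i in {8..1}; do
     : > "working_${i}_ips"
done

while read -r myIP
   do
       # -F: dots are literal; -w: whole-IP match, so 10.0.0.1 does not hit 10.0.0.10
       if ! grep -qwF -- "$myIP" working_output; then
           grep -wF -- "$myIP" all_working_ips | tail -1 >> working_output
       fi
   done < master_iplist

while read -r testDomain
do
   # IPs on the line = number of colons minus the one after the domain
   count=$(grep -o ":" <<< "$testDomain" | wc -l)
   count=$((count - 1))
   if [ "$count" -ge 8 ]; then               #lump the couple of 8 or more into 1 file
        count=8
   fi
   printf '%s\n' "$testDomain" >> "working_${count}_ips"
done < working_output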

I'm sure there are ways to improve this.
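
One possible direction, offered as a sketch and not as a drop-in replacement: avoid the per-IP grep loop by sorting the lines so those with the most IPs come first, then letting awk keep a line only when none of its IPs has been claimed yet. This is a variation on the script above and does not reproduce it exactly. It is stricter, in that no IP ends up on two kept lines. The function name split_by_ip_count is hypothetical, the working_N_ips output names are taken from the script above, and GNU sort is assumed for the -s (stable) flag.

```shell
# split_by_ip_count FILE... (hypothetical helper): keep one line per IP,
# preferring lines with more IPs, and write each kept line to working_<N>_ips.
split_by_ip_count() {
    local i
    for i in 1 2 3 4 5 6 7 8; do : > "working_${i}_ips"; done

    # Stage 1: prefix each line with its count of non-empty IP fields.
    awk -F: '{
        n = 0
        for (i = 2; i <= NF; i++) if ($i !~ /^[ \t\r]*$/) n++
        if (n > 0) print n ":" $0
    }' "$@" |
    # Stage 2: most IPs first; -s keeps input order among equal counts.
    sort -s -t: -k1,1nr |
    # Stage 3: keep a line only if none of its IPs is already claimed.
    awk -F: '{
        n = $1
        line = substr($0, length($1) + 2)      # original line without the prefix
        for (i = 3; i <= NF; i++) {
            ip = $i
            gsub(/[ \t\r]/, "", ip)
            if (ip != "" && (ip in seen)) next
        }
        for (i = 3; i <= NF; i++) {
            ip = $i
            gsub(/[ \t\r]/, "", ip)
            if (ip != "") seen[ip] = 1
        }
        if (n > 8) n = 8                       # lump 8 or more into one file
        print line > ("working_" n "_ips")
    }'
}
```

It would be called as split_by_ip_count ?_servers_ips. Counting non-empty fields instead of colons means a line with no trailing colon still lands in the right file.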