Looping in case of duplicates

barath · October 11, 2013, 8:50am

14`~abc`~9`~11
14`~abc`~9`~10
36`~ee`~7`~9
36`~ee`~8`~9
58`~rtt`~12`~7
70`~gff`~13`~8
86`~tyu`~6`~12
86`~tyu`~6`~13
92`~mjh`~5`~6
28`~jkl`~10`~DNA
32`~mjk`~SNA`~5
82`~jkli`~11`~DNA

------------------------------------

Field separator: `~
The concept is to start with the line that has SNA in the 3rd field.
Fetch its 4th field, search for that value in the
3rd field of some other line, save that line's 1st field, get that line's 4th field, and so on until
I reach DNA in the 4th field.
I have already developed part of the script, which does that. But I am not sure how to handle the case where the 4th-field value I search for occurs more than once in the 3rd field.
I am getting the below output:

32~92~86~58~36~14~82

Expected output for the above example:

32~92~86~58~36~14~82
32~92~86~58~36~14~28
32~92~86~70~36~14~82
32~92~86~70~36~14~28

Could someone please help me with the logic for this in a Unix shell script?

Thanks in advance.

RudiC · October 11, 2013, 9:17am

Please use code tags as required by forum rules!

Why don't you post your script that solves part of the problem, and why don't you specify the problem correctly and in detail? E.g., where does the 32 at the start of each solution come from? From the SNA line, I know, but don't leave us guessing. Also, you should clearly state that ALL four solutions are needed, i.e. duplicates in $3 are equivalent.

barath · October 15, 2013, 5:00am

 
complinagetest () #function name
{
if [ -f complins.dat ];then
rm complins.dat
fi
touch complins.dat
i=0
while read line
do
if [ $line == "SNA" ]; then
 
 va=`grep -w "$line" datalins1.dat | awk  BEGIN'{FS="\`~"}{if ( $3=="'$line'" ) {print $4}}'`
 i=$(($i+1))
 varits=$(echo $va|awk -v varif="$i" '{print $varif}')
 
      if [ "$varits" = "DNA" ]; then
          grep -w "$varits" datalins1.dat | awk  BEGIN'{FS="`~"}{if ( $3=="'$line'" ) {print $1}}'|sed 's/$/~>/' >> complins.dat
        else 
       grep -w "$varits" datalins1.dat | awk  BEGIN'{FS="`~"}{if ( $3=="'$line'" ) {print $1}}'| tr '\n' '~' | sed 's/~/~>/g'   >> complins.dat
 
 
 while [ "$varits" != "DNA" ]
 do
 
                while read line
                do
 
                  if [ $line == "$varits" ] ; then
                varits=`grep -w "$line" datalins1.dat | awk  BEGIN'{FS="\`~"}{if ( $3=='$line' ) {print $4}}'`
                check=`echo $varits|awk '{print NF}'`
 
                        if [ "$check" == "1" ];then 
                echo "inside if"                  
                grep -w "$varits" datalins1.dat  | awk  BEGIN'{FS="`~"}{if ( $3=='$line' ) {print $1}}'| tr '\n' '~'| sed 's/~/~>/' >> complins.dat
 
                        else
 
    echo "problem occurs here in else case when $check > 1(ie. $4 from datalins1.dat has two number for same value of $1)"
 
                         fi
 
                fi
 
                done < source.dat
 
 done
      fi
echo "" >> complins.dat
fi
done < source.dat
 
}
 
complinagetest #calling function

--------------------------------------------------------------------------
The input files for the above function are datalins1.dat and source.dat.
The output file for the above function is complins.dat.

#datalins1.dat
14`~abc`~9`~11
14`~abc`~9`~10
36`~ee`~7`~9
36`~ee`~8`~9
58`~rtt`~12`~7
70`~gff`~13`~8
86`~tyu`~6`~12
86`~tyu`~6`~13
92`~mjh`~5`~6
28`~jkl`~10`~DNA
32`~mjk`~SNA`~5
82`~jkli`~11`~DNA

#source.dat
9
7
8
12
13
6
5
10
SNA
11

-----------------------------------------------------------------------------
Field separator: `~
The concept is to start with the line that has SNA in the 3rd field.
Fetch its 4th field, search for that value in the
3rd field of some other line, save that line's 1st field, get that line's 4th field, and so on until
I reach DNA in the 4th field.
I have already developed part of the script, which does that. But I am not sure how to handle the case where the 4th-field value I search for occurs more than once in the 3rd field.
------------------------------------------------------------------
Expected output for the above example:

#complins.dat
32~92~86~58~36~14~82
32~92~86~58~36~14~28
32~92~86~70~36~14~82
32~92~86~70~36~14~28

-------------------------------------------------------------------
The function works perfectly if there is only a single number in $4 of datalins1.dat for the same value of $1.

Could someone please help me with the logic for this in a Unix shell script?

Thanks in advance.

alister · October 15, 2013, 11:58am

How many records (lines) in a typical datalins1.dat file? Are they actually as small as your sample (12 lines)? Or are they much larger?

Regards,
Alister

barath · October 15, 2013, 12:09pm

Yes, the file (datalins1.dat) could be much larger; it is not necessarily only 12 lines (the 12 lines are just for the above example). But the format of the file will always be the same (four fields separated by the delimiter `~).

alister · October 15, 2013, 1:12pm

What do you consider "much larger"? A hundred lines? A thousand? A million? A billion? A simple solution may scale from 12 to 1000, but perhaps not to a million and beyond.

It would be a shame for someone to waste their time crafting code that can never be used because it takes forever to complete or because it requires more memory than the system has available. So, please, be more precise than "much larger". Also, if the file can approach the size of your system's memory, you should definitely mention that.

Regards,
Alister

barath · October 15, 2013, 1:33pm

Sorry for the inconvenience......much larger doesn't meant here that it can go to thousand or million lines...the above code is crafted on the basis of similar kind(datalins1.dat) of input scenario...

Thanks

ctsgnb · October 15, 2013, 6:45pm

Check this link as well as this one
What you are trying to do is related to a tree walk (chained list); have a look at the algorithm used to build leaf paths.
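
To make the tree-walk idea concrete, here is a rough sketch of a depth-first walk that keeps its own stack of (key, path so far) pairs, so every duplicate in the 3rd field becomes a separate branch. This is not the algorithm from the links above, just one possible shape of it; the function name walk_paths and the array names are invented for the example.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): print every chain from the SNA
# line to a line whose 4th field matches no 3rd field (e.g. DNA).
# Uses an explicit stack instead of recursion.
walk_paths() {
    awk -F'`~' '
    {
        id[NR]  = $1                                     # 1st field: value to collect
        nxt[NR] = $4                                     # 4th field: key to search next
        if ($3 in kids) kids[$3] = kids[$3] "," NR       # lines grouped by 3rd field
        else            kids[$3] = NR
    }
    END {
        top = 0
        key[++top] = "SNA"; path[top] = ""
        while (top > 0) {
            k = key[top]; p = path[top]; top--           # pop one pending branch
            if (!(k in kids)) {                          # nothing to follow: chain complete
                print substr(p, 2)
                continue
            }
            n = split(kids[k], ln, ",")
            for (i = n; i >= 1; i--) {                   # push in reverse so the first
                key[++top] = nxt[ln[i]]                  # duplicate is expanded first
                path[top]  = p "~" id[ln[i]]
            }
        }
    }' "$1"
}
```

With the sample data saved as datalins1.dat, walk_paths datalins1.dat prints the four expected lines, starting with 32~92~86~58~36~14~82. Like any such walk, it assumes the data has no loops.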

Chubler_XL · October 15, 2013, 10:44pm

Using awk:

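awk -F'`~' '
# Follow every line whose 3rd field is n; pre is the path built so far.
function from(n,pre,i,v,x)
{
    if(n in T) {
       v=split(T[n],x,",")
       for(i=1; i<=v;i++)
          from(F[x[i]], pre "~" L[x[i]]);
    } else print substr(pre,2);   # no match (e.g. DNA): the path is complete
}
{
   L[NR]=$1;                      # 1st field of each line
   F[NR]=$4;                      # 4th field: the key to look up next
   if($3 in T) T[$3]=T[$3]","NR;  # line numbers grouped by 3rd field
   else T[$3]=NR;
}
END { from("SNA") }' infile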

---------- Post updated at 12:44 PM ---------- Previous update was at 10:57 AM ----------

This can also be done with bash. Associative arrays are only available from bash 4 onwards, so to keep this working on older versions I used a dummy number (9999) for SNA. Bash will also have tighter memory constraints, so it will fail on smaller files than the awk solution:

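#!/bin/bash
# $1 = key to look up in the 3rd field, $2 = path built so far
function from()
{
 local pre=$2
 if [ ${#T[$1]} -gt 0 ]
 then
     # T[key] holds the numbers of all lines whose 3rd field is key
     set -- ${T[$1]}
     while [ $# -gt 0 ]
     do
        from "${F[$1]}" "$pre~${L[$1]}"
        shift
     done
  else
     # no line has this key in the 3rd field: the path is complete
     echo "${pre:1}"
  fi
}

while read -r line
do
   set -- ${line//\`~/ }
   ((L[++ln]=$1))
   # DNA is not a number, so this stores 0, a key with no entry in T
   ((F[ln]=$4))
   [ "$3" = "SNA" ] && T[9999]="${T[9999]} $ln" || T[$3]="${T[$3]} $ln"
done < infile

from 9999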
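
For anyone on bash 4 or later, the dummy number can be avoided by declaring T as an associative array and using the 3rd field itself as the key. The following is only a sketch along those lines, not a replacement tested beyond the sample data. The function name load_lines and the tab-based splitting are choices made for this example, and it assumes that no field is empty or contains a tab.

```shell
#!/bin/bash
# Sketch for bash 4+ (not from the thread): keys such as SNA are used directly.
declare -A T          # T[key] = line numbers whose 3rd field is key
declare -a L F        # L[n] = 1st field of line n, F[n] = 4th field of line n

load_lines() {
    local line f1 f2 f3 f4 rest ln=0
    local sep='`~' tab=$'\t'
    while IFS= read -r line; do
        line=${line//"$sep"/$tab}            # two-character separator -> tab
        IFS=$tab read -r f1 f2 f3 f4 rest <<< "$line"
        ln=$((ln + 1))
        L[ln]=$f1
        F[ln]=$f4
        T[$f3]="${T[$f3]:-} $ln"             # append this line number to its key
    done < "$1"
}

from() {
    local key=$1 pre=$2 n
    if [[ -n ${T[$key]:-} ]]; then
        for n in ${T[$key]}; do              # one branch per duplicate
            from "${F[n]}" "$pre~${L[n]}"
        done
    else
        echo "${pre:1}"                      # nothing to follow: print the path
    fi
}
```

Usage would be load_lines infile followed by from SNA "". Because the 4th field is stored as a string, DNA is kept as DNA rather than being turned into 0 by arithmetic evaluation.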