Urgent help please: how to extract two lines having the same starting number

Hi ,

I have a huge file like this

=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3

Here I have to pull out the lines where two =035 lines occur in a row, instead of the usual alternation of =035 and =245, i.e. extract abc125 and abc126. Is there any command or script for this? Please help.

Regards
uma

So if two lines begin the same then pull out those values? Or if two lines are exactly =035, then extract the rest of the line? Or if any number of consecutive lines are =035 then print out them all? Give this a go:

awk 'last==$1 { if (value) print value; print $2; value=""; next; } { last=$1;value=$2 }' filename.txt
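For the third reading of the question (print every run of consecutive lines that share the same first field), a minimal Python sketch might look like the following. The function name `consecutive_runs` and the inline sample are made up for illustration, and the code assumes whitespace-separated fields.

```python
from itertools import groupby

def consecutive_runs(lines):
    """Return the second field of every line that sits in a run of two or
    more consecutive lines sharing the same first field."""
    values = []
    rows = (ln.split() for ln in lines if ln.strip())  # skip blank lines
    # groupby only groups adjacent items, which is what "consecutive" needs
    for _key, group in groupby(rows, key=lambda fields: fields[0]):
        run = list(group)
        if len(run) >= 2:
            # lines with no second field are silently ignored
            values.extend(f[1] for f in run if len(f) > 1)
    return values

sample = """=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3""".splitlines()

print(consecutive_runs(sample))  # ['abc125', 'abc126']
```

Because `groupby` never merges non-adjacent lines, the isolated =035 lines (abc123, abc124) each form a run of length one and are skipped.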

Try:

awk '/035/ { f++; s=s" "$2; } !/035/ {f=0;s="";} f==2 { print s; }' file

Hi Otheus & Dennis

Thanks a lot!!! It works as expected. Once again, thanks for the timely help.

Regards
Uma

I almost feel good about this. I still have no idea what umapearl really wanted :slight_smile:

:wink: That's funny... :smiley:

umapearl is looking for consecutive lines containing 035 and taking the 2nd column of them. [Or have I got it wrong??? :confused:]

Hi Otheus & Denny,

Yes, Denny is right :)

I think a little amendment is required; I don't know if it's possible.

=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3
=035 abc127
=035 abc128
=035 abc129
=245 this is testing 4

Here it is extracting abc125, abc126, abc127, abc128 and abc129, but it should not extract abc126 and abc129 because each of them is followed by a =245 line.

Appreciate your time

Regards
uma

So it should print lines only if followed by the same starting sequence?

Thanks for the reply. Yes, you are right: it should check whether the following line is also =035 and, if so, print the second column value; if the line is followed by =245, it should skip it. Please guide.
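The requirement as restated here (print the second column of a =035 line only when the next line is also =035) can be handled one line at a time by remembering just the previous line. A rough Python sketch, where the function name `followed_by_same` and the inline sample are assumptions for illustration:

```python
def followed_by_same(lines, tag="=035"):
    """Yield the second field of each `tag` line whose next line also
    starts with `tag`."""
    prev_value = None  # second field of the previous line, if it was a tag line
    for line in lines:
        fields = line.split()
        is_tag = bool(fields) and fields[0] == tag
        if is_tag and prev_value is not None:
            # the previous tag line is followed by another tag line
            yield prev_value
        prev_value = fields[1] if is_tag and len(fields) > 1 else None

sample = """=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3
=035 abc127
=035 abc128
=035 abc129
=245 this is testing 4""".splitlines()

print(list(followed_by_same(sample)))  # ['abc125', 'abc127', 'abc128']
```

Since the function accepts any iterable of lines, an open file object can be passed in directly, so a huge file never has to be held in memory all at once.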

Regards
uma

This does what you want (I think...)

$ perl -0777 -nle 'print "\nBefore:\n$_\n\n"; s/^(.*)/\1\n/sg; s/=035\s+.*?\n=245.*?\n//g; print "Eliminate pattern:\n$_\n\n"; while (/=035\s+(.*)/g) {print "$1\n";}' data4.txt
Before:
=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3
=035 abc127
=035 abc128
=035 abc129
=245 this is testing 4

Eliminate pattern:
=245 this is testing
=035 abc125
=035 abc127
=035 abc128


abc125
abc127
abc128

diff urfile <(uniq -f 2 urfile)
7d6
< =035 abc126
10,11d8
< =035 abc128
< =035 abc129

Hope this will help.

awk '!/035/ {f=0;s="";}f{ print s} /035/ {f++; s=$NF; } ' file

Example:

Hi Drewk, thanks a lot, it works perfectly. Can you also help me understand the code logic?

Hi Dennis, thanks. When I run your code it is more or less correct, but it extracts a few others that are followed by =245; this happens especially at the start and at the end.

Hi rdcwayx, sorry, diff is not working.

Glad it worked. It is not that hard once you get the right regex pattern.

Here is a simpler version that is easier to understand:

$ cat data4.txt
=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3
=035 abc127
=035 abc128
=035 abc129
=245 this is testing 4

$ perl -0777 -nle 'while (/=035\s+(.*)\n(?!=245)/g) { print "$1\n"; }' data4.txt
abc125
abc127
abc128
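The pattern relies on a negative lookahead, `(?!=245)`, which Python's `re` module also supports, so the same idea can be tried outside Perl. This is only a sketch: the name `PATTERN` and the inline sample text are mine, and like the Perl version it assumes the whole input fits in memory.

```python
import re

# Same regex as the Perl one-liner: capture the rest of a =035 line, but only
# if the text right after the newline is NOT "=245".
PATTERN = re.compile(r"=035\s+(.*)\n(?!=245)")

text = """=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3
=035 abc127
=035 abc128
=035 abc129
=245 this is testing 4
"""

print(PATTERN.findall(text))  # ['abc125', 'abc127', 'abc128']
```

One caveat with this pattern: a =035 line at the very end of the input would also match, because nothing (and therefore no =245) follows its newline.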

So two things to note:

perl -0777 -nle 'while (/=035\s+(.*)\n(?!=245)/g) { print "$1\n"; }' data4.txt

1) The invocation of perl with -0777 means slurp the whole file. This means the entire file will be in memory since you are referring to multiple lines. You could write something that will read multiple lines, but that is more complex logic. Perl can handle very big files this way, but nonetheless, it may fail with really huge files...

2) Note the regex "/=035\s+(.*)\n(?!=245)/g" used in the while loop. Here are the details:

"=035\s+" matches =035 followed by one or more whitespace characters, up to the first character that is not whitespace;
"(.*)" captures the remainder of the line, up to the \n;
"\n" matches the end of line;
"(?!=245)" is a 'zero-width negative lookahead assertion'. In plain English, that means 'don't match the preceding part if the text that comes next is =245';
"g" means match globally, i.e. find every occurrence of the pattern rather than just the first.

In the last post I did it quickly, which usually means more sloppily. That one first printed the input, then deleted every match of a line with =035.* followed by a line with =245.*, and then printed the remaining =035 lines. I did it stepwise instead of in one sweep...
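For comparison, the stepwise approach (delete the unwanted pairs first, then collect what is left) might be sketched in Python like this. The function name `stepwise` and the sample text are hypothetical, and the regexes are carried over from the earlier Perl command.

```python
import re

def stepwise(text):
    """Two-pass version: drop the unwanted pairs, then collect what is left."""
    if not text.endswith("\n"):
        text += "\n"  # make sure the last line is newline-terminated
    # Pass 1: delete each =035 line that is immediately followed by a =245
    # line (the =245 line is deleted along with it).
    remaining = re.sub(r"=035\s+.*?\n=245.*?\n", "", text)
    # Pass 2: collect the value of every =035 line that survived.
    return re.findall(r"=035\s+(.*)", remaining)

text = """=245 this is testing
=035 abc123
=245 this is testing1
=035 abc124
=245 this is testing2
=035 abc125
=035 abc126
=245 this is testing3
=035 abc127
=035 abc128
=035 abc129
=245 this is testing 4"""

print(stepwise(text))  # ['abc125', 'abc127', 'abc128']
```

The result is the same as with the lookahead version; the trade-off is an intermediate copy of the text in exchange for two simpler patterns.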

I cannot overemphasize how easy this becomes if you use a regex tool.

Try this: Regex Powertoy (interactive regular expressions) or this RegExr: Online Regular Expression Testing Tool

With either one, you can just play with patterns on your sample text until they do what you expect. There >>can<< be some bugs (for example, the gskinner one does not handle negative lookahead or lookbehind assertions properly), but it sure beats scratching your head...

Cheers,

Hi Drewk,

Thanks for the detailed explanation. The tool you referred to is very handy, and I am trying to learn from it. I appreciate your guidance. Very helpful.

Regards
uma

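# Read the input file into an array, one line per element
# (array indexes start at 0).
c=0
while IFS= read -r line
do
   array[$c]=$line
   c=`expr $c + 1`
done < /tmp/inputfile      # c is now the number of elements in the array

i=0
last=`expr $c - 1`

while [ $i -lt $last ]
do
   j=`expr $i + 1`
   v1=`echo "${array[$i]}" | cut -d" " -f1`
   v2=`echo "${array[$j]}" | cut -d" " -f1`

   # Same first field on two consecutive lines: flag the first of them
   if [ "$v1" = "$v2" ]
       then
             array[$i]="F.${array[$i]}"
   fi
   i=`expr $i + 1`
done

i=0

while [ $i -lt $c ]
do
   x=`echo "${array[$i]}" | cut -c1`

   # Lines that were not flagged with F go to newfile
   if [ "$x" != "F" ]
     then
           echo "${array[$i]}" >> newfile
   fi
   i=`expr $i + 1`
done

cat newfile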

See if the code given above works.

Here I am trying to store each line of the input file in an array.
c is the number of array elements (i.e. the number of lines in the input file).
I am checking whether two consecutive lines have the same value in the first field. If so, I set F as the first character of the first of the two lines and store it back in the array.

Later, in another loop, I check whether the first character of a line is not F (meaning that the line has not been repeated). If so, I write it to newfile.

newfile is the required output file.

Thanks a ton, very helpful and informative

Regards
Uma

Agreed. I think everyone should use these types of regex tools. I do, and they save a lot of time. I also use a "fat client" version on my XP machine. It is really good and stores prior patterns, explains things, etc.

Here's some Python code :slight_smile:

#!/usr/bin/env python
# Print every line whose first field matches the first field of the next line.
previous_line = []
fob = open('question.temp', 'r')
while 1:
    linea = fob.readline()
    present_line = linea.split()
    try:
        if present_line[0] == previous_line[0]:
            print(' '.join(previous_line))
    except IndexError:
        # first line, blank line, or end of file: nothing to compare
        pass

    previous_line = present_line
    if not linea:  # readline() returns '' only at end of file
        break

fob.close()