Extract the lines from input file

sandy1028 · July 8, 2010, 6:10am

This is the sample input file

b    05/Jul/2010:07:00:10
a    05/Jul/2010:06:00:10
b    05/Jul/2010:07:00:10
c    05/Jul/2010:07:10:10
d    05/Jul/2010:08:00:10
e    05/Jul/2010:09:00:10
f    05/Jul/2010:10:00:10
h    05/Jul/2010:11:00:10
i    05/Jul/2010:12:00:10
j    05/Jul/2010:13:00:10
k    05/Jul/2010:14:00:10
l    05/Jul/2010:15:00:10
m    05/Jul/2010:16:00:10
n    05/Jul/2010:17:00:10
o    05/Jul/2010:18:00:10
p    05/Jul/2010:19:00:10
q    05/Jul/2010:20:00:10
a    05/Jul/2010:21:00:10
b    05/Jul/2010:22:00:10
v    05/Jul/2010:23:00:10
g    06/Jul/2010:01:00:10
k    06/Jul/2010:02:00:10
i    06/Jul/2010:03:00:10
j    06/Jul/2010:04:00:10
k    06/Jul/2010:05:00:10
l    06/Jul/2010:06:00:10
m    06/Jul/2010:07:00:10
n    06/Jul/2010:08:00:10
n    06/Jul/2010:09:00:10

I have a file. How can I extract only the lines whose timestamps fall between 05/Jul/2010:07 and 06/Jul/2010:08?

I mean that lines such as 05/Jul/2010:06:00:10 and 06/Jul/2010:09:00:10 should not be in the output file.

Everything from 5th July after 7:00 AM until 8:00 AM the next day should be in the output file,
where 05 is the date three days ago and 06 is the date two days ago.

The input file is very large. How can I produce the output file as fast as possible?

pravin27 · July 8, 2010, 6:42am

Hi, Try this

sort -k2 filename| sed -n '/05\/Jul\/2010:07:00:10/,/06\/Jul\/2010:08:00:10/p'

sandy1028 · July 8, 2010, 7:13am

Can you please explain how it works?

I want to delete all the lines whose timestamp does not fall between 05/Jul/2010:07 and 06/Jul/2010:08.

---------- Post updated at 06:13 AM ---------- Previous update was at 06:00 AM ----------

The file contains 19739530 lines. Is sorting expensive? Is there any other way?

pravin27 · July 8, 2010, 7:28am

Hi,
1) Sort the file on the 2nd column
2) Use sed to select the range of lines by pattern

sort -k2 filename| sed -n '/05\/Jul\/2010:07:00/,/06\/Jul\/2010:08:00/p'
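
To illustrate how the range address behaves, here is a minimal sketch on a few already-sorted lines from the sample. The function names sample and extract are made up for this example. With -n, sed prints nothing by default; the p command then prints every line from the first one matching the start pattern through the next one matching the end pattern.

```shell
# sample: a few already-sorted lines taken from the input above (hypothetical helper)
sample() {
  printf '%s\n' \
    'a    05/Jul/2010:06:00:10' \
    'b    05/Jul/2010:07:00:10' \
    'c    05/Jul/2010:07:10:10' \
    'n    06/Jul/2010:08:00:10' \
    'n    06/Jul/2010:09:00:10'
}

# extract: print from the first line matching the start pattern
# through the next line matching the end pattern (hypothetical helper)
extract() {
  sed -n '/05\/Jul\/2010:07:00/,/06\/Jul\/2010:08:00/p'
}

sample | extract
```

This prints the b, c and 08:00 lines. If no line matches the end pattern, sed keeps printing to the end of the input; if no line matches the start pattern, nothing is printed.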

---------- Post updated at 07:28 AM ---------- Previous update was at 07:15 AM ----------

Hi,

You may want to use the -T <directory> option to force sort to use a scratch directory that has enough free space. By default it uses TMPDIR, usually /tmp or /var/tmp.

If you need to sort the file quickly, use the -y option with no argument. This starts sort with the maximum free memory allowed or available. (-y comes from Solaris sort; with GNU sort, -S SIZE sets the buffer size.)

sort -y -T /dir/having/enough_free_space/ -k2 test | sed -n '/05\/Jul\/2010:07:00/,/06\/Jul\/2010:08:00/p'

sandy1028 · July 9, 2010, 4:46am

Thanks,

If the input file is

b    05/Jul/2010:07:00:10
a    05/Jul/2010:06:00:09
b    05/Jul/2010:07:00:10
c    05/Jul/2010:07:10:16
d    05/Jul/2010:08:00:10
e    05/Jul/2010:09:00:10
f    05/Jul/2010:10:00:10
h    05/Jul/2010:11:00:10
i    05/Jul/2010:12:00:20
j    05/Jul/2010:13:00:10
k    05/Jul/2010:14:00:10
l    05/Jul/2010:15:00:30
m    05/Jul/2010:16:00:10
n    05/Jul/2010:17:00:10
o    05/Jul/2010:18:00:10
p    05/Jul/2010:19:00:40
q    05/Jul/2010:20:00:10
a    05/Jul/2010:21:00:10
b    05/Jul/2010:22:00:50
v    05/Jul/2010:23:00:20
g    06/Jul/2010:01:00:10
k    06/Jul/2010:02:00:10
i    06/Jul/2010:03:00:14
j    06/Jul/2010:04:00:10
k    06/Jul/2010:05:00:18
l    06/Jul/2010:06:00:10
m    06/Jul/2010:07:00:10
n    06/Jul/2010:08:00:19
n    06/Jul/2010:09:00:10

After sorting the file, I want all the lines from 05/Jul/2010:06:00:00 to 06/Jul/2010:10:00:00 written to another file.

gaithrit · July 9, 2010, 5:00am

If the data file contains the data below:

b    05/Jul/2010:07:00:10
a    05/Jul/2010:06:00:09
b    05/Jul/2010:07:00:10
c    05/Jul/2010:07:10:16
d    05/Jul/2010:08:00:10
e    05/Jul/2010:09:00:10
f    05/Jul/2010:10:00:10
h    05/Jul/2010:11:00:10
i    05/Jul/2010:12:00:20
j    05/Jul/2010:13:00:10
k    05/Jul/2010:14:00:10
l    05/Jul/2010:15:00:30
m    05/Jul/2010:16:00:10
n    05/Jul/2010:17:00:10
o    05/Jul/2010:18:00:10
p    05/Jul/2010:19:00:40
q    05/Jul/2010:20:00:10
a    05/Jul/2010:21:00:10
b    05/Jul/2010:22:00:50
v    05/Jul/2010:23:00:20
g    06/Jul/2010:01:00:10
k    06/Jul/2010:02:00:10
i    06/Jul/2010:03:00:14
j    06/Jul/2010:04:00:10
k    06/Jul/2010:05:00:18
l    06/Jul/2010:06:00:10
m    06/Jul/2010:07:00:10
n    06/Jul/2010:08:00:19
n    06/Jul/2010:09:00:10

then try the command below:

 
 sed -n '/05\/Jul\/2010:06:00:09/,/06\/Jul\/2010:09:00:10/ p' data > output

This will redirect the output lines to file "output". Hope this helps.

sandy1028 · July 9, 2010, 8:02am

sed -n '/[5\/Jul\/2010:06:00:00/,/[6\/Jul\/2010:09:00:00/p' soredfile.tsv > output
sed: -e expression #1, char 50: unterminated address regex

b    [5/Jul/2010:07:00:10
a    [5/Jul/2010:06:00:09
b    [5/Jul/2010:07:00:10
c    [5/Jul/2010:07:10:16
d    [5/Jul/2010:08:00:10
e    [5/Jul/2010:09:00:10
f    [5/Jul/2010:10:00:10
h    [5/Jul/2010:11:00:10
i    [5/Jul/2010:12:00:20
j    [5/Jul/2010:13:00:10
k    [5/Jul/2010:14:00:10
l    [5/Jul/2010:15:00:30
m    [5/Jul/2010:16:00:10
n    [5/Jul/2010:17:00:10
o    [5/Jul/2010:18:00:10
p    [5/Jul/2010:19:00:40
q    [5/Jul/2010:20:00:10
a    [5/Jul/2010:21:00:10
b    [5/Jul/2010:22:00:50
v    [5/Jul/2010:23:00:20
g    [6/Jul/2010:01:00:10
k    [6/Jul/2010:02:00:10
i    [6/Jul/2010:03:00:14
j    [6/Jul/2010:04:00:10
k    [6/Jul/2010:05:00:18
l    [6/Jul/2010:06:00:10
m    [6/Jul/2010:07:00:10
n    [6/Jul/2010:08:00:19
n    [6/Jul/2010:09:00:10

---------- Post updated at 04:47 AM ---------- Previous update was at 04:13 AM ----------

Thanks, this worked fine.

Another problem:

I have the dates in variables as 20100705 and 20100706.

How can I convert them to 5/Jul/2010 and 6/Jul/2010 and pass them as variables to the sed command?

sort -k2 tmp.tsv | sed -n '/\[5\/Jul\/2010:06:00:00/,/\[6\/Jul\/2010:09:00:00/p' > tab.tsv

---------- Post updated at 07:02 AM ---------- Previous update was at 04:47 AM ----------

How do I pass the variables
$date1
$date2
where $date1 is '6\/Jul\/2010'
and $date2 is '7\/Jul\/2010'?
sort -k4 tmp.tsv | sed -n '/\[$date1:06:00:00/,/\[$date2:09:00:00/p' > tab.tsv

pravin27 · July 11, 2010, 1:44am

Hi, Try this...

date1=`date -d "20100705" '+%d\/%b\/%Y' | cut -c2-`
date2=`date -d "20100706" '+%d\/%b\/%Y' | cut -c2-` 

sort -y -T /dir/having/enough_free_space/ -k2 test | sed -n "/$date1:06:/,/$date2:09:/p"
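
As an alternative sketch that needs no external date command: the helper to_log_date below is hypothetical (not from this thread) and assumes the variable always holds a valid 8-digit YYYYMMDD value. It also works for two-digit days, where trimming the first character with cut would not. The sed call uses double quotes so the shell expands the variables, and \%...% as the regex delimiter so the slashes in the dates need no escaping. The four input lines are made-up sample data in the [5/Jul/2010 format.

```shell
# Hypothetical helper: YYYYMMDD -> D/Mon/YYYY, day without a leading zero
to_log_date() {
  ymd=$1
  year=${ymd%????}     # first four characters
  mmdd=${ymd#????}     # last four characters
  month=${mmdd%??}
  day=${mmdd#??}
  day=${day#0}         # 05 -> 5, 31 stays 31
  case $month in
    01) name=Jan ;; 02) name=Feb ;; 03) name=Mar ;; 04) name=Apr ;;
    05) name=May ;; 06) name=Jun ;; 07) name=Jul ;; 08) name=Aug ;;
    09) name=Sep ;; 10) name=Oct ;; 11) name=Nov ;; 12) name=Dec ;;
    *) return 1 ;;     # not a valid month
  esac
  printf '%s/%s/%s\n' "$day" "$name" "$year"
}

date1=$(to_log_date 20100705)   # 5/Jul/2010
date2=$(to_log_date 20100706)   # 6/Jul/2010

# Double quotes let the shell expand $date1 and $date2;
# \%...% selects % as the regex delimiter, \[ matches a literal bracket.
printf '%s\n' \
  'a    [5/Jul/2010:05:00:09' \
  'b    [5/Jul/2010:06:00:10' \
  'n    [6/Jul/2010:09:00:10' \
  'z    [6/Jul/2010:10:00:10' |
  sed -n "\%\[$date1:06:%,\%\[$date2:09:%p"
```

On this sample only the b and n lines are printed. The input still has to be in chronological order for the range to make sense.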

guruprasadpr · July 11, 2010, 4:12am

Hi,
Here is a solution without sort, since it's a huge file:

File containing the reference dates:

$ cat text
05/Jul/2010:07
06/Jul/2010:08

Script:

$ cat try

#!/usr/bin/ksh

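  awk -F "[:/]" '
  BEGIN{
    # Map month names to two-digit numbers
    a["Jan"]="01";
    a["Feb"]="02";
    a["Mar"]="03";
    a["Apr"]="04";
    a["May"]="05";
    a["Jun"]="06";
    a["Jul"]="07";
    a["Aug"]="08";
    a["Sep"]="09";
    a["Oct"]="10";
    a["Nov"]="11";
    a["Dec"]="12";
    # Read the two bounds from "text"; the fields are day, month, year, hour
    getline < "text"
    date=int($3 a[$2] $1 $4 "00")
    getline < "text"
    date1=int($3 a[$2] $1 $4 "00")
   }{
     # Build YYYYMMDDHHMM so the comparison stays chronological across months and years
     x=int($3 a[$2] substr($1,length($1)-1) $4 $5)
     if (x >=date && x<=date1)
       print;
  }' infile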

where infile is your input file

On running

$./try

b    05/Jul/2010:07:00:10
b    05/Jul/2010:07:00:10
c    05/Jul/2010:07:10:10
d    05/Jul/2010:08:00:10
e    05/Jul/2010:09:00:10
f    05/Jul/2010:10:00:10
h    05/Jul/2010:11:00:10
i    05/Jul/2010:12:00:10
j    05/Jul/2010:13:00:10
k    05/Jul/2010:14:00:10
l    05/Jul/2010:15:00:10
m    05/Jul/2010:16:00:10
n    05/Jul/2010:17:00:10
o    05/Jul/2010:18:00:10
p    05/Jul/2010:19:00:10
q    05/Jul/2010:20:00:10
a    05/Jul/2010:21:00:10
b    05/Jul/2010:22:00:10
v    05/Jul/2010:23:00:10
g    06/Jul/2010:01:00:10
k    06/Jul/2010:02:00:10
i    06/Jul/2010:03:00:10
j    06/Jul/2010:04:00:10
k    06/Jul/2010:05:00:10
l    06/Jul/2010:06:00:10
m    06/Jul/2010:07:00:10
n    06/Jul/2010:08:00:10

Guru.

Scrutinizer · July 11, 2010, 5:33am

pravin27:

Hi, Try this...

date1=`date -d "20100705" '+%d\/%b\/%Y' | cut -c2-`
date2=`date -d "20100706" '+%d\/%b\/%Y' | cut -c2-`

sort -y -T /dir/having/enough_free_space/ -k2 test | sed -n "/$date1:06:/,/$date2:09:/p"

IMO this isn't going to work: it is only a coincidence that an alphabetic sort gives the right order in this particular example. The lines would need to be sorted by date, which is not trivial. Also, the sed command only works if both the begin stamp and the end stamp happen to be present in the file, which is not guaranteed.
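
A small illustration of the sorting problem, using two made-up lines that cross a month boundary (the function name naive_sort is only for this example). A plain sort on the second column compares the text character by character, so the day number wins over the month:

```shell
# naive_sort: the plain textual sort on column 2 (hypothetical helper)
naive_sort() {
  LC_ALL=C sort -k2
}

# 31 July is earlier than 1 August, but "01" sorts before "31" as text
printf '%s\n' \
  'x    31/Jul/2010:23:00:00' \
  'y    01/Aug/2010:01:00:00' | naive_sort
```

The 01/Aug line comes out first, although it is the later timestamp.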

---------- Post updated at 11:33 ---------- Previous update was at 10:16 ----------

Try a script like this, which does not call any external programs:

getdate()
{
  IFS="$IFS:/"
  set -- $*
  case $2 in
    Jan) mon=01;;
    Feb) mon=02;;
    Mar) mon=03;;
    Apr) mon=04;;
    May) mon=05;;
    Jun) mon=06;;
    Jul) mon=07;;
    Aug) mon=08;;
    Sep) mon=09;;
    Oct) mon=10;;
    Nov) mon=11;;
    Dec) mon=12;;
  esac
  echo $3$mon$1$4$5$6
}
var1=20100705
var2=20100706
startdate=${var1}070000
enddate=${var2}080000

while read line
do
  stamp=${line##*[[:space:]]}
  linedate=$(getdate "$stamp")
  if [ $startdate -le $linedate ] && [ $enddate -gt $linedate ]; then
    printf "%s\n" "$line"
  fi
done <infile >outfile

and see if it is fast enough. Otherwise it will need to be awkified....
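
One possible awk version of the same idea, as a sketch only. It assumes the timestamp is the last whitespace-separated field and has the form dd/Mon/yyyy:HH:MM:SS; the function name filter_range and the variables lo and hi are made up here. The bounds are passed as 14-digit YYYYMMDDHHMMSS strings, start inclusive and end exclusive as in the loop above, and compared as text, which is safe because every key has the same width.

```shell
# filter_range LO HI: copy stdin lines with LO <= timestamp < HI (hypothetical helper)
# LO and HI are 14-digit strings in YYYYMMDDHHMMSS form.
filter_range() {
  awk -v lo="$1" -v hi="$2" '
    BEGIN {
      # Force string comparison and build the month-name lookup table
      lo = lo ""; hi = hi ""
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", name, " ")
      for (i = 1; i <= n; i++) mon[name[i]] = sprintf("%02d", i)
    }
    {
      # t[1]=day t[2]=month name t[3]=year t[4]=hour t[5]=minute t[6]=second
      if (split($NF, t, "[/:]") != 6) next   # skip lines without a full timestamp
      key = t[3] mon[t[2]] sprintf("%02d", t[1]) t[4] t[5] t[6]
      if (key >= lo && key < hi) print
    }'
}

# Example on four lines from the sample input
printf '%s\n' \
  'a    05/Jul/2010:06:00:10' \
  'b    05/Jul/2010:07:00:10' \
  'm    06/Jul/2010:07:00:10' \
  'n    06/Jul/2010:08:00:10' | filter_range 20100705070000 20100706080000
```

This reads the file once, needs no sort, and does not depend on the begin or end stamp being present. For the real file it could be called as filter_range 20100705070000 20100706080000 <infile >outfile.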