Filtering unique records

I need to pull data based on Columns 1, 2, and 3: for each value in Column 1 with the same date, pull only the first line for each hour and ignore the other lines in that hour. When the hour changes, pull the first line of the new hour.

The output should contain only lines 1 2 3 4 5 6 10 12 13 14 19 20 21 22:

Column 1 Column 2 Column 3
A 2020-06-09 7:12:22
A 2020-06-03 18:26:07
A 2020-05-16 21:52:12
A 2020-05-16 22:37:18
B 2020-05-19 3:11:05
C 2020-06-06 21:14:44
C 2020-06-06 21:14:44
C 2020-06-06 21:14:46
C 2020-06-06 21:14:48
C 2020-06-06 22:16:50
C 2020-06-06 22:16:55
C 2020-06-06 0:16:50
C 2020-06-06 1:16:50
C 2020-06-12 22:20:00
C 2020-06-12 22:20:00
C 2020-06-12 22:20:03
C 2020-06-12 22:20:03
C 2020-06-12 22:20:04
D 2020-06-03 23:26:07
D 2020-05-16 21:52:12
D 2020-05-16 22:37:18
D 2020-05-16 23:39:18

Nice of you to create a table with markdown, but it would also be easier to see the actual records without the markdown table; just use markdown code tags.
I'm listing the data here; correct it if the actual data looks different:

A 2020-06-09 7:12:22
A 2020-06-03 18:26:07
A 2020-05-16 21:52:12
A 2020-05-16 22:37:18
B 2020-05-19 3:11:05
C 2020-06-06 21:14:44
C 2020-06-06 21:14:44
C 2020-06-06 21:14:46
C 2020-06-06 21:14:48
C 2020-06-06 22:16:50
C 2020-06-06 22:16:55
C 2020-06-06 0:16:50
C 2020-06-06 1:16:50
C 2020-06-12 22:20:00
C 2020-06-12 22:20:00
C 2020-06-12 22:20:03
C 2020-06-12 22:20:03
C 2020-06-12 22:20:04
D 2020-06-03 23:26:07
D 2020-05-16 21:52:12
D 2020-05-16 22:37:18
D 2020-05-16 23:39:18

Have you tried anything on your end, and if so, where exactly are you stuck?
There's a slew of similar questions in these forums and on the web dealing with "uniqueness" of one sort or another. Have you searched anywhere for hints?
Usually it's a one-liner, and it is in this case as well...

The actual data looks the same, except that instead of A B C there are names.

ok, how can we help?

I am not sure how to get the info; I tried using awk and grep but it's not working.

Please share what you've tried so far so we can see your train of thought...
Have you searched for similar questions as previously suggested?


awk is perfect for this. Form a key from $2 (field 2) and from $3 (the string left of the :).
Then compare the key with the key from the previous line...


... or change the FS to also include :

I was still not able to get the required information. I was able to get it for each hour, but I was not able to make the query work per day.

cat final4 | awk '{ print $1$2 }' | cut -d: -f1-2 | sort | uniq

I still don't quite understand what the above means, but to ease the OP's path forward:

awk -F'[ :]+' '!a[$1,$2,$3]++' final4 

where the final4 file is your data sample from the post, results in:

A 2020-06-09 7:12:22
A 2020-06-03 18:26:07
A 2020-05-16 21:52:12
A 2020-05-16 22:37:18
B 2020-05-19 3:11:05
C 2020-06-06 21:14:44
C 2020-06-06 22:16:50
C 2020-06-06 0:16:50
C 2020-06-06 1:16:50
C 2020-06-12 22:20:00
D 2020-06-03 23:26:07
D 2020-05-16 21:52:12
D 2020-05-16 22:37:18
D 2020-05-16 23:39:18
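
For anyone puzzled by the -F'[ :]+' part: it sets the field separator to a run of spaces and/or colons, so the hour lands in $3 by itself while minutes and seconds fall into $4 and $5. A quick sanity check from the shell (any POSIX awk should do):

```shell
# With both spaces and colons as separators, the fields become:
# $1=name  $2=date  $3=hour  $4=minute  $5=second
echo 'C 2020-06-06 21:14:44' | awk -F'[ :]+' '{ print $1, $2, $3 }'
# C 2020-06-06 21
```

The condition !a[$1,$2,$3]++ is then true only the first time a given (name, date, hour) combination appears, and awk's default action prints that line.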

Thanks a lot, it's working!

The array a[] holds the keys as indexes and numbers as values, all in memory.
If your file is long, it makes sense to save some memory by not storing a value:

awk -F'[ :]+' '{ key=($1 SUBSEP $2 SUBSEP $3) } !(key in a) { a[key]; print }' final4

If your file is always sorted, then you can store only the key from the previous line (rather than all keys in an array):

awk -F'[ :]+' '{ key=($1 SUBSEP $2 SUBSEP $3) } key!=prevkey { prevkey=key; print }' final4
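
To see the one-liner in action without touching your real data, here is a self-contained check on a three-line sample (two records sharing the same name, date, and hour, and one in a new hour):

```shell
# Keep only the first line per (name, date, hour) combination.
printf '%s\n' \
  'C 2020-06-06 21:14:44' \
  'C 2020-06-06 21:14:46' \
  'C 2020-06-06 22:16:50' |
awk -F'[ :]+' '!a[$1,$2,$3]++'
# C 2020-06-06 21:14:44
# C 2020-06-06 22:16:50
```

Note that the sorted-input variant above gives the same result only when lines with the same key are adjacent, as they are here; on unsorted input, use the array-based version.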
