Check/print missing number in a consecutive range and remove duplicate numbers

Hi,

In an ideal scenario, I will have a listing of the db transaction logs that get copied to a DR site, and if I have them all, they will be numbered consecutively like below.

1_79811_01234567.arc
1_79812_01234567.arc
1_79813_01234567.arc
1_79814_01234567.arc
1_79815_01234567.arc
2_86754_01234567.arc
2_86755_01234567.arc
2_86756_01234567.arc
2_86757_01234567.arc
2_86758_01234567.arc
3_82692_01234567.arc
3_82693_01234567.arc
3_82694_01234567.arc
3_82695_01234567.arc
3_82696_01234567.arc

There will be some scenarios where files are not copied for some reason, a network failure for example, so there will be gaps in the list of files.

So the list above may look something like below, where there are gaps in what should be a consecutive list.

1_79811_01234567.arc
1_79812_01234567.arc
1_79815_01234567.arc
2_86754_01234567.arc
2_86755_01234567.arc
2_86757_01234567.arc
2_86758_01234567.arc
3_82692_01234567.arc
3_82694_01234567.arc
3_82696_01234567.arc

Does anyone know a quick way of checking what the missing numbers in the consecutive range are?

At the moment, what I am doing is cutting the list into 3 separate lists based on the first character. The first field is the transaction group, the second field is the transaction number, and the third field is the db id, which is a constant.

Then I read each new list, set the first number as a 'base', increment it by 1, assign it to a variable, and compare that number with what I read next. If they don't match, then I print that as the missing number or gap. It is a very long, tedious process.

I am hoping someone knows a trick for checking what the missing numbers in the consecutive range are and printing them.

I don't want to insert the missing number into the existing list; I will be redirecting it to an exception list so I know which transaction logs are missing and will have to be re-copied.
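For what it's worth, the base-and-increment loop described above can be done in a single awk pass over the sorted listing, with the missing names going straight to the exception list. This is just a sketch, assuming the same group_sequence_dbid name layout as the examples (the file names below are made up for the demonstration):

```shell
# Rebuild a small listing with gaps so the example is self-contained.
printf '%s\n' 1_79811_01234567.arc 1_79812_01234567.arc \
              1_79815_01234567.arc 2_86754_01234567.arc \
              2_86756_01234567.arc > listing.txt

# Sort by group then sequence, then remember the last sequence seen
# per group; any jump bigger than 1 prints the missing names.
sort -t_ -k1,1n -k2,2n listing.txt |
awk -F'_' '
    ($1 in prev) {                            # same group seen before
        for (n = prev[$1] + 1; n < $2; n++)   # fill in the gap, if any
            print $1 "_" n "_" $3
    }
    { prev[$1] = $2 }
' > exception_list
cat exception_list
```

Because it keys on the group field, there is no need to split the listing into 3 separate files first.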

Also, in some instances I will have a listing where there are duplicates, like below.

1_79811_01234567.arc
1_79812_01234567.arc
1_79812_01234567.arc
1_79813_01234567.arc
1_79812_01234567.arc
1_79814_01234567.arc
1_79815_01234567.arc
2_86754_01234567.arc
2_86756_01234567.arc
2_86755_01234567.arc
2_86756_01234567.arc
2_86757_01234567.arc
2_86756_01234567.arc
2_86758_01234567.arc
3_82692_01234567.arc
3_82692_01234567.arc
3_82693_01234567.arc
3_82694_01234567.arc
3_82695_01234567.arc
3_82694_01234567.arc
3_82696_01234567.arc

This is where the database log was sent to the DR site multiple times. Is there a way to check which lines are duplicated and how many copies of each there are?

For example, from the latest listing above, 1_79812_01234567.arc has 3 entries, 2_86756_01234567.arc has 3 entries, 3_82694_01234567.arc has 2 entries and so on.

Any advice will be much appreciated. Thanks in advance.

I'm afraid there's no "quick way" to do what you request. It has to be done as you describe, line by line, value by value. Why don't you post your attempt to be discussed, analysed, and hopefully improved? And, post the desired output for problem 2.

By the way, a similar problem has been solved here.

And, for your second problem, how about

sort file | uniq -c
      1 1_79811_01234567.arc
      3 1_79812_01234567.arc
      1 1_79813_01234567.arc
      1 1_79814_01234567.arc
      1 1_79815_01234567.arc
      1 2_86754_01234567.arc
      1 2_86755_01234567.arc
      3 2_86756_01234567.arc
      1 2_86757_01234567.arc
      1 2_86758_01234567.arc
      2 3_82692_01234567.arc
      1 3_82693_01234567.arc
      2 3_82694_01234567.arc
      1 3_82695_01234567.arc
      1 3_82696_01234567.arc
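If only the duplicated names are wanted, the count column from uniq -c can be filtered, or uniq -d used directly. A small self-contained sketch (the listing is rebuilt here as dup_list):

```shell
# Rebuild a short listing containing one triplicated name.
printf '%s\n' 1_79811_01234567.arc 1_79812_01234567.arc \
              1_79812_01234567.arc 1_79813_01234567.arc \
              1_79812_01234567.arc > dup_list

# Keep only the uniq -c lines whose count exceeds 1.
sort dup_list | uniq -c | awk '$1 > 1'

# uniq -d prints each duplicated name once, without the count.
sort dup_list | uniq -d
```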

To look for 'missing' files, could you get the list of the files you have into a temporary file and then generate another that contains the names you think you should have? You can then do the following:-

grep -vf  found_files  expected_files

You have to be careful that you match exactly between the two files, so if you expect to have a file that is a123 and another that is a12345, then searching in this way will not report that a12345 is missing when a123 is present.

If this is a concern, build found_files (the pattern file) to be like this:-

^a123$
^a12345$

This will match the string and anchor the ends to beginning and end of line so you get a complete match.

Does that help at all? It's good to share and we might be able to suggest some improvements.

Robin

Hi all,

I've uploaded the script that I am using at the moment, along with some test data. It works as I intend it to; I just thought maybe it can be improved somehow. All the entries marked FAILED are the ones missing from the consecutive series.

I didn't know I could use sort | uniq -c to check for duplicates, thanks to RudiC.

I'm also checking whether the linked thread can be used instead.

$ ./x.ksh

-------------------------------------------------
- Running check_gap on test_clean.txt ...
-------------------------------------------------

- [ test_clean.txt.1.uniq ] ... starting gap check
   ... = Checking for 79811 => PASSED
   ... = Checking for 79812 => PASSED
   ... = Checking for 79813 => PASSED
   ... = Checking for 79814 => PASSED
   ... = Checking for 79815 => PASSED

- [ test_clean.txt.2.uniq ] ... starting gap check
   ... = Checking for 86754 => PASSED
   ... = Checking for 86755 => PASSED
   ... = Checking for 86756 => PASSED
   ... = Checking for 86757 => PASSED
   ... = Checking for 86758 => PASSED

- [ test_clean.txt.3.uniq ] ... starting gap check
   ... = Checking for 82692 => PASSED
   ... = Checking for 82693 => PASSED
   ... = Checking for 82694 => PASSED
   ... = Checking for 82695 => PASSED
   ... = Checking for 82696 => PASSED



-------------------------------------------------
- Running check_gap on test_gap.txt ...
-------------------------------------------------

- [ test_gap.txt.1.uniq ] ... starting gap check
   ... = Checking for 79811 => PASSED
   ... = Checking for 79812 => FAILED
   ... = Checking for 79813 => PASSED
   ... = Checking for 79814 => PASSED
   ... = Checking for 79815 => PASSED
   ... = Checking for 79816 => FAILED
   ... = Checking for 79817 => FAILED
   ... = Checking for 79818 => FAILED
   ... = Checking for 79819 => PASSED

- [ test_gap.txt.2.uniq ] ... starting gap check
   ... = Checking for 86754 => PASSED
   ... = Checking for 86755 => PASSED
   ... = Checking for 86756 => FAILED
   ... = Checking for 86757 => PASSED
   ... = Checking for 86758 => PASSED
   ... = Checking for 86759 => FAILED
   ... = Checking for 86760 => FAILED
   ... = Checking for 86761 => FAILED
   ... = Checking for 86762 => FAILED
   ... = Checking for 86763 => FAILED
   ... = Checking for 86764 => FAILED
   ... = Checking for 86765 => PASSED

- [ test_gap.txt.3.uniq ] ... starting gap check
   ... = Checking for 82692 => PASSED
   ... = Checking for 82693 => PASSED
   ... = Checking for 82694 => FAILED
   ... = Checking for 82695 => PASSED
   ... = Checking for 82696 => PASSED
   ... = Checking for 82697 => FAILED
   ... = Checking for 82698 => FAILED
   ... = Checking for 82699 => FAILED
   ... = Checking for 82700 => PASSED



-------------------------------------------------
- Running check_gap on test_dup.txt ...
-------------------------------------------------

- loggrp = 1 contains duplicated copies of the log
      3 1_79812_01234567.arc
      2 1_79819_01234567.arc

- [ test_dup.txt.1.uniq ] ... starting gap check
   ... = Checking for 79811 => PASSED
   ... = Checking for 79812 => PASSED
   ... = Checking for 79813 => PASSED
   ... = Checking for 79814 => PASSED
   ... = Checking for 79815 => PASSED
   ... = Checking for 79816 => FAILED
   ... = Checking for 79817 => FAILED
   ... = Checking for 79818 => FAILED
   ... = Checking for 79819 => PASSED

- loggrp = 2 contains duplicated copies of the log
      3 2_86756_01234567.arc
      2 2_86758_01234567.arc

- [ test_dup.txt.2.uniq ] ... starting gap check
   ... = Checking for 86754 => PASSED
   ... = Checking for 86755 => PASSED
   ... = Checking for 86756 => PASSED
   ... = Checking for 86757 => PASSED
   ... = Checking for 86758 => PASSED
   ... = Checking for 86759 => FAILED
   ... = Checking for 86760 => PASSED

- loggrp = 3 contains duplicated copies of the log
      2 3_82692_01234567.arc
      4 3_82696_01234567.arc
      2 3_82700_01234567.arc

- [ test_dup.txt.3.uniq ] ... starting gap check
   ... = Checking for 82692 => PASSED
   ... = Checking for 82693 => PASSED
   ... = Checking for 82694 => PASSED
   ... = Checking for 82695 => PASSED
   ... = Checking for 82696 => PASSED
   ... = Checking for 82697 => FAILED
   ... = Checking for 82698 => PASSED
   ... = Checking for 82699 => FAILED
   ... = Checking for 82700 => PASSED
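The per-group split, duplicate report and gap check that x.ksh performs could also be collapsed into a single awk pass. This is only a sketch, not a drop-in replacement for the uploaded script; test_dup.txt is rebuilt here with made-up names so the example is self-contained:

```shell
# A tiny listing with one gap (79813) and one duplicate (79812).
printf '%s\n' 1_79811_01234567.arc 1_79812_01234567.arc \
              1_79812_01234567.arc 1_79814_01234567.arc > test_dup.txt

sort -t_ -k1,1n -k2,2n test_dup.txt |
awk -F'_' '
    { seen[$0]++ }                            # tally every name
    ($1 in prev) && $2 + 0 > prev[$1] + 0 {
        for (n = prev[$1] + 1; n < $2; n++)   # report any gap
            print "MISSING " $1 "_" n "_" $3
    }
    { prev[$1] = $2 }
    END {                                     # names copied 2+ times
        for (f in seen) if (seen[f] > 1) print "DUP x" seen[f], f
    }
' > report
cat report
```

The MISSING lines are ready to redirect to the exception list, and the DUP lines replace the separate sort | uniq -c step.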