Help - manipulate data by columns and repeated entries

Hello, good afternoon to everyone.
I'm new to the forum and would like to request your help in handling data. I hope my English is clear.

I have a file (Dato01.txt) that contains the following structure.

# Col1  -  Col2 - Col3 - Col4
Patricia started Jun 22 05:22:58
Carolina started Jun 22 05:23:03
Carolina started Jun 22 05:23:37
Andrea   started Jun 22 05:25:52
Ana      started Jun 22 05:26:11
Andrea   started Jun 22 05:26:52

I need to move the newer repeated entries into a second file and leave only the oldest entry for each name in the original file. It should look like this:

(Dato01.txt)

# Col1  -  Col2 - Col3 - Col4
Patricia started Jun 22 05:22:58
Carolina started Jun 22 05:23:03
Andrea   started Jun 22 05:25:52
Ana      started Jun 22 05:26:11

(Dato02.txt)

# Col1  -  Col2 - Col3 - Col4
Carolina started Jun 22 05:23:37
Andrea   started Jun 22 05:26:52

I tried it with "for, uniq, grep" but cannot find the right formula. If someone can help me, thank you very much.

Handling dates is one of the most difficult tasks, especially with non-numeric month values. Fortunately, my sort (GNU coreutils 8.25) offers the -M (--month-sort) option. If yours does too, try

sort -k1,1 -k3M Dato01.txt | awk 'T[$1] {print > "Dato03.txt"} !T[$1]++ {print > "Dato02.txt"}'
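As a quick illustration of what the month-sort flag does (a minimal sketch, assuming a sort that supports -M, such as GNU sort):

```shell
# With -M, three-letter month abbreviations sort chronologically
# rather than alphabetically.
printf '%s\n' Feb Jan Dec Mar | sort -M
```

This prints Jan, Feb, Mar, Dec (chronological order), whereas a plain sort would print Dec, Feb, Jan, Mar.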
infile=Dato01.txt
outfile=Dato02.txt

[[ ! -f $infile.bk ]] && cp $infile $infile.bk

ex $infile <<EDIT
$(awk '
  NR==1 {print ":" NR " w " outfile; next;}
  {if (a[$1]++) r[c++]=NR}
  END {
     for (i=0; i<c; i++) print ":" r[i] " w >> " outfile;
     for (i=c-1; i>=0; i--) print ":" r[i] " d";
     print ":wq!";
  }
' outfile=$outfile $infile)
EDIT

Thank you very much, rdrtx1, the script does exactly what I need. Would it be too much to ask if you could explain it a little? Someone else may have the same question as I do, and comments would be of great help. Again, many thanks for helping with my question.

infile=Dato01.txt                                            # set input file name variable
outfile=Dato02.txt                                           # set second file (repeat lines) name variable

[[ ! -f $infile.bk ]] && cp $infile $infile.bk               # backup input file

ex $infile <<EDIT                                            # invoke inline editor (ex) for input file (ex script built by awk)
$(awk '                                                      # use awk to build commands for ex
  NR==1 {print ":" NR " w " outfile; next;}                  # write first line to second file (done by ex)
  {if (a[$1]++) r[c++]=NR}                                   # build repeat lines array
  END {
     for (i=0; i<c; i++) print ":" r[i] " w >> " outfile;    # write repeat lines to second file (done by ex)
     for (i=c-1; i>=0; i--) print ":" r[i] " d";             # delete repeat lines from input file (done by ex)
     print ":wq!";                                           # write input file (done by ex)
  }
' outfile=$outfile $infile)                                  # end of awk
EDIT                                                         # end of ex script
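To make the mechanism concrete: for the sample Dato01.txt above (seven lines including the header; the repeated Carolina and Andrea entries fall on lines 4 and 7), the awk program would generate an ex script along these lines:

```
:1 w Dato02.txt
:4 w >> Dato02.txt
:7 w >> Dato02.txt
:7 d
:4 d
:wq!
```

The repeats are deleted from the bottom of the file up, so that removing one line does not shift the line numbers of the deletions still to come.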

Here is another way to do what rdrtx1 was doing just using awk to create two output files and cp to copy the updated version of the input file back to the input file when it is done. Of course, both of these suggestions depend on entries in your input files always being in increasing time order (as in your sample data):

#!/bin/ksh
# We can't use awk to overwrite the input file directly, so we create a
# temporary output file with the lines from the input file that are to be kept
# and a duplicate output file with the lines for names that appear two or more
# times in the input file.
#
# When awk completes, if it was successful, we'll copy the temporary output file
# back to the input file.  Otherwise, the input file will not be changed.

InFile="Dato01.txt"		# Name the input file.
DupFile="Dato02.txt"		# Name the output file for duplicates.
TempFile="$InFile.$$"		# Name the temporary output file.

trap 'rm -f "$TempFile"' EXIT	# When the script completes, remove the temp file.

awk -v new="$TempFile" -v dup="$DupFile" '
NR == 1 {
	# Copy the header line from the input file to both output files.
	print > new
	print > dup
	next
}
{	if($1 in seen) {
		# We have seen this person before.  Copy this line to the
		# duplicates file.
		print > dup
	} else {
		# We have not seen this person before.  Copy this line to the
		# temporary file (which will replace the input file when we are
		# done).
		print > new

		# Note that we have seen this person.
		seen[$1]
	}
}' "$InFile" && cp "$TempFile" "$InFile"
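As a self-contained sanity check of that splitting logic, here is a run on the sample data (using a hypothetical Dato01.new temporary file in place of $InFile.$$):

```shell
#!/bin/sh
# Recreate the sample input file from the question.
cat > Dato01.txt <<'EOF'
# Col1  -  Col2 - Col3 - Col4
Patricia started Jun 22 05:22:58
Carolina started Jun 22 05:23:03
Carolina started Jun 22 05:23:37
Andrea   started Jun 22 05:25:52
Ana      started Jun 22 05:26:11
Andrea   started Jun 22 05:26:52
EOF

# First occurrence of each name goes to the temp file, later
# occurrences go to the duplicates file; the header goes to both.
awk -v new="Dato01.new" -v dup="Dato02.txt" '
NR == 1    { print > new; print > dup; next }
$1 in seen { print > dup; next }
           { seen[$1]; print > new }
' Dato01.txt && cp Dato01.new Dato01.txt
rm -f Dato01.new

cat Dato01.txt
```

Afterwards Dato01.txt holds the header plus the four oldest entries, and Dato02.txt holds the header plus the two newer repeats, matching the expected output shown in the question.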

This was written and tested using a Korn shell, but it should work with any shell that uses basic Bourne shell syntax (including ash, bash, dash, ksh, zsh, and several others; but not csh and its derivatives).

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .