Reading and writing in same file

Hi All,

Here is my requirement. I am grepping through log files and cutting some fields out of them to generate a CSV file. Now I have to check whether the 2nd field holds a certain fixed value; if it does, I have to use the 4th field to run another grep on the same log and retrieve the corresponding correct value to replace the fixed value in the 2nd field.

Example:-

12345,none1111,55,link1
56789,dsadsad,66,ewqrrwe
23456,none1111,77,link2
65655,yuytuytds,88,ertywd

If the second column equals 'none1111', then I have to run a grep command on the same log file, searching for link1.
Suppose that grep returns the value 'abcde' for the first row; then I have to replace 'none1111' with 'abcde'.
The same applies to link2.

Final output:-

12345,abcde,55,link1
56789,dsadsad,66,ewqrrwe
23456,efghi,77,link2
65655,yuytuytds,88,ertywd
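
(For what it's worth, once the link-to-value pairs are known, the replacement itself is a small two-pass awk job. A minimal sketch, assuming the pairs have already been extracted into a separate file; map.csv and data.csv are hypothetical names:

awk -F, -v OFS=, '
    NR == FNR { map[$1] = $2; next }                    # pass 1: remember link -> value pairs
    $2 == "none1111" && ($4 in map) { $2 = map[$4] }    # pass 2: patch the default value
    { print }
' map.csv data.csv

where map.csv would contain lines like link1,abcde.)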

Thanks in Advance.

Where are 'abcde' and 'efghi' in your sample (original) data?

tyler_durden

Hi tyler,

  1. 'abcde' and 'efghi' won't be in the original data. That is what I tried to explain:
    the original data does not contain the expected value. 'none1111' is the default value used when no proper data is found by the 1st grep command.
  2. So we have to run a 2nd grep command to replace those default values with the expected values, with the help of the 4th column.
  3. The 4th column is the only connector between the two different logs. From the first set of data we need to pick up the 4th column (wherever the 2nd column is 'none1111') and look in the other log for the same 4th column, to pick up the right 2nd column from that log.

Example:- First set of data

  1. Consider that some grep command gave the result below.
12345,none1111,55,link1
56789,dsadsad,66,ewqrrwe
23456,none1111,77,link2
65655,yuytuytds,88,ertywd
  2. A 'none1111' in the second column means we have to run another grep command to pick up the right 2nd column value.
    Suppose we ran a different grep command which takes 'link1' as input and returns the expected 2nd column, i.e. 'abcde'.
  3. Then we have to replace all the 'none1111' values by iterating over the first set of data and following the same procedure.

Hope that makes it a bit clearer.

kinda long.

#!/bin/sh

# For each CSV line: if it contains "none1111", map its 4th field to the
# proper replacement value; otherwise pass the line through unchanged.
while read line
do
        flag=`echo "$line" | grep "none1111"`
        if [ ! -z "$flag" ] ; then
                lastcol=`echo "$line" | awk -F, '{print $4}'`
                if [ "$lastcol" = "link1" ] ; then
                        echo "$line" | sed 's/none1111/abcde/' >> out
                elif [ "$lastcol" = "link2" ] ; then
                        echo "$line" | sed 's/none1111/efghi/' >> out
                fi
        else
                echo "$line" >> out
        fi
done < inputfile

What is clear to me is:

  1. You have a log file. We don't know what this log file looks like.
  2. You have run grep on this log file and manipulated the lines that came back to produce a CSV file containing four columns.
  3. You won't tell us how you manipulated the data you grepped to produce the CSV file.
  4. You have another log file. We don't know what this log file looks like either.
  5. You want to match the 4th field in the CSV file you created against something in this second log file and replace none1111 with something else in the second log file.
  6. You aren't giving us enough data to help you.

Show us the names and contents of both log files. Explain the procedures you use to create your CSV file from the first log file. Show us the data in the 2nd log file that we are to match and show us how we determine the replacement for none1111 when we find the correct line in the 2nd log file. Then we may be able to help you.

Note that ryandegreat25 gave you a script that does what you have requested, for the values you've shown us, but almost certainly is not a general solution to the problem you're trying to solve.

Please give us enough information to be able to help you.:wall:

While each of Don Cragun's arguments holds true, just for the exercise I'm trying to infer the logic from your first post: field 4 is the link that glues together two records, of which field 2 is filled with "none1111" if the system did not have the correct value yet at the time of file creation. Both records need to be included at their original location in the file, and "none1111" needs to be replaced by the correct value found later in the same file for that field 4 value. I'm sure there will be much more elegant solutions, but, as you are grepping several times, so do I. We need to run through the input file at least three times: a) to find the field 4 values, b) to find the field 2 values, and c) to replace. Step b) will be repeated for every pattern occurrence in the input file. Here we go:

grep none1111 infile |
      { IFS=","; while read a b c d; do grep $d infile |
            { while read e f g h; do [ $f != 'none1111' ] &&
                  echo s/none1111\\\(.*$h\\\)/$f\\1/ >>sedfile; done;
            }; done
      }; sed -f sedfile infile; rm sedfile
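
To illustrate what lands in sedfile: assuming infile also contains a resolving record such as 99999,abcde,11,link1 (the sample data in the first post has no such line, so this is a hypothetical example), the inner loop would emit

s/none1111\(.*link1\)/abcde\1/

which the final sed pass then applies to every none1111 record ending in link1.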

Well, this one would remove the problem of repeated grep runs using xargs (whose known problem hopefully won't hit us in this case):

grep none1111 infile |
      { IFS=","; while read a b c d; do echo -n $d"|"; done; echo "#"; } |
      xargs -I xy grep -E "xy" infile |
      { IFS=","; while read e f g h; do [ $f != 'none1111' ] &&  echo s/none1111\\\(.*$h\\\)/$f\\1/ >>sedfile; done; }
      sed -f sedfile infile; rm sedfile

And, finally, a "oneliner" in which all parameters for sed are being created in a command substitution:

sed $(
      grep none1111 infile | cut -d, -f4 | xargs -Ixy grep xy infile | cut -d, -f2,4 | grep -v none1111 |
      sed 's/\(.*\),\(.*\)/-e s#none1111\\\(.*\\\)\2#\1\\1\2#/'
     ) infile
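
With the same hypothetical resolving records in infile, the command substitution would expand to arguments along the lines of

-e s#none1111\(.*\)link1#abcde\1link1#

so the outer sed rewrites each none1111 field on lines ending with the matching link value.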

Like awk better? Try this:

awk     'BEGIN {FS=OFS=","}
         {n = split($0, g); j++; for (i=1; i<=n; i++) h[j,i]=g[i]}
         !  /none1111/ {f[$4]=$2}
         END {for (i=1; i<=j; i++) print h[i,1], h[i,2]=="none1111" ? f[h[i,4]] : h[i,2], h[i,3], h[i,4]}
        '  infile
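
Like the pipelines above, this assumes the resolving records are present in the same infile: e.g. a hypothetical line 99999,abcde,11,link1 would set f["link1"]="abcde", which the END loop then substitutes wherever field 2 is "none1111".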

Let me give you a complete example of what I am trying to achieve.

  1. Below is the log file structure; I need the 2nd, 5th and 14th columns of the log lines after grepping for linkId=1ddoic.

Log file structure:-

abc.com 20120829001415 127.0.0.1 app none11111 sas 0 0 N clk Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 referrer=google.com linkId=1ddoic

abc.com 20120829001416 127.0.0.1 dyn UD3BSAp8appncXlZ UD3BSAp8app4xHbz 0 0 N page Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 segments=

  2. Now in the 1st log line you can see that the 5th column is invalid (none11111), so I have to look for the actual 5th column value. The 'id' column will help you find it: you have to run another grep based on the 'id' value to find the actual 5th column in the same log file.
  3. The second log line has the exact matching 'id' value. So what I have to do is take the 5th column (UD3BSAp8appncXlZ) from the second line instead of the invalid one (none11111).

Output:-

20120829001415, UD3BSAp8appncXlZ, linkId=1ddoic

Note:- I have a bunch of log files where I have to perform the above procedure, but I have to come up with a single file as output after grepping through all the log files.
The file names have a format like abc-2012-10-01_00000, abc-2012-10-01_00001, ... etc.

Hope it is clear this time. :slight_smile:
Thanks for looking into it.

This is a big improvement over what you have posted before, but there are still some ambiguities.

You say that you're showing the log file structure, that you need fields 2, 5, and 14, and then show two lines from one or two log files. Note that the second record has 13 fields (and the last field appears to be incomplete); not 14. If we are to determine what is supposed to happen, we need to know whether or not field 14 in both lines has the same value ( linkId=1ddoic ). (I.e., do both of these lines appear in the output of your first grep:

grep "linkId=1ddoic" log_file

And, PLEASE USE CODE TAGS when presenting file contents!

Let me try restating the problem to determine if I understand what you want done:

  1. In some places you say there are two log files, in other places you say there is one log file but you grep it twice. Which is it?
    If there is a single log file, both greps and the conversion to the desired output can be done by reading the log file just once with awk if the output order doesn't matter.
  2. The first time you read the log file, you look for entries in column 14 that match a given value (linkId=xxx) and ignore anything that doesn't match.
  3. For lines that were selected in step 2, if column 5 is not "none1111", skip to step 5.
  4. Read the same log file again (or read the second log file) looking for a line in the log where field 12 (id=yyy) matches field 12 in the line matched in step 3 AND column 5 is not "none1111". Use the value found in column 5 of this line as a replacement for field 5 in the line matched by step 3.
  5. Print column 2 (from the line matched in step 3), a comma, column 5 (from the line matched in step 3 [updated by the line found in the second reading of the log file if it contained "none1111" in the line matched in step 3]), a comma, and column 12 (from the line matched in step 3 with "id=" at the start of the field removed).
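
To pin down this restatement, here is a minimal awk sketch of steps 2 through 5, assuming a single log file read twice and ignoring output order (the file name and linkId value are placeholders):

awk -v link="linkId=1ddoic" '
    NR == FNR { if ($5 != "none11111") id[$12] = $5; next }   # 1st read: collect valid field-5 values per id
    $14 == link {                                             # 2nd read: select lines for this linkId
        f5 = ($5 == "none11111" && ($12 in id)) ? id[$12] : $5
        print $2 "," f5 "," $14
    }' logfile logfile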

Is this algorithm correct?

Is there one input log file or two? What is its (or are their) name(s)?

Is step 4 only supposed to be performed for lines that have the same contents in field 14? Or is any field 14 value OK, as long as the contents of field 12 match a field 12 in a line whose field 5 doesn't contain "none1111"?

Note that this is three comma-separated values, not the four values that were specified in the first several messages of this thread. Is this correct? If not, where does the other output field come from?

Note also that the early messages specified "," as the separator between fields, but in the latest messages you specify ", " instead of "," . Is "," the correct separator?

Does the order of the output lines matter?

The Note in your message doesn't make things clear at all. We have not seen anything like this list of values in any of the samples you have shown us. Are you saying you have to create a file with a single line containing a comma-separated list of an unspecified number of entries, each formed as if by the printf format string "abc-%s_%05d", where the %s prints something that comes from a date utility format string %Y-%m-%d and the %05d prints a sequence number? Please explain what the entries in this list mean, how many of them there are, and why this list is useful!

Hi Don,

I am trying my best.
This is my first grep command.

grep -e linkId=1ddoic abc-2012-10-01_000* | cut -f 2,5,14 | sort| uniq
  1. Yes. Though the initial fields (up to the 10th column) are constant across all types of log entries, the others will vary. The two example log lines I have given were generated for two different events, so you will not get the linkId attribute in the 2nd log entry. You do not even need to bother about that, because you just need to pick the 5th column from the 2nd line and fix the 1st line after checking that the id fields match. But in a real scenario you have to grep through the entire log file to look for the id value found in the 1st line.

i) I have multiple log files to grep through (like abc-2012-10-01_00000, abc-2012-10-01_00001, ... etc.) and I output the 2nd, 5th and 14th columns.
ii) While grepping through all the log files, those invalid 2nd columns appear, which is not intended. The valid 2nd columns can be found in the same file where the invalid ones were found, by looking for the matching 'id' attribute value. It is up to you if you can achieve my goal in a single grep.

My algorithm:-
i) For each file, run the above grep.
ii) For each row returned, if the 2nd column is invalid (none11111),
iii) run another grep on the same file and replace the invalid 2nd column with the value found.
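
In other words, something like this untested sketch (field positions and the none11111 marker as described above):

for f in abc-2012-10-01_000*; do
    grep linkId=1ddoic "$f" |
    while read -r line; do
        set -- $line                     # split the log line on whitespace
        ts=$2 val=$5 id=${12} link=${14}
        if [ "$val" = "none11111" ]; then
            # 2nd grep: pick a valid 5th column for the same id in the same file
            val=`grep "$id" "$f" | awk '$5 != "none11111" {print $5; exit}'`
        fi
        echo "$ts,$val,$link"
    done
done | sort -u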

If the two sample lines from your log files are as you have shown in past posts, the command line you specify above is equivalent to the command:

grep -e linkId=1ddoic abc-2012-10-01_000* | sort -u

There are no tab characters in your input files, so the cut command in your pipeline is a no-op. So this command line throws away duplicate lines found in your log files and sorts the remaining lines on the first field. It does NOT limit the output to columns 2, 5, and 14 from your input files, and it does NOT produce a CSV file with three fields (and if it did, it wouldn't contain the id=value fields that you say are to be used in a second grep to look up the invalid values found while processing the output of your first grep).
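
If the fields are space separated, a delimiter-aware cut would at least select the intended columns. A sketch (not a complete solution):

grep -e linkId=1ddoic abc-2012-10-01_000* | cut -d' ' -f 2,5,14 | sort -u

but the output would then be space separated rather than comma separated, and it still would not carry the id=value field needed for your second grep.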

In message #8 in this thread, I asked nine questions. You partially answered some of the questions although, as noted above, the answer doesn't match the other statements you've made.

I want to help you solve this problem, but if you won't answer the questions (and give answers that match your data), it is obvious that I'm wasting my time. :wall: If you would like us to try to give you a working solution please answer ALL of these questions:

  1. What are the actual commands you execute to convert your log files into the CSV file that you want processed?
  2. Does abc-2012-10-01_000* match the names of all of the log files (and only those log files) that you want to process?
  3. When you find none11111 in your CSV file, will the id=xxx field ever match more than one line (not containing none11111 ) in your log files that aren't exact duplicates of other lines?
  4. Am I correct in assuming that the line matching the id=xxx field with the value needed to replace none11111 in your CSV file, will not be on a line that was selected by a grep on the linkId field you're processing?
  5. Is the field separator you want in your output file "," or ", " ?
  6. Does the order of lines in your output file matter?
  7. What is the purpose of having an additional single-line output file containing a comma separated list of all of your log files? If you need a file containing a list of the log files processed, wouldn't it be better to have the filenames on separate lines instead of separated by commas on a single line?
  8. Will the linkId=zzz field ever appear in any log file that isn't exactly of the same form as the following example line from one of your log files?
    abc.com 20120829001415 127.0.0.1 app none11111 sas 0 0 N clk Mozilla/5.0 id=82c6a15ca06b2372c3b3ec2133fc8b14 referrer=google.com linkId=1ddoic

Hi Don,

Thank you for your cooperation. Here I am trying to list the answer of your questions.

  1. The actual command is:

grep -e linkId=1ddoic abc-2012-10-01_000* | cut -f 2,5,14 | awk '{$1=$1;print}' OFS=, > /tmp.output.xls

  2. Yes.
  3. Yes, it matches more than one line.
  4. Yes, of course. It will never be on the same line.
  5. Only a comma. No space.
  6. Yes, it matters. It has to be in sorted order of timestamp.
  7. This is not an additional file. This is the output file that I use as input to generate the final output: after I create it (with the invalid 'none1111' fields replaced), I read that file, make some database calls on top of those values, and then create a report.
  8. Yes, it will appear. To resolve that problem we have to run a second grep like the one below.

grep "id=82c6a15ca06b2372c3b3ec2133fc8b14" abc-2012-10-01_000* | grep -E 'page|clk'

The purpose of running the above grep is that the id

82c6a15ca06b2372c3b3ec2133fc8b14

can appear in two different events, either 'page' or 'clk'. We can take the 5th column from either of those lines. Also, this line will be found in the same file where 'none11111' was found.
Suppose for linkId=1ddoic we found an invalid 'none11111' value in the 5th column of log file abc-2012-10-01_00002; then the corresponding id with the proper 5th column will be found in file abc-2012-10-01_00002 only.

Thank you a lot Don for looking into it.

The command lines you have shown, combined with the two sample lines from your log files, don't come close to providing the data that you say they will. I also note that your last post (message #11 in this thread) is the first time you mention anything about log file field #10 being used to determine the final report.

I have tried to interpret your requirements and come up with a script that should come close to what you have said you need. Given that you have only let us see one complete log file line and one abbreviated log file line, I have low confidence that this will actually do what you want, but I believe it meets the requirements you've been willing to share.

To try it out, save the following script in a file named match2:

#!/bin/ksh
# match2 -- Produce report from log files
#
# Usage: match2 keyId_value output_file log_file...

Usage="Usage: %s keyId_value output_file log_file...
    Output records are created for every unique input record with field 14
    that is \"keyId=keyId_value\".  The output records contain fields 2,
    5, and 14 from the selected input records.  If input field 5 is
    \"none11111\", \"none11111\" will be replaced by the contents of field 5
    in any other record in the log files that has the same value in
    field 12 as field 12 in the selected input record that does not have
    \"none11111\" in field 5.  Output records will be sorted by performing
    a numeric sort on the first output field.  The sorted output will be
    written to the file named by \"output_file\" (or to standard output
    if \"output_file\" is \"-\".\n"

base="$(basename "$0")"
if [ $# -lt 3 ]
then    printf "$Usage" "$base" >&2
        exit 1
fi
matchId_value="$1"
if [ X"$2" = "X-" ]
then    cmd="sort -n -u"
else    cmd="sort -n -u -o \"$2\""
fi
# Shift away the keyId_value and output_file operands that have already
# been saved.  This leaves just the log_file operands in "$@".
shift 2
awk -v base="$base" -v cmd="$cmd" -v link="linkId=$matchId_value" 'BEGIN {
        # Indicates exit code to use if not 0
        ec = ""
        # Set output field separator to a comma.
        OFS = ","
        # Value in field 5 indicating that the correct value is unknown.
        unknown = "none11111"
}
$14 == link {
        # Gather data for an output record for this keyId...
        # o1[x] and o2[x] are output fields 1 and 2; output field 3 is
        # a constant (link), so it does not need to be preserved from
        # each line.  nm[x] contains the "id=" fields needed to find a
        # matching record for records that did not have a valid field 5
        # when the log entry was created...
        o1[NR] = $2
        if($5 == unknown) nm[NR] = $12
        else    o2[NR] = $5
        next
}
$5 != unknown  && ( $10 == "page" || $10 == "clk" ) {
        # Save a field 5 value for the id specified by field 12...
        id[$12] = $5
}
END {   # Fill in the missing o2[x] output fields...
        for (i in nm) if((o2[i] = id[nm[i]]) == "") {
                # Set o2[x] to the unknown value if no matching field
                # was found, and set the final exit code to indicate
                # that at least one entry had no match.
                o2[i] = unknown
                printf("%s: No valid field 5 found for %s\n", base, nm)
                ec = 2
        }
        # Write and sort the completed output records.
        for (i in o1) {
                print o1[i],o2[i],link | cmd
        }
        exit ec
}' "$@" >&2
exit

Make it executable by running the command:

chmod +x match2

and invoke it with:

match2 1ddoic output_file abc-2012-10-01_000*

to produce a report containing log file entries found in the log files you specified for linkId=1ddoic sorted by timestamp in the file named output_file.

Although this script specifies ksh, it should also work with sh and bash. (It won't work with csh or tcsh.)

I hope this helps.