Using sed, awk or perl to remove substring of all lines except the first

jacksolm · April 28, 2013, 5:41pm

Greetings All,

I would like to find all occurrences of a pattern and delete a substring from all matching lines EXCEPT the first. For example:

1234::group:user1,user2,user3,blah1,blah2,blah3
2222::othergroup:user9,user8
4444::othergroup2:user3,blah,blah,user1
1234::group3:user5,user1

This should be for all combinations of gid and user. If this can be accomplished using a sed or awk one liner that would be great. Otherwise, I guess I'll try a Perl script.

Any ideas to get me started will be greatly appreciated. I'm thinking a nested for loop with an internal sed call would be a start.

hanson44 · April 28, 2013, 5:53pm

It's unclear to me what output you want, and no point in guessing.

Also, please use code tags.

Don_Cragun · April 28, 2013, 7:42pm

I think this awk script does what you want (you can turn it into a 1-liner if you insist, but I prefer readable). Note that this script also removes one of the duplicated user ( blah ) entries from the input line:

4444::othergroup2:user3,blah,blah,user1

awk '
BEGIN { FS = OFS = ":" }
{       n=split($4, g, /,/)
        for(i = 1; i <= n; i++)
                if(($1,g[i]) in key) {
                        for(j = i + 1; j <= n; j++) g[j - 1] = g[j]
                        i--
                        n--
                        c = 1
                } else  key[$1,g[i]]
        if(c) { c = 0
                $4 = n ? g[1] : ""
                for(j = 2; j <= n; j++) $4 = $4 "," g[j]
        }
        print
}' data

With your sample input, this script produces the output:

1234::group:user1,user2,user3,blah1,blah2,blah3
2222::othergroup:user9,user8
4444::othergroup2:user3,blah,user1
1234::group3:user5

As always, if you're using a Solaris/SunOS system, use /usr/xpg4/bin/awk , /usr/xpg6/bin/awk , or nawk instead of awk .

jacksolm · April 28, 2013, 8:40pm

Here's a little background information. Due to group member limitations we had to split groups into separate lines by using different group names and identical GIDs. A few help desk admins maintaining the NIS map entered users into all of the groups instead of selecting the latest group to add the user. A single GID could have several group names. I want to loop through the group file and remove the redundant entries.
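
Before removing anything, it can help to list which entries are redundant. The following is only a minimal sketch, assuming the gid::group_name:user_list layout shown in the sample input; the function name report_dups is made up for illustration and is not part of any script posted in this thread.

```shell
# Report each user that shows up more than once under the same GID.
# Assumed layout (from the sample): gid::group_name:user1,user2,...
report_dups() {
    awk -F: '
    {
        n = split($4, u, ",")
        for (i = 1; i <= n; i++)
            # seen[] counts (gid, user) pairs; a non-zero count means a repeat.
            if (seen[$1, u[i]]++)
                print $1 ": " u[i] " repeated in " $3
    }' "$@"
}

# Hypothetical two-line input, fed on stdin:
printf '%s\n' \
    '1234::group:user1,user2' \
    '1234::group3:user5,user1' | report_dups
```

The report does not modify the file, so it can be run against a copy of the NIS group map to gauge how many entries would be affected.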

Don_Cragun · April 28, 2013, 10:41pm

That was what I understood when I wrote the awk script for you. Did you decide not to try it because it is more than 1 line?

jacksolm · April 29, 2013, 6:15pm

Don,

Sorry for the delay. I just had a chance to try your solution. I had a few typos which resulted in syntax errors. After debugging, the script was able to execute. It is strange that awk complained about the single quote before BEGIN. It didn't appear to delete the user from the redundant group. Any ideas? I will post the edited script. I appreciate your time and awk expertise.

Don_Cragun · April 29, 2013, 6:44pm

I suggest you copy the script I provided, save it into a file, and execute that file. When I ran that code it produced exactly the output I listed right after the script. If it isn't doing that for you, you must still have some typos.

What are the results of running the command:

uname -a

on your system? What shell are you using?

jacksolm · April 30, 2013, 7:03pm

Where is the genius button?! I was able to edit the script and change the variables around (I typed in the example in the wrong format) and voila! Many thanks to you for your awk expertise! To answer your question, it is running RHEL 5. Bash is the default. Thanks again!

awk '
BEGIN { FS = OFS = ":" }
{       n=split($4, g, /,/)
        for(i = 1; i <= n; i++)
                if(($3,g[i]) in key)
                {
                        for(j = i + 1; j <= n; j++) g[j - 1] = g[j]
                        i--
                        n--
                        c = 1
                } else  key[$3,g[i]]
        if(c)
        {
                c = 0
                $4 = n ? g[1] : ""
                for (j = 2; j <= n; j++) $4 = $4 "," g[j]
        }
        print
}'

jacksolm · May 18, 2013, 2:45pm

Don,

Thanks again for your assistance. I will keep this in my toolbox. One last request, if you would, please. Can you post a few comments/pseudo code explaining your approach to this problem? I got as far as the key statement. It appears that you are looping through the array of sids from the 4th column, combining the GID and each sid as a search key. How are you performing the delete? Are you just re-writing the array without the duplicate sid? Is this in a temporary array?

Don_Cragun · May 18, 2013, 6:18pm

g[] is a list of user names on the current line. Duplicates are eliminated by removing entries from the g[] array and, if it is changed, reconstructing the 4th field on the current line before writing the updated line. key[gid, user_name] is a two dimensional array that keeps track of what user names have been seen for the gid on the current line (ignoring the group name). Here is a copy of my original script with extensive comments added. Let me know if something still is not clear.

awk '
# Input file format:
#       gid:do_not_care:gname:uname_list
# where:"gid" is a numeric string specifying the group ID number,
#       "do_not_care" is ignored by this script,
#       "gname" is an alphanumeric group name, and
#       "uname_list" is a comma separated list of alphanumeric user names.
#
# The two dimensional array key[] is indexed by the "gid" and a user name.  The
# array starts out empty.  If key[$1, user name] is present, the user name has
# been seen with the "gid" (either earlier on this line or on an earlier line).
# The "gname" is ignored when making this determination, so when we are done, a
# user name will appear only once for each "gid".
BEGIN { # Set the input and output field separators to ":"
        FS = OFS = ":"
}
{       # Split the "uname_list" into n individual user names:
        n = split($4, g, /,/)
        # Update the array of user names seen for this "gid":
        for(i = 1; i <= n; i++)
                # Determine if we have seen this user name with this "gid"
                if(($1,g[i]) in key) {
                        # We have seen this user name with this "gid".  Remove
                        # this name from the list of user names on this line:
                        for(j = i + 1; j <= n; j++) g[j - 1] = g[j]
                        i--     # Repeat the check for user name i.
                        n--     # Decrease the # of user names on this line.
                        c = 1   # Note that we have changed this line.
                } else  # We have not seen this user name with this "gid".
                        # Add an entry for this combination:
                        key[$1,g[i]]
        # Check to see if we need to reformat the "uname_list" (because we
        # removed a user name from the list on this line).
        if(c) { # We do need to reformat the uname_list on this line:
                c = 0     # Clear the flag for the next line.
                # If there are any user names left in the list, initialize the
                # reformatted "uname_list" to the 1st user name that is left;
                # otherwise set the "uname_list" to the empty string.
                # Note that if you want to discard this "gname" if there are no
                # remaining user names in the updated "uname_list", you could
                # do that by replacing the following uncommented line with the
                # next two lines (after removing the "# " in both lines):
                # if(n == 0) next
                # $4 = g[1]
                $4 = n ? g[1] : ""
                # For each additional remaining user name (if any exist), add a
                # comma and that name to the reformatted "uname_list":
                for(j = 2; j <= n; j++) $4 = $4 "," g[j]
        }
        # Print the original or updated line.
        print
}'  data
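
For comparison, the same result can be reached without shifting array elements, by building the new user list while scanning. This is an alternative sketch, not the script posted above: the wrapper name dedupe_groups and the variables out and m are invented for illustration, and the field layout is assumed to match the sample input.

```shell
# Alternative sketch: rebuild the user list instead of shifting g[] in place.
dedupe_groups() {
    awk '
    BEGIN { FS = OFS = ":" }
    {
        n = split($4, g, ",")
        out = ""
        m = 0                          # number of user names kept so far
        for (i = 1; i <= n; i++) {
            if (($1, g[i]) in key)     # already seen for this gid: skip it
                continue
            key[$1, g[i]]              # remember this (gid, user) pair
            out = (m++ ? out "," g[i] : g[i])
        }
        # Only touch $4 when something was dropped, so other lines print as-is.
        if (out != $4) $4 = out
        print
    }' "$@"
}

# Sample input from the first post:
printf '%s\n' \
    '1234::group:user1,user2,user3,blah1,blah2,blah3' \
    '2222::othergroup:user9,user8' \
    '4444::othergroup2:user3,blah,blah,user1' \
    '1234::group3:user5,user1' | dedupe_groups
```

The trade-off is one extra string per line in exchange for no inner shifting loop; for a group file of ordinary size either approach should behave the same.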

jacksolm · May 19, 2013, 12:10pm

Now, I see the missing piece. I was missing the algorithm to remove an item from the middle of an array. In other words:

To delete an element from the middle of an array:

1. There is no need to do anything with the element deleted.
2. Iterate along the array from 1 after the deleted element, to the last element.
3. Copy each element into the location 1 before it.
4. Set the last element to null.

Thanks again for sharing your knowledge and expertise!

Don_Cragun · May 19, 2013, 12:38pm

You're welcome. You almost described what the script is doing. Rather than setting the last element to null, the script I provided just reduces the number of elements to be processed in the array (that is what the n-- does). The former last element is still there unchanged, but it will be ignored when updating the last field just before printing the modified line. (Note that all of the elements in the array will be deleted and replaced with new elements the next time split() is called when awk is processing the next line.)
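
The shift-and-shrink step can be seen in isolation with a small sketch. The function name shift_demo and the sample list a,b,c,d are made up for illustration; only the inner loop and the n-- mirror the script in this thread.

```shell
# Remove the 2nd element of a 4-element list by shifting and shrinking n.
shift_demo() {
    awk 'BEGIN {
        n = split("a,b,c,d", g, ",")
        i = 2                                        # index of the element to drop
        for (j = i + 1; j <= n; j++) g[j - 1] = g[j] # shift later elements left
        n--                                          # shrink the logical length only
        for (j = 1; j <= n; j++) printf "%s%s", g[j], (j < n ? "," : "\n")
        print "stale g[4] = " g[4]                   # old last element is still stored
    }'
}

shift_demo
# prints:
# a,c,d
# stale g[4] = d
```

If the leftover slot ever mattered, `delete g[n + 1]` right after the `n--` would remove it; in the script above that is unnecessary because only the first n elements are read back.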