Removing decimals using sed

Panri93 · April 22, 2021, 10:39am

Hi guys, I'm new to the sed command.

I have the following dataset:

PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
431;Yes;Bjornstrom-Steffansson,  Mr.Mauritz Hakan;male;28.0;0;0;110564;26.55;C52;S
664;No;3;Coleff, Mr.Peju;male;36.0;0;0;349210;7.4958;;S

As you can see, the attribute Age has its values represented as decimals. I want to remove the decimals and leave the values as integers. How could I do this using sed?

Thanks in advance

drysdalk · April 22, 2021, 10:47am

Hi,

Can I just check one thing here before we proceed: you've recently asked a separate question about a very different data set, and that one was specifically about awk rather than about sed, as this one is. Are these questions related to academic work or coursework at all ? If they are that's absolutely fine, but we really should know that up-front before anyone proceeds any further here, since there are rules about what kind of assistance we can give you if we are ultimately talking about coursework.

Panri93 · April 22, 2021, 11:01am

Hi drysdalk, of course. I'm exploring the dataset titanic-passengers.csv for training purposes with sed. Since I'm new at it, it is difficult to find info about it.

drysdalk · April 22, 2021, 1:13pm

Hello,

OK, that's fine - thanks for confirming that this is not directly related to any academic work or coursework that you need to complete.

So, one possible solution in sed would be this:

sed 's/\(le;[0-9]*\)\.[0-9]/\1/g'

Here, we're using sed's substitution command, s, to search for the following pattern:

\(le;[0-9]*\)\.[0-9]

So we're looking for all occurrences of the characters le; that are immediately followed by zero or more digits (here, the whole-number part of the age), followed by a full stop, and another digit.

The first bit of this is quite straightforward - the age comes immediately after the sex of the person in the passenger list. And that will always (presumably, at any rate, given the age of the data) be marked down as "male" or "female" - in other words, it will always end with "le". And fields are separated with a semi-colon. So that's the reason to search for le; here - to clearly mark the boundary between the gender, and the age.

Now you might be wondering what's with the brackets here. In sed's default (basic) regular expression syntax they are written as \( and \), and they have the effect of marking this part of the pattern we're searching for as a capture group. A capture group is a set of characters in the pattern that we can refer to later by number. And not coincidentally, the portion of the pattern inside the capture group is the one we'll want to keep (i.e. the bit before the decimal point).

After the brackets, we specify the full stop and a single digit. That fits the ages in your sample, which have just one digit after the decimal point. Note that the pattern removes exactly one digit, so an age with a longer fractional part would be left with stray digits. The other floating-point number in your input, the ticket price, is left alone because it is not immediately preceded by le; and so never matches this pattern.

Now, let's look at what we're going to replace this whole pattern with:

\1

That at first might seem very strange. We don't want to change every one of these ages to be 1, do we ? No, we don't - and you'll notice that we've escaped the 1 with a backslash. That's because, in the replacement part of a sed substitution, a backslash followed by a digit refers to the capture group with that number.

So this is the reason we made part of our search pattern a capture group earlier. What we're saying to sed here is to replace everything in the previously-given search pattern with just the contents of the first capture group, which, you may recall, was everything prior to the decimal point. The end result, then, is to chop the decimal portion off of all the ages in the input.
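
To make the capture group visible, here is a small sketch on a made-up fragment (the input string is hypothetical, not a row from the file). The replacement wraps \1 in square brackets purely so you can see what the group captured:

```shell
# Hypothetical fragment of a row; the square brackets in the replacement are
# only there to show what the capture group \1 holds.
echo 'male;28.0;0' | sed 's/\(le;[0-9]*\)\.[0-9]/[\1]/'
# prints: ma[le;28];0
```

Everything the pattern matched (le;28.0) is replaced by the brackets plus the group's contents (le;28), so the .0 is what disappears.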

Let's see what happens if we run this against the sample data:

$ cat test.txt
PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
431;Yes;Bjornstrom-Steffansson,  Mr.Mauritz Hakan;male;28.0;0;0;110564;26.55;C52;S
664;No;3;Coleff, Mr.Peju;male;36.0;0;0;349210;7.4958;;S
$ sed 's/\(le;[0-9]*\)\.[0-9]/\1/g' test.txt
PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
431;Yes;Bjornstrom-Steffansson,  Mr.Mauritz Hakan;male;28;0;0;110564;26.55;C52;S
664;No;3;Coleff, Mr.Peju;male;36;0;0;349210;7.4958;;S
$

And there we go - the ages were converted to integers, and the other floating point number, the ticket price, was not affected at all.
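
One caveat, as a sketch only: the two sample rows have exactly one digit after the decimal point, and the pattern removes exactly one. Assuming the full file might contain an age with a longer fraction (the row below is invented for illustration), writing [0-9]* after the full stop would remove the whole fractional part:

```shell
# Hypothetical row with a two-digit fraction in the Age field
# (invented for illustration, not taken from the thread's sample).
row='1;Yes;1;Doe, Master.John;male;0.42;0;0;1234;26.55;;S'

# Original pattern: removes only ".4", so the age becomes "02".
printf '%s\n' "$row" | sed 's/\(le;[0-9]*\)\.[0-9]/\1/'
# prints: 1;Yes;1;Doe, Master.John;male;02;0;0;1234;26.55;;S

# Variant: "[0-9]*" after the full stop removes every digit of the fraction.
printf '%s\n' "$row" | sed 's/\(le;[0-9]*\)\.[0-9]*/\1/'
# prints: 1;Yes;1;Doe, Master.John;male;0;0;0;1234;26.55;;S
```

Either way the fare (26.55) is untouched, because it is not preceded by le;.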

Hope this helps ! If you have any questions, please let us know and we can take things from there.

Scrutinizer · April 22, 2021, 3:28pm

One thing to note is that the first record in the sample dataset is missing the 3rd field.
If the number of fields is correct, then with GNU sed you could alternatively change only the 6th field, for example like this:

sed -r 's/(\.[0-9]*)?;/;/6' test.txt
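
As a rough illustration of how the trailing 6 behaves: every match of this pattern ends in exactly one semicolon, so the 6th match is the end of the 6th field. The sketch below runs the command on each sample row separately. It uses -E, which GNU sed accepts as another spelling of -r:

```shell
# Second sample row: all 12 fields are present, so Age is field 6.
printf '%s\n' '664;No;3;Coleff, Mr.Peju;male;36.0;0;0;349210;7.4958;;S' |
  sed -E 's/(\.[0-9]*)?;/;/6'
# prints: 664;No;3;Coleff, Mr.Peju;male;36;0;0;349210;7.4958;;S

# First sample row: Pclass is missing, so field 6 is SibSp ("0") and the
# age in field 5 is left as 28.0; the row comes out unchanged.
printf '%s\n' '431;Yes;Bjornstrom-Steffansson,  Mr.Mauritz Hakan;male;28.0;0;0;110564;26.55;C52;S' |
  sed -E 's/(\.[0-9]*)?;/;/6'
```

So the positional approach only does the right thing when every record really has its Age in field 6.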

This approach works for any field except the last one, since that one is not terminated with a semicolon. That case could be catered for like so:

sed -r 's/(\.[0-9]*)?(;|$)/\2/6' test.txt
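
A small sketch of what the (;|$) alternative changes, using an invented three-field record rather than the Titanic data. \2 puts back whatever ended the field: a semicolon, or nothing at the end of the line.

```shell
# Hypothetical three-field record (not from the dataset).
# Field 3 is the last field, so it ends at end-of-line rather than at ";".
echo 'a;1.5;2.25' | sed -E 's/(\.[0-9]*)?(;|$)/\2/3'
# prints: a;1.5;2

# The same pattern still works on a middle field; \2 restores the semicolon.
echo 'a;1.5;2.25' | sed -E 's/(\.[0-9]*)?(;|$)/\2/2'
# prints: a;1;2.25
```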