Hello,
OK, that's fine - thanks for confirming that this is not directly related to any academic work or coursework that you need to complete.
So, one possible solution in sed would be this:
sed 's/\(le;[0-9]*\)\.[0-9]/\1/g'
Here, we're using sed's substitution command, s, to search for the following pattern:
\(le;[0-9]*\)\.[0-9]
So we're looking for all occurrences of the characters le; that are immediately followed by a run of digits (strictly speaking, [0-9]* means zero or more of them), then a full stop, then one more digit.
The first bit of this is quite straightforward - the age comes immediately after the sex of the person in the passenger list. That will always (presumably, at any rate, given the age of the data) be marked down as "male" or "female" - in other words, it will always end with "le". Fields are separated with a semicolon, so that's the reason to search for le; here - to clearly mark the boundary between the sex and the age.
Now you might be wondering what's with the brackets here (each one escaped with a backslash in the actual command). They have the effect of marking this part of the pattern we're searching for as a capture group. A capture group is a set of characters in the pattern that we can refer to later by number. And not coincidentally, the portion of the pattern inside the capture group is the one we'll want to keep (i.e. the bit before the decimal point).
After the brackets, we specify the full stop (escaped, so that it matches a literal full stop rather than any character) and a single digit. That's important, because it targets the age, which seems only ever to be given to one decimal place. Other floating-point numbers in your input, such as the ticket price, never come directly after le; and so the pattern leaves them alone. Do note that the pattern does not check what follows that single digit, so a longer decimal sitting directly after le; would still match on its first decimal digit - but that situation does not arise in this data.
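If you'd like to see how strictly the pattern draws that boundary, here's a small sketch you can paste into a shell. It assumes a POSIX-style shell and sed; the variable names and the female;7.4958;S line are made up for illustration and are not from your file. It shows that the pattern does not check what comes after the single decimal digit, and, for comparison, includes a hypothetical stricter variant that also requires the ; that follows the age.

```shell
# Pattern from the answer, plus a hypothetical stricter variant that also
# requires the ';' field separator right after the single decimal digit.
loose='s/\(le;[0-9]*\)\.[0-9]/\1/g'
strict='s/\(le;[0-9]*\)\.[0-9];/\1;/g'

# An age with one decimal place: both variants truncate it.
age_loose=$(printf 'male;28.0;0\n' | sed "$loose")
age_strict=$(printf 'male;28.0;0\n' | sed "$strict")

# A made-up line with a longer decimal directly after "le;".
long_loose=$(printf 'female;7.4958;S\n' | sed "$loose")
long_strict=$(printf 'female;7.4958;S\n' | sed "$strict")

printf '%s\n' "$age_loose" "$age_strict" "$long_loose" "$long_strict"
# male;28;0
# male;28;0
# female;7958;S
# female;7.4958;S
```

The stricter variant relies on the age never being the last field on a line, which holds for the column layout in your sample.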
Now, let's look at what we're going to replace this whole pattern with:
\1
That might seem very strange at first. We don't want to change every one of these ages to 1, do we? No, we don't - and you'll notice that we've escaped the 1 with a backslash. That's because, in the replacement part of a sed substitution, a backslash followed by a digit refers to the capture group with that number rather than to the literal digit.
So this is the reason we made part of our search pattern a capture group earlier. What we're saying to sed here is to replace everything in the previously-given search pattern with just the contents of the first capture group, which, you may recall, was everything prior to the decimal point. The end result, then, is to chop the decimal portion off of all the ages in the input.
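To make the numbering of capture groups a bit more concrete, here's a minimal sketch on a single made-up line. The first command splits the pattern into two groups purely for illustration (the angle brackets are just visible markers, not part of any real solution). The second is the same substitution as above written with extended regular expressions, where the brackets need no backslashes; that assumes your sed accepts the -E option, as GNU and BSD sed do.

```shell
# Two capture groups: \1 holds "le;" and \2 holds the integer part of the age.
two_groups=$(printf 'male;28.0;0\n' | sed 's/\(le;\)\([0-9]*\)\.[0-9]/\1<\2>/')

# Same substitution as in the answer, in extended regex syntax.
# Assumption: this sed supports -E (GNU sed and BSD sed both do).
ere=$(printf 'male;28.0;0\n' | sed -E 's/(le;[0-9]*)\.[0-9]/\1/g')

printf '%s\n' "$two_groups" "$ere"
# male;<28>;0
# male;28;0
```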
Let's see what happens if we run this against the sample data:
$ cat test.txt
PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
431;Yes;Bjornstrom-Steffansson, Mr.Mauritz Hakan;male;28.0;0;0;110564;26.55;C52;S
664;No;3;Coleff, Mr.Peju;male;36.0;0;0;349210;7.4958;;S
$ sed 's/\(le;[0-9]*\)\.[0-9]/\1/g' test.txt
PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
431;Yes;Bjornstrom-Steffansson, Mr.Mauritz Hakan;male;28;0;0;110564;26.55;C52;S
664;No;3;Coleff, Mr.Peju;male;36;0;0;349210;7.4958;;S
$
And there we go - the ages were converted to integers, and the other floating point number, the ticket price, was not affected at all.
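Since sed only prints the result to the terminal and leaves the input file untouched, you'll probably want to keep the output somewhere. Here's a rough sketch that redirects it to a new file. The temporary directory and the file name test_int_ages.txt are just placeholders I've picked for the example, and the input is a cut-down copy of the sample above.

```shell
# Assumption: work in a throwaway directory; the file names are hypothetical.
workdir=$(mktemp -d)
cat > "$workdir/test.txt" <<'EOF'
PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked
664;No;3;Coleff, Mr.Peju;male;36.0;0;0;349210;7.4958;;S
EOF

# sed writes to standard output, so redirect it to keep the result.
sed 's/\(le;[0-9]*\)\.[0-9]/\1/g' "$workdir/test.txt" > "$workdir/test_int_ages.txt"

# Show the converted data line from the new file.
last_line=$(tail -n 1 "$workdir/test_int_ages.txt")
printf '%s\n' "$last_line"
# 664;No;3;Coleff, Mr.Peju;male;36;0;0;349210;7.4958;;S
```

Writing to a separate file keeps the original around in case the pattern needs adjusting.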
Hope this helps! If you have any questions, please let us know and we can take things from there.