Planning on writing a Guide to Working with Large Datasets

In a recent research experiment, I had to manage huge amounts of data, on the order of terabytes, and with the help of many people here I learned quite a lot in the process. I am sure many people will face this situation often, so I am planning to write a general-purpose guide on handling large amounts of data. Please note the following before reading further:

  1. This guide is not intended for a specific dataset, but one or two tips might well be of use to you. :slight_smile:
  2. Some (or most, depending on your level) of the tips may be aimed at the absolute beginner.
  3. If you have some feedback, please don't hesitate to give your suggestions because I realized that if not for the tricks I learnt in this forum, I would've wasted hundreds of man hours.
  4. I will try my best to provide concrete examples wherever possible, but if you find an error somewhere, kindly let me know.
  5. Lastly, as I said, not all of this information is mine: some of it was collected from various sources during my work, some came from the kind help of people here, and some came from my own experience.

The following is an excerpt of the Table of Contents that I am planning for the guide:

Table of Contents

  1. Introduction
  2. Meet your friends - Discover the purpose of each tool
  • PuTTY
  • Screen
  • Bash Scripting
  • Awk
  • Sed
  • Perl
  • PHP
  3. Extremely Useful Commands
  4. Some Concepts you ought to know
  5. Know your enemies - Have the constraints in mind
  6. Downloading and Storing Huge Amounts of Data - Do it carefully or you'll be banned!
  7. Database or not? - Is all the effort really worth it?
  8. Parsing the Mammoth - The time has finally come
  9. Last Minute tips for a Multiprocessor Environment
  10. Things to Avoid - Bust the common myths

I am open to suggestions and would really love some feedback on adding topics to, or removing topics from, the above list.

Any use of Python?

Actually, I was thinking that even PHP was not necessary, but since it is my core expertise, I thought I'd cover where it would be useful. Perl is more regex-centric, so it seems to suffice for most large-dataset processing, but if anyone is kind enough to explain the power of Python, that would be great too! :slight_smile: