Jul 21

What is Hadoop

My recent post on Hadoop may have left some people wondering “WTH is Hadoop?”

Well first, if anything in the world can be called “Cloud Computing”, Hadoop can.

Hadoop is an open source software system that provides two things:

1) A highly scalable, fault-tolerant distributed file system (loosely based on the Google File System)

2) A highly scalable implementation of Google’s MapReduce programming model

It’s free, and it has been in use at Facebook and Yahoo for several years now.

Your next question may be “What is MapReduce?”

MapReduce is a programming model that splits a large amount of data into smaller chunks, processes those chunks in parallel, and then lets the results be sorted and aggregated in various ways. It’s one of the cornerstones of Google’s massive software infrastructure: a system that lets Google process all the incoming data about who is linking to whom, which tags and text are being used, and so on.
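To make the idea concrete, here is a toy, single-machine sketch of the map, group, and reduce steps in plain Python. It is not Hadoop's API, and it does nothing distributed; the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_mapreduce`) and the sample link data are made up for illustration. It counts inbound links per page, in the spirit of the "who is linking to whom" example.

```python
from collections import defaultdict

# Hypothetical helper names; this only illustrates the shape of MapReduce.

def map_phase(chunk):
    """Emit a (target_page, 1) pair for every (source, target) link in one chunk."""
    return [(target, 1) for _source, target in chunk]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values for one key; here, a simple sum."""
    return key, sum(values)

def run_mapreduce(links, chunk_size=2):
    # Split the input into smaller chunks (on a cluster, each chunk could go to a different machine).
    chunks = [links[i:i + chunk_size] for i in range(0, len(links), chunk_size)]
    # Map each chunk independently and collect the emitted pairs.
    emitted = [pair for chunk in chunks for pair in map_phase(chunk)]
    # Group by key, then reduce each group to a single result.
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())

# Made-up sample data: (source page, page it links to)
links = [("a.com", "b.com"), ("c.com", "b.com"), ("a.com", "c.com")]
inbound_counts = run_mapreduce(links)
print(inbound_counts)
```

Because each chunk is mapped on its own and each key is reduced on its own, both steps can in principle be spread across many machines, which is the part a system like Hadoop takes care of.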

Essentially, Hadoop is a cloud-based data analysis tool: something that scales very cost-effectively and can chew through terabytes or even petabytes of data using off-the-shelf hardware and operating systems.
