Business, webdev

New Data Stack Workshop: Building A Scalable Internet Datacenter

Here are a few notes I took during yesterday's conference at Stanford. The organizer (Accel) did a very good job of bringing together some of the leading minds in the newly emerging fields of 'Scalable' and 'Cloud'. Right… all of the speakers were 'Accel companies' – but they are the ones pushing the technology forward – so we're cool with that.

The first company was NorthScale (their chief architect is the same guy we helped a year ago, with some real-world data from the High Gear Media servers). They have a very impressive open-source project – Membase. What is Membase, you ask? Well, "Membase is an open-source, distributed, key-value database management system optimized for storing data behind interactive web applications. These applications must service many concurrent users; creating, storing, retrieving, aggregating, manipulating and presenting data in real time. Supporting these requirements, Membase processes data operations with quasi-deterministic low latency and high sustained throughput."

In a nutshell:

  • Membase is like memcached, but better.
  • Simple – get/set works with any memcached client (see the sketch after this list).
  • Fast – RAM, SSD and disk layers.
  • Predictable latency.
  • Availability – All nodes are created equal.
  • Replication between data centers.
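
The "works with a memcached client" point is the nice part: nothing new is needed on the client side. A minimal sketch, assuming a Membase (or any memcached-compatible) node on localhost:11211 and the python-memcached library; the key and value here are just placeholders:

```python
# Talking to Membase over the plain memcached protocol - no special client needed.
import memcache

client = memcache.Client(["127.0.0.1:11211"])

# Plain get/set, exactly as you would do against memcached.
client.set("user:1001:name", "alice", time=3600)  # optional TTL in seconds
print(client.get("user:1001:name"))               # -> "alice"
```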

The second company was Cloudera – they bring to the table a full Analytical Data Platform (ADP) stack.

  • It’s not only a map/reduce solution but the full stack of ‘watching the wheels moving’ and understanding where you want to steer them.
  • BI is science for profit.
  • A view of the world / answering questions to make money.
  • Pose hypotheses and validate them.
  • A/B testing – feeding the business with real-world hypotheses and their verification.
  • HDFS and map/reduce
  • Jim Gray – ‘The Fourth Paradigm’.
  • HBase – based on Google’s BigTable.
  • Hive – an SQL interface for Hadoop.
    What is Hadoop? Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets (see the word-count sketch after this list).
  • A good paper from Google: ‘The Unreasonable Effectiveness of Data’.
  • Beautiful Data – buy the book.
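
To make the map/reduce part concrete: with Hadoop Streaming the mapper and reducer can be plain scripts that read stdin and write stdout, with Hadoop doing the shuffle/sort in between. Here is a minimal word-count sketch in Python; the file name and the invocation in the comment are assumptions for illustration only:

```python
#!/usr/bin/env python
# Word count in the Hadoop Streaming style: mapper emits (word, 1) pairs,
# Hadoop sorts them by key, and the reducer sums the counts per word.
import sys

def mapper():
    # Emit "<word>\t1" for every word seen on stdin.
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

def reducer():
    # Input arrives sorted by word, so equal keys are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current and current is not None:
            print("%s\t%d" % (current, count))
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

if __name__ == "__main__":
    # e.g. hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" \
    #      -reducer "wordcount.py reduce" -input ... -output ...
    mapper() if sys.argv[1:] == ["map"] else reducer()
```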

Last (for me) was Facebook. Here the story is very simple… scaling from 4-5M users in 2006 up to 400M now puts some real challenges on the development team.

  • Scaling to 400M users (efficiency takes a hit at the beginning).
  • Need all the data all the time.
  • You × 100 friends × 100 objects per friend = tens of thousands of possible objects.
  • Web server / memcached / MySQL – and make sure you can replicate boxes in each layer without any changes to your code (see the caching sketch after this list).
  • Testing – testing and some more unit testing, A/B testing, system testing… you get the point.
  • Push a new version every week. Don’t let your software ‘get old and stale’.
  • Monitor EVERYTHING, and when there is a problem, always do whatever you can to understand the root cause. Yes – even after you have fixed it and everything is working again.
  • No single point of failure (sometimes – your software will be the single point of failure).
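
The web server / memcached / MySQL layering usually boils down to the cache-aside pattern: read through the cache, fall back to the database on a miss, then repopulate the cache. A minimal sketch in Python; the server address, key scheme and the load_user_from_mysql helper are hypothetical placeholders:

```python
# Cache-aside read path for the "Web server / memcached / MySQL" stack.
import json
import memcache

cache = memcache.Client(["127.0.0.1:11211"])

def load_user_from_mysql(user_id):
    # Placeholder for a real MySQL query (e.g. via mysql-connector or MySQLdb).
    raise NotImplementedError

def get_user(user_id):
    key = "user:%d" % user_id
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)               # cache hit: no database round trip
    user = load_user_from_mysql(user_id)        # cache miss: go to MySQL
    cache.set(key, json.dumps(user), time=300)  # repopulate with a short TTL
    return user
```

Because this code only knows about a list of cache servers and a database endpoint, boxes can be added or replaced in either layer without touching the application logic, which is exactly the point Facebook was making.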

Overall, it was a very productive three hours that will make us try a few new open-source projects. Good times.

