Here are a few notes I took during yesterday's conference at Stanford. The organizer (Accel) did a very good job of bringing together some of the leading minds in the new emerging field of 'Scalable' and 'Cloud'. Right… all of the speakers were 'Accel companies' – but they are the ones pushing the technology forward – so we're cool with that.
The first company was NorthScale (their chief architect is the same guy we helped a year ago, with some real-world data from High Gear Media servers). They have a very impressive open source project – Membase. What is Membase, you ask? Well, "Membase is an open-source, distributed, key-value database management system optimized for storing data behind interactive web applications. These applications must service many concurrent users: creating, storing, retrieving, aggregating, manipulating and presenting data in real time. Supporting these requirements, Membase processes data operations with quasi-deterministic low latency and high sustained throughput."
In a nutshell:
- Membase is like memcached, but better.
- Simple – get/set works with any memcached client.
- Fast – RAM, SSD and disk layers.
- Predictable latency.
- Availability – All nodes are created equal.
- Replication between data centers.
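The "get/set works with any memcached client" point is the key selling point. Here is a toy in-process sketch of that memcached-style API surface (set/get/delete with optional TTL) – this is an illustration of the interface, not the real client or server:

```python
import time

class ToyKVStore:
    """Toy sketch of the memcached-style get/set API that Membase
    exposes. In-process dict only; NOT the real client or protocol."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        # memcached-style set: optional time-to-live in seconds
        expires_at = time.time() + ttl if ttl else None
        self._data[key] = (value, expires_at)
        return True

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and time.time() > expires_at:
            del self._data[key]  # lazily expire, memcached-style
            return None
        return value

    def delete(self, key):
        return self._data.pop(key, None) is not None

store = ToyKVStore()
store.set("user:42", {"name": "alice"})
print(store.get("user:42"))  # {'name': 'alice'}
```

Because the real thing speaks the memcached wire protocol, swapping this toy for a production client is a one-line change in application code.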
The second company was Cloudera – they bring to the table a full-stack Analytical Data Platform (ADP).
- It’s not only a map/reduce solution but the full stack of ‘watching the wheels moving’ and understanding where you want to steer them.
- BI is science for profit.
- A view of the world / answering questions to make money.
- Pose hypotheses and validate them.
- A/B testing – feeding the business with real-world hypotheses and their verifications.
- HDFS and map/reduce
- Jim Gray – 'The Fourth Paradigm'.
- HBase – based on Google's BigTable.
- Hive – an SQL interface for Hadoop.
What is Hadoop? Hadoop is the popular open-source implementation of MapReduce, a powerful tool designed for deep analysis and transformation of very large data sets.
- A good paper from Google: "The Unreasonable Effectiveness of Data".
- Beautiful Data – buy the book.
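The map/reduce model behind all of the above can be shown with the canonical word-count example. This is a single-process Python sketch of the two phases – real Hadoop distributes the map tasks, shuffles by key, and runs reducers in parallel:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the counts per word. Hadoop's shuffle groups
    pairs by key between the phases; here a dict does the grouping."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(map_phase(docs)))  # 'the' -> 3, 'fox' -> 2, ...
```

The appeal is that neither function knows or cares how many machines are involved – the framework handles the distribution, which is exactly why it scales.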
Last (for me) was Facebook. Here the story is very simple… scaling from 4-5M users in 2006 up to 400M now puts some serious challenges on the development team.
- Scaling to 400M (efficiency takes a hit at the beginning).
- Need all the data all the time.
- You / 100 friends / 100 objects per friend = tens of thousands of possible objects.
- Web Server / memcached / MySQL – and make sure you can replicate boxes in each layer without any changes to your code.
- Testing – Testing and some more unit testing, A/B Testing, System testing… you got the point.
- Push a new version every week. Don't let your software get old and stale.
- Monitor EVERYTHING, and when there is a problem always do whatever you can to understand the root cause. Yes – even after you've fixed it and everything is working now.
- No single point of failure (sometimes your software itself will be the single point of failure).
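The Web Server / memcached / MySQL layering is usually wired together with the cache-aside pattern: read the cache first, fall back to the database, and invalidate on writes. A minimal sketch, with plain dicts standing in for the memcached and MySQL tiers:

```python
class CacheAsideStore:
    """Toy sketch of the cache-aside pattern used in a
    web server / memcached / MySQL stack. Plain dicts stand in
    for the real memcached and MySQL tiers."""

    def __init__(self, database):
        self.cache = {}           # stands in for memcached
        self.database = database  # stands in for MySQL

    def get(self, key):
        if key in self.cache:           # cache hit: no DB round-trip
            return self.cache[key]
        value = self.database.get(key)  # cache miss: read the DB...
        if value is not None:
            self.cache[key] = value     # ...and warm the cache
        return value

    def set(self, key, value):
        self.database[key] = value
        # Invalidate instead of updating the cache in place; the next
        # read repopulates it, which avoids stale-write races.
        self.cache.pop(key, None)

db = {"profile:1": "alice"}
store = CacheAsideStore(db)
print(store.get("profile:1"))  # miss -> DB read -> cached
print(store.get("profile:1"))  # hit, served from cache
```

Because each tier is accessed only through this narrow interface, you can add boxes to any layer (more web servers, more cache nodes, more replicas) without touching application code – which is the point of that bullet above.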
Overall, it was a very productive 3 hours that will make us try a few new open source projects. Good times.