This week on Google Developers Live Israel we wanted to show the power of BigQuery. What is BigQuery? In today's world, where everyone likes to use the term "big data", you need the capability to query massive datasets. This can be time-consuming and expensive without the right knowledge, hardware, and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure.

To get started quickly and 'test the water', there is a powerful online tool that lets you query pre-existing datasets such as Wikipedia, GitHub, and more. If you prefer to type in a terminal, there is also a command-line tool (bq).

Before you start your first project, you should sign up for BigQuery (yes! it's now open to everyone). Log in to the Google APIs Console, create a new project, and enable the BigQuery API on it. You should also enable billing if you have not done so in the past. Lastly, head to bigquery.cloud.google.com and click one of the public datasets in the left sidebar.
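As a quick taste of the command-line route, here is a sketch of running a query against one of the public sample datasets with the bq tool. It assumes you have already installed the tool and authenticated with your own project:

```shell
# Sketch: query the public Shakespeare sample dataset with the bq tool.
# Assumes bq is installed and you are authenticated against your project.
bq query "SELECT corpus, COUNT(*) AS lines
          FROM [publicdata:samples.shakespeare]
          GROUP BY corpus
          LIMIT 5"
```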
After you select a source, feel free to test it with the browser tool and see what the data looks like. For example, if you wish to see the top 10 most-revised articles on the English version of Wikipedia, type:
SELECT TOP(title, 10) AS title, COUNT(*) AS revision_count
FROM [publicdata:samples.wikipedia]
WHERE wp_namespace = 0;
These are the results I got on 24 July 2013:
| # | Title | Revision count |
|---|-------|----------------|
| 1 | George W. Bush | 43,652 |
| 2 | List of World Wrestling Entertainment employees | 30,572 |
| 7 | Deaths in 2009 | 20,695 |
| 8 | World War II | 20,522 |
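The same pattern is easy to tweak. As a sketch, still against the same public sample table, here is a variation that counts revisions only for article titles containing a given keyword:

```sql
-- Sketch: same top-10 query, restricted to titles containing 'Israel'
-- (legacy BigQuery SQL, as used throughout this post).
SELECT TOP(title, 10) AS title, COUNT(*) AS revision_count
FROM [publicdata:samples.wikipedia]
WHERE wp_namespace = 0
  AND title CONTAINS 'Israel';
```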
HTTP Archive is now part of BigQuery
I've helped this wonderful open-source project in the past, and I'm passionate about the information it contains and the way it lets developers access it. The site provides a number of interesting stats and trends, but the data on the site only scratches the surface. Since the data (which you can download) is now around 400GB, importing it into your own database is no 'walk in the park'. Thanks to BigQuery and Ilya Grigorik, we now have the option to ask questions of this dataset and get answers quickly. Moreover, with the BigQuery API and Apps Script code we can spot trends and run more complicated queries.

To get a pointer to the data and start working with it, click the down arrow beside "API Project": Switch to project -> Display project -> enter "httparchive". Next, open a new Google Sheet and go to 'Tools' -> 'Script Editor'. You can work with this example code that will give you a start for working with BigQuery.
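The linked example code is the best starting point. As a rough sketch of what such a script can look like, the snippet below uses the BigQuery advanced service from Apps Script to run a query and write the rows into the active sheet. Note the assumptions: the BigQuery advanced service must be enabled for the script project, 'YOUR_PROJECT_ID' is a placeholder for your own billing project, and the table name is an assumption about the HTTP Archive dataset layout, not taken from the talk.

```javascript
// Sketch only: assumes the BigQuery advanced service is enabled for this
// script project. 'YOUR_PROJECT_ID' is a placeholder for your own billing
// project, and the table name is an assumed HTTP Archive table.
function runHttpArchiveQuery() {
  var projectId = 'YOUR_PROJECT_ID'; // placeholder
  var request = {
    query: 'SELECT url, rank FROM [httparchive:runs.latest_pages] LIMIT 10'
  };
  var results = BigQuery.Jobs.query(request, projectId);
  var sheet = SpreadsheetApp.getActiveSheet();
  // Each result row arrives as {f: [{v: ...}, ...]}; flatten it for the sheet.
  (results.rows || []).forEach(function (row) {
    sheet.appendRow(row.f.map(function (cell) { return cell.v; }));
  });
}
```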
Got interesting questions for HTTP Archive?
There are many ways to answer your questions. The powerful aspect of Apps Script is the automation: we can run a job and later get the results in our Google Sheet, which gives us a nice way to share them with co-workers, friends, etc. We can take it a step further and set a custom trigger that runs our 'dashboard' on a daily basis, crunches the numbers, and emails us the results. A few examples of interesting queries on HTTP Archive:
- How fast are G+ / FB / Twitter over time?
- What are the top 10 sites whose resources are most used by other sites?
- What is the most popular JS framework?
- Which site in the top 100 is the slowest? Does that correlate with its size?
- What is the correlation between JS and CSS sizes and site speed?
- What is the % of error pages on the top 25 fastest sites in the world?
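The daily 'dashboard' idea above can be sketched with Apps Script's built-in triggers and mail service. In this sketch, runDashboard is a hypothetical function name, and the hour and email address are placeholders:

```javascript
// Sketch: schedule a hypothetical runDashboard() function to run daily,
// then email a short summary when it finishes.
function setupDailyDashboard() {
  ScriptApp.newTrigger('runDashboard') // 'runDashboard' is a placeholder name
    .timeBased()
    .everyDays(1)
    .atHour(6)                         // around 06:00 in the script's time zone
    .create();
}

function runDashboard() {
  // ... run the BigQuery job and update the sheet here ...
  MailApp.sendEmail('me@example.com',  // placeholder address
                    'HTTP Archive daily dashboard',
                    'The numbers have been crunched; check the sheet.');
}
```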
Useful links:
- BigQuery getting-started page on developers.google.com
- Apps Script code for working with BigQuery
- Ilya's post on HTTP Archive and BigQuery