HTML5, JavaScript, webdev

Big Query Power With JavaScript

Big Query and App script logoThis week on Google developers live Israel we wanted to show the power of Big Query. What is Big Query? Well, in todays world when everyone like to use the term “big data” you need to have the capabilities to querying massive datasets. This can be time consuming and expensive without the right knowledge, hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. In order to get started quickly and ‘test the water’ there is a powerful online tool that let you query pre-existing datasets like: wikipedia, Github etc’. If you like to type in command line, there is also a command line tool. Before you start your first project you should signup for BigQuery (yes! it’s open now for all). You should log in to the Google APIs Console and make sure you set a new project and allow Big Query API on it. You should also, enable billing if you have not done so in the past. Lastly, head to and click on one of the public datasets that are on the left sidebar.

big query public data sources

After you select a source – feel free to test it with the browser tool and see how the data looks like. For example: If you wish to see what are the top 10 revised articles on the english version of wikipedia type:

SELECT TOP(title, 10) as title, COUNT(*) as revision_count FROM [publicdata:samples.wikipedia] WHERE wp_namespace = 0;

These are the results I got on 24/7/2013:

Row title revision_count
1 George W. Bush 43,652
2 List of World Wrestling Entertainment employees 30,572
3 Wikipedia 29,726
4 United States 27,433
5 Michael Jackson 23,245
6 Jesus 21,767
7 Deaths in 2009 20,695
8 World War II 20,522
9 Britney Spears 20,459
10 Wii 20,199

HttpArchive is now part of Big Query


I’ve been helping this wonderful open-source project in the past and I passion about the information its contain and the way it allow developers to access it. The site provides a number interesting stats and trends, but the data on the site only scratches the surface. Since the data (which you can download) is now around 400GB, it won’t be a ‘walk in the park’ to export-import it to your own database. Thanks to big query and Ilya Grigorik we have now the option to ask questions on this dataset and get results quickly. Moreover, with big query API and App Script code we can see trends and have queries that are more complicated. In order to have a pointer to the data and to work with it, you should click down the arrow beside “API Project”: Switch to project -> Display project -> enter “httparchive”. Next step, open a new google sheet and go to ‘Tools’ -> ‘Script Editor’. You can work with this example code that will give you a start for working with big query.


Google Developer Live Israel

Got interesting questions for httparchive?

There are many options to answer you questions. The powerful aspect of Apps Script, is in the automation. We can run a job and later get the results in our google sheet. So we have a nice way to share these results with co-workers, friends etc’. We can take it a step forward, and have a customize trigger that will run our ‘dashboard’ on a daily bases, crunch the numbers and email us the results. Few examples for interesting queries on http archive:

  1. How fast is G+ / FB  / Twitter  over time?
  2. What are the top 10 sites that are being used by others?
  3. What is the most popular JS framework?
  4. Which site from the top 100 is the slower? Does it correlate with its size?
  5. What is the correlation between JS, CSS sizes and the site speed?
  6. What is the % of error pages on the top 25 fastest sites in the world?

Code Example(s)

function runQueryThirdPartyUsage(startRow, startCol) {
if (startRow === undefined) startRow = 25;
if (startCol === undefined) startCol = 1;
var projectNumber = 'bq-httparchive-1';
var sheet = SpreadsheetApp.getActiveSheet();
var dates = ['2013_06','2013_05','2013_04','2013_03','2013_02','2013_01',
var sql = 'SELECT date, third_party, num_requests FROM ';
for (var i=0; i<dates.length; i++) {
sql += '(SELECT "'+dates[i]+'" date, DOMAIN(req.url) third_party, COUNT(*) num_requests FROM [httparchive:runs.'+dates[i]+'_01_requests] as req JOIN ( \
SELECT DOMAIN(url) self, pageid \
FROM [httparchive:runs.'+dates[i]+'_01_pages] \
) as pages ON pages.pageid = req.pageid \
WHERE DOMAIN(req.url) != pages.self \
GROUP BY third_party ORDER BY num_requests desc LIMIT 10),'
var queryResults;
// Inserts a Query Job
try {
Logger.log("Our SQL: " + sql);
queryResults = BigQuery.Jobs.query(projectNumber, sql);
catch (err) {
// Check on status of the Query Job
while (queryResults.getJobComplete() == false) {
try {
queryResults = BigQuery.Jobs.getQueryResults(projectNumber, queryJob.getJobReference().getJobId());
catch (err) {
// Update the amount of results
var resultCount = queryResults.getTotalRows();
var resultSchema = queryResults.getSchema();
var resultValues = new Array(resultCount);
var tableRows = queryResults.getRows();
// Iterate through query results
for (var i = 0; i < tableRows.length; i++) {
var cols = tableRows[i].getF();
resultValues[i] = new Array(cols.length);
// For each column, add values to the result array
for (var j = 0; j < cols.length; j++) {
resultValues[i][j] = cols[j].getV();
// Update the Spreadsheet with data from the resultValues array, starting from cell A1
sheet.getRange(startRow, startCol, resultCount, tableRows[0].getF().length).setValues(resultValues);
Browser.msgBox("Done! All the answers should be on the sheet");
SELECT pages.pageid, url, cnt, libs, pages.rank rank FROM [httparchive:runs.2013_06_01_pages] as pages JOIN (
SELECT pageid, count(distinct(type)) cnt, GROUP_CONCAT(type) libs FROM (
SELECT REGEXP_EXTRACT(url, r'(dojo|angular|prototype|backbone|emberjs|sencha|scriptaculous).*\.js') type, pageid
FROM [httparchive:runs.2013_06_01_requests]
WHERE REGEXP_MATCH(url, r'dojo|angular|prototype|backbone|emberjs|sencha|scriptaculous.*\.js')
GROUP BY pageid, type
GROUP BY pageid
HAVING cnt >= 2
) as lib ON lib.pageid = pages.pageid
ORDER BY rank asc
SELECT DOMAIN(req.url) third_party, COUNT(*) num_requests
FROM [httparchive:runs.2013_06_01_requests] as req JOIN (
SELECT DOMAIN(url) self, pageid
FROM [httparchive:runs.2013_06_01_pages]
) as pages ON pages.pageid = req.pageid
WHERE DOMAIN(req.url) != pages.self
GROUP BY third_party
ORDER BY num_requests desc LIMIT 10;



4 thoughts on “Big Query Power With JavaScript

  1. Pingback: Big Query And App Script (GDL Israel) - TI Nacional

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s