
Big Query Basics

In today’s talk, I covered one of the coolest tools in the “big data” world – BigQuery. What is BigQuery? Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. BigQuery solves this problem by enabling super-fast, SQL-like queries using the processing power of Google’s infrastructure. If you wish to play with a cool BQ demo, just click on the picture on the left.

When to use it?

There are a few clear use cases in the analytics landscape where BigQuery is the right tool for the job:

  • It’s a great option when your MapReduce-based analysis (i.e. a Hadoop system) is too slow for ad-hoc queries.
  • If you don’t wish to manage large data centers and tune Hadoop all day long.
  • It seems that analytics tools should be services in 2014, no?

Common use cases

  • Log Analysis – Making sense of computer-generated records. Think about mobile games, application logs and any other files that contain terabytes of text data.
  • Shopping Data and Retailers – Using data to forecast product sales.
  • Ads Targeting – Targeting the right customer segments.
  • Sensor Data – Collecting and visualizing ambient data. Who said Nest? Or if you visited us at Google I/O 2013, there was a big demo there analyzing the big data that many sensors produced.
  • Data Mashup – Query terabytes of heterogeneous data from different sources.

Accessing BigQuery

One of the quickest ways to access BQ is the browser tool. Developers who like the command line can try the bq command-line tool, and if you wish to write code, there is the BigQuery API. It’s a RESTful API that lets you access BigQuery programmatically. It requires authorization via OAuth2. You can use the Google client libraries for Python, Java, JavaScript, PHP and more.
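For developers who want to see what a programmatic call looks like, here is a minimal Python sketch that runs a synchronous query through the BigQuery REST API (v2) with the google-api-python-client library. It assumes Application Default Credentials are already configured; ‘my-project-id’ is a placeholder, and the query uses one of the public sample tables.

    # Minimal sketch: run a synchronous query via the BigQuery REST API (v2).
    # Assumes google-api-python-client and google-auth are installed and that
    # Application Default Credentials are set up. 'my-project-id' is a placeholder.
    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/bigquery'])
    bigquery = build('bigquery', 'v2', credentials=credentials)

    response = bigquery.jobs().query(
        projectId='my-project-id',
        body={'query': 'SELECT corpus, SUM(word_count) AS total_words '
                       'FROM [publicdata:samples.shakespeare] '
                       'GROUP BY corpus ORDER BY total_words DESC LIMIT 5'}
    ).execute()

    # Each result row comes back as {'f': [{'v': value}, ...]}.
    for row in response.get('rows', []):
        print([cell['v'] for cell in row['f']])

The same jobs() collection also exposes asynchronous jobs (jobs.insert) for load jobs and long-running queries.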

Best Practices

  • CSV/JSON files must be split into chunks of less than 1TB
    • Use the “split” command with the --line-bytes option to break your data into smaller files. This gives you:
      • Easier error recovery
      • Smaller data units (a day or a month instead of a year)
  • Take a good sample of your data and upload this ‘test’ file first to verify that you are not getting any errors. It will save you from small mistakes that could break your upload to BQ. You don’t want to find out after pushing 765GB of data that you are getting a ‘few’ errors and need to repeat the whole push.
  • Uploading to Cloud Storage first is recommended. When you upload multiple files to Cloud Storage, you can push them into BQ in parallel (see the load-job sketch after this list).
  • Split Tables by Dates
    • Minimize cost of data scanned
    • Minimize query time
  • Run a count query to get an estimate of the table size before you run your real queries (see the sketch after this list).
  • Denormalize your data – think about performance first.
  • You can use Google Sheets and Apps Script to create a powerful UI. Here are two posts I wrote last year on this topic.
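To make a couple of these tips concrete, here is a hedged Python sketch (using the same google-api-python-client setup as the earlier example) that starts a load job from Cloud Storage into a per-day table and then runs a COUNT(*) to size it before heavier queries. The bucket, dataset and table names are placeholders, not from the original post.

    # Sketch: load CSV files that already sit in Cloud Storage into a per-day
    # table (logs_20140101), then run a cheap COUNT(*) to gauge its size.
    # 'my-project-id', 'my-bucket' and 'my_dataset' are placeholder names.
    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/bigquery'])
    bigquery = build('bigquery', 'v2', credentials=credentials)

    project = 'my-project-id'
    table_id = 'logs_20140101'  # one table per day keeps scans (and cost) small

    # jobs.insert with a 'load' configuration starts an asynchronous load job;
    # in real code you would poll jobs().get() until its status is DONE.
    load_job = bigquery.jobs().insert(
        projectId=project,
        body={'configuration': {'load': {
            'sourceUris': ['gs://my-bucket/logs/2014-01-01/part-*.csv'],
            'destinationTable': {'projectId': project,
                                 'datasetId': 'my_dataset',
                                 'tableId': table_id},
            'sourceFormat': 'CSV',
            'skipLeadingRows': 1,
            'autodetect': True,
        }}}
    ).execute()
    print('Started load job:', load_job['jobReference']['jobId'])

    # Once the load is done, a COUNT(*) gives an estimate before real queries.
    count = bigquery.jobs().query(
        projectId=project,
        body={'query': 'SELECT COUNT(*) FROM [my_dataset.%s]' % table_id}
    ).execute()
    print('Rows in table:', count['rows'][0]['f'][0]['v'])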

The slides

Be strong.


2 thoughts on “Big Query Basics”

    • greenido says:

      BigQuery tables are append-only so if you would like to update or delete data you will need to delete the table, then recreate the table with new data. Alternatively, you could write a query that modifies the data and specify a new results table.
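As a rough illustration of that last suggestion, here is a hedged sketch of the “rewrite into a new results table” approach, with the same client setup as the sketches in the post (all table and project names are placeholders):

    # Sketch: instead of updating rows in place, select everything except the
    # rows you want to drop and write the result to a destination table.
    # Table, dataset and project names are placeholders.
    import google.auth
    from googleapiclient.discovery import build

    credentials, _ = google.auth.default(
        scopes=['https://www.googleapis.com/auth/bigquery'])
    bigquery = build('bigquery', 'v2', credentials=credentials)

    job = bigquery.jobs().insert(
        projectId='my-project-id',
        body={'configuration': {'query': {
            'query': 'SELECT * FROM [my_dataset.logs_20140101] '
                     "WHERE status != 'bad_record'",
            'destinationTable': {'projectId': 'my-project-id',
                                 'datasetId': 'my_dataset',
                                 'tableId': 'logs_20140101_clean'},
            'writeDisposition': 'WRITE_TRUNCATE',
            'allowLargeResults': True,
        }}}
    ).execute()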
