
Upload Data To BigQuery With A Smile

What is BigQuery?

In our world of ‘Big Data’ it can be time-consuming and expensive to query massive datasets without the right infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure.

What do you need to do?

  1. Move your data into BigQuery – This is what we will do in this post.
  2. Let Google BigQuery handle the hard work.
  3. Query your big data with a smile, in a cost-effective way.

How to upload data to BigQuery?

There are two main approaches: stream your data, or load it directly from Google Cloud Storage. Let’s have a look at the steps to leverage Google Cloud Storage in order to upload data into BigQuery.
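As a rough illustration of the two options, here is a minimal sketch using the bq and gsutil command-line tools. The dataset, table, bucket and file names (mydataset.pages, gs://my-bucket, rows.json) and the schema file are hypothetical placeholders, not names from the script below.

# Option 1: load directly from Google Cloud Storage (the approach used in this post)
gsutil cp pages.csv.gz gs://my-bucket/pages.csv.gz
bq load mydataset.pages gs://my-bucket/pages.csv.gz ./schema/pages.json

# Option 2: stream rows into an existing table (newline-delimited JSON)
bq insert mydataset.pages rows.json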

The main steps you need to follow:

  1. Prepare your data. At this stage, analyze your data and decide which format suits it best (both JSON and CSV are supported).
  2. Upload it to Google Cloud Storage. In our example we work with CSV files: we upload them to Google Cloud Storage and then submit a BigQuery load job that pulls the data into BigQuery automatically.
  3. Run a ‘sanity’ check to see that the new data is in good shape (optional step; see the sketch after this list).
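For the optional sanity check, something as simple as the following is usually enough. The table name runs.2013_04_15_pages is a hypothetical example matching the tables the script below creates.

# Confirm the table exists and inspect its schema and row count
bq show runs.2013_04_15_pages
# Preview a few rows
bq head -n 5 runs.2013_04_15_pages
# Count rows and compare against the source file
bq query "SELECT COUNT(*) FROM [runs.2013_04_15_pages]"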

Tips

  • Upload your data to a project with a good name (the default project names are not very clear in most cases).
  • Consider breaking your data up (e.g. monthly tables instead of a single big one); it will make it easier to update, query and maintain the data source in the future (see the example after these tips).
  • Have an example dataset with data that reflects the popular cases. This gives developers a chance to ‘play’ with the data and see its value.
  • Think of some good, bold examples. A few sample queries are crucial to get people started on a dataset.
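For example, breaking the data into monthly (or per-crawl) tables does not stop you from querying several of them at once. The sketch below assumes legacy BigQuery SQL of that era and hypothetical table names in the style the script below creates.

# In legacy SQL, a comma in FROM is a table union (UNION ALL), so monthly tables combine easily
bq query "SELECT COUNT(*) FROM [runs.2013_03_15_pages], [runs.2013_04_15_pages]"
# A starter example to share with developers: preview a handful of rows
bq head -n 10 runs.2013_04_15_pages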



Code

#!/bin/bash
#
# Imports an HTTP Archive crawl into BigQuery via Google Cloud Storage.
# Credit: igrigorik - https://github.com/igrigorik/httparchive-bigquery

DATA=$HOME/archive
BASE=$(pwd)

if [ -n "$1" ]; then
  archive=$1
  # In case we upload the mobile data to BQ
  if [[ $archive == *mobile* ]]; then
    mobile=1
    adate=${archive#mobile_}
  else
    # In all other cases, we upload the 'web' data to BQ
    mobile=0
    adate=$archive
  fi
  echo "Processing $adate, mobile: $mobile, archive: $archive"
else
  echo "Must provide date, eg. Apr_15_2013"
  exit 1
fi

mkdir -p $DATA/processed/$archive

echo -e "Downloading data for $archive"
cd $DATA
# Here you can download your data from your API.
# In case you already have the data locally - just ignore the wget lines.
wget -nv -N "http://www.archive.org/download/httparchive_downloads_${adate}/httparchive_${archive}_pages.csv.gz"
wget -nv -N "http://www.archive.org/download/httparchive_downloads_${adate}/httparchive_${archive}_requests.csv.gz"

if [ ! -f processed/${archive}/pages.csv.gz ]; then
  echo -e "Converting pages data"
  # Replace \N (NULL) markers with empty strings and fix escaped quotes for BigQuery's CSV parser.
  # Note: ^M is a literal carriage return (typed as Ctrl-V Ctrl-M), not a caret followed by M.
  gunzip -c "httparchive_${archive}_pages.csv.gz" \
    | sed -e 's/\\N,/"",/g' -e 's/^M//g' -e 's/\\N$/""/g' -e 's/\\"/""/g' -e 's/\\"","/\\\\","/g' \
    | gzip > "processed/${archive}/pages.csv.gz"
else
  echo -e "Pages data already converted, skipping."
fi

if ls processed/${archive}/requests_* &> /dev/null; then
  echo -e "Request data already converted, skipping."
else
  echo -e "Converting requests data"
  # Ilya magic... The important part is to split the data into 8M-line gzipped chunks
  # so each file can be uploaded and loaded into BigQuery separately. 🙂
  gunzip -c "httparchive_${archive}_requests.csv.gz" \
    | sed -e 's/\\N,/"",/g' -e 's/\\N$/""/g' -e 's/\\"/""/g' -e 's/\\"","/\\\\","/g' \
    | split --lines=8000000 --filter='pigz - > $FILE.gz' - processed/$archive/requests_
fi

cd processed/${archive}

# Derive the table names from the crawl date, e.g. runs.2013_04_15_pages
table=$(date --date="$(echo $adate | sed "s/_/ /g" -)" "+%Y_%m_%d")
ptable="runs.${table}_pages"
rtable="runs.${table}_requests"

echo -e "Syncing data to Google Storage - Ya! It's that easy."
gsutil cp -n * gs://httparchive/${archive}/

if [[ $mobile == 1 ]]; then
  ptable="${ptable}_mobile"
  rtable="${rtable}_mobile"
  #nosync="--nosync"
  nosync=""
else
  nosync=""
fi

# --nosync submits the load job and returns without waiting for it to finish.
echo -e "Submitting new pages import (${ptable}) to BigQuery. Simple. Please smile now."
bq --nosync load $ptable gs://httparchive/${archive}/pages.csv.gz $BASE/schema/pages.json

# The first requests file creates the table (with the JSON schema);
# the remaining files are appended to it and reuse the existing schema.
first=1
for f in $(ls -r requests_*); do
  if [[ $first == 1 ]]; then
    echo "Submitting new requests import (${rtable}) to BigQuery: $f"
    bq $nosync load $rtable gs://httparchive/${archive}/$f $BASE/schema/requests.json
    first=0
  else
    echo "Submitting append requests import (${rtable}) to BigQuery: $f"
    bq --nosync load $rtable gs://httparchive/${archive}/$f
  fi
done

cd $BASE
echo "Done! --> All credit to: igrigorik! The source can be found: https://github.com/igrigorik/httparchive-bigquery"
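To run it, save the script (as import-2-bq.sh, the name used in the original gist) and pass the crawl date as the only argument; a mobile_ prefix switches it to the mobile tables:

chmod +x import-2-bq.sh
./import-2-bq.sh Apr_15_2013          # web data for that crawl
./import-2-bq.sh mobile_Apr_15_2013   # mobile data for the same crawl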

Other posts on BigQuery

Scripts to upload data to BigQuery
