2019
Modifying Rowkey (Schema) in Bigtable using Dataflow
Cloud Bigtable is a petabyte-scale, fully managed NoSQL database service in GCP for large analytical and operational workloads. It supports the open-source, industry-standard HBase API and integrates with graph databases, time-series databases, and geospatial databases. Bigtable was initially released in 2005 but was not available to the general public until 2015. Apache HBase was created based on Google's publication "Bigtable: A Distributed Storage System for Structured Data", with its initial release in 2008.
Running Spark on Dataproc and loading to BigQuery using Apache Airflow
Apache Airflow is a popular open-source orchestration tool with many connectors to popular services and all major clouds. This blog post showcases an Airflow pipeline that automates the flow from data arriving in Google Cloud Storage, through Dataproc cluster administration and running Spark jobs, to loading the output of the Spark jobs into Google BigQuery.
2018
Convert CSV to Parquet on Cloud Dataproc using Hive External table
We were recently working with a leading international voice carrier headquartered in the US that wanted to build a data warehouse on Google BigQuery.