Building Your First ETL Pipeline with Bash


Introduction

ETL, or Extract, Transform, Load, is a core data engineering process: data is extracted from various sources, transformed into a workable form, and loaded into a destination such as a database. ETL pipelines automate this process so that data is handled consistently and efficiently, providing a foundation for tasks like data analysis, reporting, and machine learning, and ensuring the data is clean, reliable, and ready to use.

Bash, short for Bourne-Again Shell, is a Unix shell and a powerful tool for building ETL pipelines thanks to its simplicity, flexibility, and near-universal availability, which makes it an excellent option for novices and seasoned pros alike. Bash scripts can automate tasks, move files around, and drive other command-line tools, all of which makes it a good fit for ETL work. Moreover, Bash ships with virtually every Unix-like system (Linux, BSD, macOS, etc.), so it is ready to use on most such systems with no extra setup on your part.

This article is intended for beginner and practitioner data scientists and data engineers who are looking to build their first ETL pipeline. It assumes a basic understanding of the command line and aims to provide a practical guide to creating an ETL pipeline using Bash.

The goal of this article is to guide readers through the process of building a basic ETL pipeline using Bash. By the end of the article, readers will have a working understanding of implementing an ETL pipeline that extracts data from a source, transforms it, and loads it into a destination database.

Setting Up Your Environment

Before we begin, ensure you have the following:

  • A Unix-based system (Linux or macOS)
  • Bash shell (usually pre-installed on Unix systems)
  • Basic understanding of command-line operations

For our ETL pipeline, we will need these specific command line tools:

  • curl
  • jq
  • awk
  • sed
  • sqlite3

Most of these tools (awk, sed, and usually curl) come pre-installed on Unix-like systems; the rest can be installed with your system's package manager. On a Debian-based system, you can use apt-get:

sudo apt-get install curl jq sqlite3

On macOS, curl and sqlite3 already ship with the OS, so Homebrew only needs to provide jq:

brew install jq
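
To confirm everything is available before moving on, you can loop over the required commands with a quick check like this (a minimal sketch; adjust the tool list to match your setup):

# Verify that all required tools are on the PATH
for tool in curl jq awk sed sqlite3; do
  command -v "$tool" >/dev/null 2>&1 || echo "Missing: $tool"
done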

Let's set up a dedicated directory for our ETL project. Open your terminal and run:

mkdir ~/etl_project
cd ~/etl_project

This creates a new directory called etl_project and navigates into it.

Extracting Data

Data can come from various sources such as APIs, CSV files, or databases. For this tutorial, we'll demonstrate extracting data from a public API and a CSV file.

Let's use curl to fetch data from a public API. For example, we'll extract data from a mock API that provides sample data.

# Fetching data from a public API
curl -o data.json "https://api.example.com/data"

This command will download the data and save it as data.json.
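
In a script, it also helps to make the download fail loudly rather than silently saving an error page. Here is a minimal sketch using the same placeholder URL, with curl's fail/silent/show-error flags and an exit-code check:

# Fail on HTTP errors and stop the pipeline if extraction breaks
curl -fsS -o data.json "https://api.example.com/data" || {
  echo "Extraction failed" >&2
  exit 1
}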

We can also use curl to download a CSV file from a remote server.

# Downloading a CSV file
curl -o data.csv "https://example.com/data.csv"

This will save the CSV file as data.csv in our working directory.
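
Before transforming anything, it is worth a quick sanity check that the downloads actually contain data, for example:

# Peek at the first few lines and count the records
head -n 5 data.csv
wc -l data.csv data.json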

Transforming Data

Data transformation is necessary to convert raw data into a format suitable for analysis or storage. This may involve parsing JSON, filtering CSV files, or cleaning text data.

jq is a powerful tool for working with JSON data. Let's use it to extract specific fields from our JSON file.

# Parsing and extracting specific fields from JSON
jq '.data[] | {id, name, value}' data.json > transformed_data.json

This command extracts the id, name, and value fields from each entry in the JSON data and saves the result in transformed_data.json.
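
If you prefer CSV output at this stage, for instance to feed the SQLite import shown later, jq can emit it directly. The following is a sketch that assumes the API response exposes a .data array whose objects carry id, name, and value fields; the output file name is just illustrative:

# Convert the JSON records to CSV rows
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_from_json.csv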

awk is a versatile tool for processing CSV files. We'll use it to extract specific columns from our CSV file.

# Extracting specific columns from CSV
awk -F, -v OFS=',' '{print $1, $3}' data.csv > transformed_data.csv

This command extracts the first and third columns from data.csv, keeps them comma-separated, and saves them in transformed_data.csv.
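
If your source CSV includes a header row, you may also want to drop it during this step. A small variation on the command above (assuming the same hypothetical column layout):

# Skip the header row (NR > 1) while selecting columns 1 and 3
awk -F, -v OFS=',' 'NR > 1 {print $1, $3}' data.csv > transformed_data.csv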

sed is a stream editor for filtering and transforming text. We can use it to perform text replacements and clean up our data.

# Replacing text in a file
sed -i 's/old_text/new_text/g' transformed_data.csv

This command replaces all occurrences of old_text with new_text directly in transformed_data.csv (the -i flag edits the file in place).
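
Note that -i behaves slightly differently on macOS, where the BSD version of sed requires an explicit (possibly empty) backup suffix:

# BSD/macOS sed: pass an empty backup suffix to edit in place
sed -i '' 's/old_text/new_text/g' transformed_data.csv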

Loading Data

Common destinations for loading data include databases and files. For this tutorial, we'll use SQLite, a commonly used lightweight database.

First, let's create a new SQLite database and a table to hold our data.

# Creating a new SQLite database and table
sqlite3 etl_database.db "CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"

This command creates a database file named etl_database.db and a table named data with three columns.

Next, we'll insert our transformed data into the SQLite database.

# Inserting data into SQLite database
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

This block of commands sets the mode to CSV and imports transformed_data.csv into the data table.
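
Keep in mind that .import loads every line of the file, including any header row, and expects the CSV's columns to match the table definition. Recent versions of the sqlite3 shell can skip the header during import; here is a sketch, assuming transformed_data.csv carries id, name, and value columns preceded by a header line:

# Import while skipping the first (header) line
sqlite3 etl_database.db <<EOF
.import --csv --skip 1 transformed_data.csv data
EOF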

We can verify that the data has been inserted correctly by querying the database.

# Querying the database
sqlite3 etl_database.db "SELECT * FROM data;"

This command retrieves all rows from the data table and displays them.
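
To run the whole flow end to end, the individual commands can be collected into a single script. The following is a minimal sketch that mirrors the steps above; the URL, the assumed .data[] JSON shape, and the file names are placeholders from this tutorial, so adapt them to your own source:

#!/usr/bin/env bash
# etl_pipeline.sh - a minimal extract-transform-load sketch
set -euo pipefail

# Extract: fetch raw JSON from the placeholder API
curl -fsS -o data.json "https://api.example.com/data"

# Transform: flatten the assumed .data[] records into CSV rows
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_data.csv

# Load: create the table if needed, then import the CSV
sqlite3 etl_database.db "CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

echo "ETL run complete: $(sqlite3 etl_database.db 'SELECT COUNT(*) FROM data;') rows loaded."

Saving this as etl_pipeline.sh and marking it executable with chmod +x etl_pipeline.sh lets you rerun the whole pipeline with a single command.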

Final Thoughts

While building our ETL pipeline with Bash, we covered the following steps:

  1. Environment setup and tool installation
  2. Data extraction from a public API and CSV file with curl
  3. Data transformation using jq, awk, and sed
  4. Data loading into an SQLite database with sqlite3

Bash is a good choice for ETL due to its simplicity, flexibility, automation capabilities, and interoperability with other CLI tools.

For further exploration, think about incorporating error handling, scheduling the pipeline via cron (see the example below), or learning more advanced Bash concepts. You may also wish to investigate alternative transformation tools and techniques to broaden your pipeline skill set.
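
As one example of scheduling, a crontab entry along these lines (the path is a placeholder for wherever you saved the script) would run the pipeline every morning at 2 a.m.:

# Run the ETL script daily at 02:00 and keep a log for debugging
0 2 * * * /path/to/etl_project/etl_pipeline.sh >> /path/to/etl_project/etl.log 2>&1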

Try out your own ETL projects, putting what you have learned to the test in more elaborate scenarios. With some luck, the basic concepts here will serve as a good jumping-off point to more complex data engineering tasks.

Matthew Mayo (@mattmayo13) holds a Master's degree in computer science and a graduate diploma in data mining. As Managing Editor, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, machine learning algorithms, and exploring emerging AI. He is driven by a mission to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.
