In recent years, interest in big data analysis has grown. Executives, managers, and other business stakeholders use Business Intelligence (BI) to make informed decisions. It allows them to analyse critical information immediately, and to base decisions not only on their intuition but also on what they can learn from their customers’ real behaviour.
When you decide to create an effective and informative BI solution, one of the very first steps that your development team needs to take is to plan the data pipeline architecture. Several cloud-based tools can be used to build such a pipeline, and no single solution is best for all businesses. Before you decide on a particular option, you should consider your current tech stack, the pricing of the tools, and the skill set of your developers. In this article, I will show an architecture built with AWS tools that has been successfully deployed as part of the Timesheets application.
Timesheets is a tool to track and report employee time. It can be used via web, iOS, Android and desktop applications, a chatbot integrated with Hangouts and Slack, and an action on Google Assistant. Since there are many types of apps available, there is also a wide variety of data to track. The data is collected via Revolt Analytics, stored in Amazon S3, and processed with AWS Glue and Amazon SageMaker. The results of the analysis are stored in Amazon RDS and are used to build visual reports in Google Data Studio. This architecture is presented in the graph above.
In the following paragraphs, I will briefly describe each of the Big Data tools used in this architecture.
Revolt can be installed on any infrastructure you choose. This approach gives you total control over costs and tracked events. In the Timesheets case presented in this article, it was built on AWS infrastructure. Thanks to full access to the data storage, product owners can easily get insights into their application and use that data in other systems.
Revolt SDKs are added to every component of the Timesheets system, which consists of:
- Android & iOS apps (built with Flutter)
- Desktop app (built with Electron)
- Web app (written in React)
- Backend (written in Golang)
- Hangouts and Slack online chats
- Action on Google Assistant
Revolt gives Timesheets administrators knowledge about devices (e.g. device brand, model) and systems (e.g. OS version, language, timezone) used by the app’s customers. Furthermore, it sends various custom events associated with users’ activity in the apps. Consequently, the administrators can analyse user behaviour, and better understand their objectives and expectations. They can also verify the usability of the implemented features, and assess if these features meet the Product Owner’s assumptions about how they would be used.
AWS Glue is an ETL (extract, transform, and load) service that helps prepare data for analytical tasks. It runs ETL jobs in a serverless Apache Spark environment. A typical AWS Glue setup consists of the following three elements:
- Crawler definition – A crawler scans data in all kinds of repositories and sources, classifies it, extracts its schema information, and stores the resulting metadata in the Data Catalog. It can, for example, scan logs stored in JSON files on Amazon S3, and store their schema information in the Data Catalog.
- Job script – AWS Glue jobs transform data into the desired format. AWS Glue can automatically generate a script to load, clean, and transform your data. You can also provide your own Apache Spark script, written in Python or Scala, that runs the desired transformations. These could include tasks like handling null values, sessionization, aggregations, etc.
- Triggers – Crawlers and jobs can be run on demand, or can be set up to start when a specified trigger occurs. A trigger can be a time-based schedule or an event (e.g. a successful execution of a specified job). This option gives you the ability to effortlessly manage data freshness in your reports.
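To make the job-script tasks listed above more concrete, here is a minimal sketch of null handling, sessionization, and a simple aggregation. It is written in plain Python rather than Spark so that it runs anywhere. The field names (`user_id`, `timestamp`, `name`), the sample events, and the 30-minute inactivity gap are assumptions made for illustration, not details of the Timesheets jobs.

```python
from collections import Counter
from datetime import datetime, timedelta

# Assumption: a session ends after 30 minutes without any event.
SESSION_GAP = timedelta(minutes=30)


def clean_events(events):
    """Null handling: drop events without a user or timestamp, parse the rest."""
    cleaned = []
    for event in events:
        if not event.get("user_id") or not event.get("timestamp"):
            continue
        parsed = datetime.fromisoformat(event["timestamp"])
        cleaned.append({**event, "timestamp": parsed})
    return cleaned


def sessionize(events, gap=SESSION_GAP):
    """Sessionization: number each user's sessions, splitting on long gaps."""
    ordered = sorted(events, key=lambda e: (e["user_id"], e["timestamp"]))
    result = []
    previous_user, previous_time, session = None, None, 0
    for event in ordered:
        if event["user_id"] != previous_user:
            session = 0  # first event of a new user
        elif event["timestamp"] - previous_time > gap:
            session += 1  # same user, but the gap was too long
        previous_user, previous_time = event["user_id"], event["timestamp"]
        result.append({**event, "session": session})
    return result


def events_per_session(events):
    """Aggregation: count events per (user, session) pair."""
    return dict(Counter((e["user_id"], e["session"]) for e in events))


# Hypothetical sample data, including one record with a missing user.
raw_events = [
    {"user_id": "u1", "timestamp": "2021-03-01T09:00:00", "name": "timer_started"},
    {"user_id": "u1", "timestamp": "2021-03-01T09:10:00", "name": "timer_stopped"},
    {"user_id": "u1", "timestamp": "2021-03-01T11:00:00", "name": "report_opened"},
    {"user_id": None, "timestamp": "2021-03-01T09:05:00", "name": "timer_started"},
    {"user_id": "u2", "timestamp": "2021-03-01T09:01:00", "name": "timer_started"},
]
counts = events_per_session(sessionize(clean_events(raw_events)))
```

In a real AWS Glue job, the same logic would more likely be expressed with Spark DataFrame operations, so that it scales beyond a single machine.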
In our Timesheets architecture, this part of the pipeline looks as follows:
- A time-based trigger starts a preprocessing job, which cleans the data, assigns event logs to the appropriate sessions, and calculates initial aggregations. The output of this job is stored on Amazon S3.
- The second trigger is set up to run after the complete and successful execution of the preprocessing job. This trigger starts a job that prepares the data used directly in the reports analysed by the Product Owners.
- The results of the second job are stored in an Amazon RDS database. This makes them easily accessible and usable in Business Intelligence tools like Google Data Studio, Power BI, or Tableau.
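The two-trigger chain described above could be defined roughly as follows. This is a sketch under stated assumptions: the job names, trigger names, and the 02:00 UTC schedule are hypothetical, and the dictionaries follow the parameter shape of the boto3 Glue client's `create_trigger` call. The boto3 import is kept inside `create_triggers`, so the definitions can be inspected without AWS credentials.

```python
# Hypothetical job names; the original architecture does not name them.
PREPROCESS_JOB = "timesheets-preprocess"
REPORT_JOB = "timesheets-report"


def scheduled_trigger(name, job_name, cron):
    """Arguments for a time-based trigger that starts job_name."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_success_trigger(name, watched_job, job_name):
    """Arguments for a trigger that starts job_name once watched_job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": watched_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def build_triggers():
    """The chain: schedule -> preprocessing job -> (on success) report job."""
    return [
        # Assumed schedule: every day at 02:00 UTC.
        scheduled_trigger("nightly-preprocess", PREPROCESS_JOB, "0 2 * * ? *"),
        on_success_trigger("after-preprocess", PREPROCESS_JOB, REPORT_JOB),
    ]


def create_triggers(triggers):
    """Register the triggers in AWS Glue (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the definitions above work without it

    glue = boto3.client("glue")
    for trigger in triggers:
        glue.create_trigger(**trigger)


triggers = build_triggers()
```

Calling `create_triggers(triggers)` would register both triggers. Because the second one fires only when the preprocessing job reaches the `SUCCEEDED` state, a failed preprocessing run never starts the report job.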
Amazon SageMaker provides modules to build, train, and deploy machine learning models.
It allows for training and tuning models at any scale and enables the usage of high-performance algorithms provided by AWS. You can also use custom algorithms by providing a suitable Docker image. SageMaker also simplifies hyperparameter tuning with configurable jobs that compare metrics for different sets of model parameters.
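The idea behind comparing metrics for different parameter sets can be illustrated with a small, self-contained sketch. It does not use the SageMaker API: it simply evaluates every combination in a grid and ranks the results, whereas a managed tuning job would launch real training jobs and may search the space more selectively. The objective function and the parameter names (`learning_rate`, `depth`) are toy assumptions.

```python
from itertools import product


def validation_error(learning_rate, depth):
    """Toy stand-in for the validation metric of a training run (lower is better)."""
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 4) ** 2


def compare_parameter_sets(grid, objective):
    """Evaluate every combination in the grid and rank by metric, best first."""
    names = sorted(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((objective(**params), params))
    # Sort on the metric only, so parameter dictionaries are never compared.
    results.sort(key=lambda item: item[0])
    return results


# Hypothetical search space.
grid = {"learning_rate": [0.01, 0.1, 0.3], "depth": [2, 4, 6]}
ranked = compare_parameter_sets(grid, validation_error)
best_metric, best_params = ranked[0]
```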
In Timesheets, SageMaker Notebook Instances help us explore the data, test ETL scripts, and prepare prototypes of visualisation charts to be used in a BI tool for report creation. This solution supports and improves the collaboration of data scientists, as it ensures they work in the same development environment. Moreover, it helps to ensure that no sensitive data (which can be part of the output of the notebooks’ cells) is stored outside the AWS infrastructure, because notebooks are stored only in Amazon S3 buckets and no Git repository is needed to share work between colleagues.
Deciding which Big Data and Machine Learning tools to use is crucial in designing a pipeline architecture for a Business Intelligence solution. This choice can have a substantial impact on system capabilities, costs, and the ease of adding new features in the future. AWS tools are certainly worth considering, but you should select a technology that suits your current tech stack and the skills of your development team.