In a modern organisation, more and more data is produced. The trick is to filter it, enrich it and then analyse it to see what has happened, but above all to decide what needs to be done. A well-thought-out data strategy and a flexibly designed data structure are essential. Daan Stroosnier and Mees Strooker talk about how e-scooter rental company Felyx redesigned its data strategy in the webinar 'Felyx: Clearing the road for Sustainable Data-ops Driven Architectures'.
Felyx is a fast-growing Dutch start-up (2017) that rents out electric shared scooters. Customers can sign up via an app, after which it shows where the nearest e-scooter is located. Felyx is now active in twelve cities in the Netherlands, Belgium and Germany, rents out around 6000 e-scooters, and has over 100,000 active monthly users who together have travelled 15 million kilometres to date. The company collects data in various ways, which Felyx uses to optimise customer service but also to make strategic decisions. How has Felyx determined its data strategy and how is the data infrastructure set up?
Multiple data sources
An important part of the data comes from the Felyx app. "This mainly involves customer data, such as the use of the app and the e-scooter," says Daan Stroosnier, Global Head of Data Analytics. "Supplemented by things like referral promotions, discounts and the like." Another source of data are the e-scooters, in fact IoT devices that constantly send information to Felyx. "The battery of the e-scooter is a separate IoT device," Stroosnier explains. "This sends information about its charge and condition to us, so we know when it needs to be changed."
The company's own technicians and fleet managers, who are the 'service agents' at Felyx, use their own front-end as an operational system. Here, for example, service tickets can also be seen, in which the state of maintenance of both e-scooter and battery is tracked.
Finally, Stroosnier mentions external data sources. "You can imagine that weather information is important to us. In bad weather, fewer kilometres are driven. Market information about our competitors is also stored in our database. It is our job as database engineers to make all this data accessible and to ensure that as little human input is required as possible as much automation takes place as possible. For this, it is important to collect all this data, and especially to ensure a good data infrastructure."
From legacy to state-of-the-art
The original data infrastructure mainly grew organically. "Before I joined Felyx, there was a student who had set up the entire data warehouse and associated infrastructure on his own," explains Data Engineer Mees Strooker. "He did a great job, but in the end this structure was not sustainable and we had to virtually start from scratch." An application was created for each strategic and operational need, which eventually led to an impenetrable tangle of applications. Eventually one of the biggest problems turned out to be the scalability of the PostgreSQL database, which was particularly noticeable in its reduced performance. "That meant a lot of lost time and maintenance, and analytical queries also started to become slow."
The old situation consisted of ETL processes that processed data in three ways
- Via 3rd party APIs
- Via OLTP snapshots
- Via the event queue
The ETL processes were hosted on Heroku, which works with isolated containers called 'dynos' that do not scale in a very expedient way. Horizontally is not such a problem, as additional containers can be added. Vertically it’s more difficult, because when more resources are needed for a specific dyno, a whole group may have to be upgraded at once, which causes higher costs than necessary. There were also problems with the event queue when an event schedule change occurred.
The debugging and fault trapping was poorly documented and every adjustment made was directly to the live data. This meant that when something went wrong, it was not possible to go back, because the source data was gone. There was also no validation of the data that was passed on for analysis.
Felyx decided to change the entire data structure and defined a number of essential starting points for this:
•Scalability and elasticity
•Change is the only constant
•Improved debugging possibilities
•From ETL to ELT
"In practice, this means that for every tool we want to deploy, we ask ourselves 'can we go back if needed?'" says Strooker. "And as far as the Unix philosophy is concerned, we worked with a proof of concept or a minimum viable product for each module, so that in the end we had the choice between two or three options." The result is that different technologies all work together and better documentation especially saving a lot of time if something seems wrong.
What does this mean in practice? The raw data is stored in different cloud buckets:
•Streaming event data in Avro format
•OLTP database snapshots in CSV
•3rd party API calls in JSON
The event data is processed in a continuous Kubernetes pod, the snapshots and 3rd party API calls are processed via Airflow.
Finally, the data arrives in the Snowflake warehouse where validation is done by Apache Beam. The schema is managed by Liquibase.
"Apache Beam makes it possible to handle and validate a lot of data. Liquibase allows us to change schemas incrementally without breaking anything in the data."
Felyx has a data model with many layers. Not only to guarantee the quality of the data, but also to format the data in such a way that the data analysts and scientists can easily work with it. "They can create relational tables themselves," Strooker explains. "For our warehouse, we looked at several options. One was a data vault, but that proved to be overkill from a technical point of view."
In the end, the data engineers chose 3rd normal form with history. "This allows us to create enriched layers on top of the normalised tables." Strooker stresses that every choice is a compromise. "And we are not sure if we have chosen the best solution. But the important thing is that when we switch, it is possible to rebuild the whole data structure because we have the right tooling for that."
For the final processing of the data into usable information, Felyx uses Data Build Tool for managing enriched data models and updating data marts. Data analysts can therefore easily take over some of the tasks of data engineers, as they can work with simple SQL statements. Version control, testing and reusable code in the form of macros thus enter the domain of the analysts.
What happens behind DBT is just as important, says Strooker. "You create a transformation pipeline for the data using simple SQL statements. You can have documentation generated directly from this SQL code, so everyone understands what the code is doing." In addition, DBT offers various materialisation strategies, including tables, views, and ephemeral data in the form of CTEs. "Incremental is also possible, where you use DBT to check if there is new data that meets the query, without having to rebuild the entire dataset."
Where does all this normalised and enriched data go in the end? From the point of view of the data engineers the end users can be divided into three groups.
•Analytics translators - they use Metabase for descriptive and diagnostics analysis
•Analytics engineers - custom dashboards in RShiny and Dash, also for descriptive and diagnostics analysis
•Data scientists - use ML models for predictive and prescriptive analysis
Thanks to the entire operation, Felyx's data landscape now looks very different. "Our main goal was flexibility," says Strooker, "and we certainly achieved that." With the tools now chosen, change management is possible, and new data domains can be added and loaded. "The trick is the intensive use of templates, such as Apache Airflow DAGs, DBT models and Helm charts." This allows Felyx the flexibility to switch between different environments and make dynamic use of scripts.
Schema management is handled by Liquibase, allowing the entire data model to be securely modified when needed. And finally, Felyx makes use of Snowflake zero copy cloning. This allows testing with a cloned version of the database, which can simply be removed after use. This proves very effective for the CI/CD pipeline.