If you’re engaged in the exciting world of machine learning, congratulations! We’re at a pivotal moment in AI, with unprecedented access to data and compute, and groundbreaking techniques that continue to surprise us. But while you and I can’t wait to play with the next transformer/GPT-x/SSL/RL algorithm or model, we all understand the reality that about 80% of the work in data analysis goes into compiling curated, high-quality data [1]. So a core need, one that can outweigh other data activities, is a reliable data pipeline that can easily and securely retrieve data to feed the ML models.
But we already know how to build this. For big data access, we have been turning to the likes of Apache Hadoop and Apache Spark, pooling data from many disparate sources, ranging from CSV files to Postgres to Salesforce to Kafka, into data lakes and cloud data warehouses like Snowflake and Databricks. We’re now even carrying DevOps methodologies over into MLOps to automate these processes, and orchestrating data acquisition with tools like Apache Airflow. So, what is the problem?
There are many, actually. First, most of these tools and techniques require data ingestion of some sort, meaning data must be moved from the source, like Salesforce, into, say, a data lake. This process has its benefits but comes with baggage: it is not real time; it makes copies of the data, which can be problematic from a privacy and data governance viewpoint (for example, when complying with regulations like CCPA and GDPR); and, perhaps most importantly, it requires us to write code to move the data. While we have connectors/providers for popular data sources and tools like Fivetran and dbt/Jinja to help build the ETL pipeline, the process still needs custom coding, especially for non-traditional data endpoints.
That by itself may not be a problem. Most data engineers can write custom code to connect to data sources, which typically provide well-defined APIs via connectors, drivers, and web interfaces. Unfortunately, not all of us data professionals have the expertise or the time. As data analysis tools and tasks have rapidly evolved over the last few years, it’s increasingly common to find people simultaneously taking on the roles of Data Scientist, Data Analyst, Data Engineer, ML Engineer, and Software Engineer (even if we can pin down the scopes of these titles!) [2]. For example, you could be a math major turned data scientist and SRE, tasked with writing Go/C++ code to access object data from Microsoft Dynamics, who would much prefer a low-code/no-code approach. Then there is the tedious task of making all the pipeline tools from different vendors and projects work together.
That is not all: data engineers are increasingly faced with the challenge of “democratizing data” to support employees in different roles who need access to the same data. Not everyone needs (or necessarily wants) access to all the data, and access control must cover not just roles (who can access what) but also dynamic conditions (like time of day) and data-specific content (like credit card numbers). This is especially critical for the governance of protected information, like home addresses and patient records. Ingesting all the data into a lake or warehouse only compounds the problem, since we must now control access to every copy made of that data. Also, the closer the data is to the source (ideally right at the source), the more context the system has about its lineage to enforce access effectively.
One tool that lets the data stay where it lives and provides simple, secure access (without changing the way you create or use the data itself) comes from Datafi, based in Seattle, WA. The idea is straightforward: guard each data source with an Edge Server that looks up a global policy before serving data. No copies are made of the data itself, and the Edge Server filters it based on user roles and resource permissions, as well as dynamic and content-specific rules.
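To make the idea of filtering at the source concrete, here is a minimal Python sketch of the three kinds of rules mentioned above: role-based column permissions, a dynamic time-of-day rule, and a content-specific rule that masks card-like numbers. It is only an illustration under stated assumptions: `Policy`, `filter_rows`, `CARD_RE`, the role names, and the business-hours window are all hypothetical and are not Datafi's API or implementation.

```python
import re
from dataclasses import dataclass
from datetime import time

# Hypothetical content rule: matches 16-digit card-like numbers, keeping the last 4 digits.
CARD_RE = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?(\d{4})\b")


@dataclass
class Policy:
    """Illustrative global policy; the field names are assumptions, not a real schema."""
    role_columns: dict                 # role -> set of columns that role may read
    allowed_from: time = time(8, 0)    # dynamic rule: start of the access window
    allowed_until: time = time(18, 0)  # dynamic rule: end of the access window


def filter_rows(rows, role, now, policy):
    """Return only what `role` may see at time `now`; the source rows are not modified."""
    # Dynamic rule: serve nothing outside the allowed time window.
    if not (policy.allowed_from <= now <= policy.allowed_until):
        return []
    # Role rule: unknown roles get no columns, so they get no rows.
    visible = policy.role_columns.get(role, set())
    if not visible:
        return []
    result = []
    for row in rows:
        out = {}
        for col, value in row.items():
            if col not in visible:
                continue
            # Content rule: mask card-like numbers in any visible string value.
            if isinstance(value, str):
                value = CARD_RE.sub(r"****-****-****-\1", value)
            out[col] = value
        result.append(out)
    return result


rows = [{"name": "Ada", "card": "4111 1111 1111 1111", "address": "1 Main St"}]
policy = Policy(role_columns={"analyst": {"name", "card"}, "support": {"name"}})

print(filter_rows(rows, "analyst", time(10, 30), policy))
# [{'name': 'Ada', 'card': '****-****-****-1111'}]
print(filter_rows(rows, "support", time(10, 30), policy))
# [{'name': 'Ada'}]
print(filter_rows(rows, "analyst", time(23, 0), policy))
# []
```

The point of the sketch is that the same source rows yield different views per role and per moment, without any copy of the data being created or maintained elsewhere.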
In the Datafi product, the policy and the rules are set by the data owner via a user interface; but clients, such as a Jupyter Notebook, retrieve the data as normal, and the Edge Servers filter the data transparently. Check out datafi.us for more information.
Before committing to a data pipeline architecture, it is worth weighing the pros and cons of a solution based on cloud data warehouses or lakes against one that accesses the data directly. Keeping the data where it lives is quick and secure, and carries little up-front investment or risk because there are no custom implementations. Even in situations that need a traditional warehouse-based setup, we can get the data pipeline going quickly with direct access to test things out before committing to a longer-term solution.
References
[1] MLOps: From Model-centric to Data-centric AI, Andrew Ng on YouTube
[2] Exploring The Evolving Role of Data Engineers, Data Engineering Podcast