Abstract
A large part of data science projects is spent on data engineering. Especially in open data contexts, data quality issues are prevalent and are often tackled by non-professional programmers. We introduce and evaluate Jayvee, a domain-specific language for data engineering aimed at reducing barriers to building data pipelines. We show that a structured DSL can have positive effects on speed, ease of use, and quality for data engineering by non-professional developers. For this, we present an empirical quantitative study, in which we compare the performance of students as proxies for non-professional programmers using Jayvee with Python and Pandas. We search for reasons for the empirical findings using a follow-up interview study on how using a DSL changes how non-professional programmers build data pipelines. Participants solve a subset of tasks faster, more easily, and with higher quality when using Jayvee compared to Python. Interviewees describe tradeoffs regarding the DSL’s more limited features, stricter code structure, and explicit descriptions. Jayvee is found to be more approachable, which leads to a more guided development flow. New data engineering languages should provide good tooling and documentation, plan how to visualize intermediate data and consider new development workflows involving tools like ChatGPT to find adoption.
Keywords
Data engineering, domain-specific language, empirical study, evaluation, open data, programming language
Reference
Heltweg, P., Schwarz, G., Riehle, D. & Quast, F. (2025). An Empirical Study of Jayvee, a Domain-Specific Language for Data Engineering, on Understanding Data Pipeline Architectures. In Software: Practice and Experience, Wiley, forthcoming.
Download
Available on the Wiley website (local copy).