The two steps in Dataform execution: Javascript, then SQL statements

Understanding SQLX and Javascript in Dataform

When you think of “programming languages for data,” Javascript probably doesn’t even make the list. Yet, if you’re using Dataform, it’s almost impossible to avoid. As you’ll learn in this article, Dataform is Javascript all the way down. Even SQLX files that you write are pretty much just Javascript files under the covers. So shed whatever fears you have about the confusingly named language, grab a cup of ☕, and get ready to get intimate with the finer points of how Dataform works.

First, the big picture

There are three things to know about how Dataform works with Javascript. First is that Dataform runs on Node.js, a JavaScript runtime environment. On top of that, the Google-hosted Dataform provides an even more restricted “sandboxed” environment that does not support additional functionality or modules provided by Node.js. So everything from the user interface to parsing SQL files to building a dependency tree happens in Javascript. That means if you want to play along, you’ll have to get on board.

Second, it’s important to note that Dataform runs all the Javascript in a project in a single run at the time of execution. At that time, Dataform parses all the SQLX files and runs all the Javascript in the /definitions and /includes directories that get called in execution. This produces a dependency tree, thanks to the ref function, and a sequence of SQL commands to be executed in the database.

The final preface is that Javascript is only writing the SQL statements, not actually performing operations on the data itself (like js user-defined functions in BigQuery.) You can think of it as dynamically writing SQL files that generate DDL and DML statements. Dataform then runs these statements in BigQuery (or another DB with the open-source version.) Hopefully, my 3D design skills illustrate that. 🙂

Here’s what that looks like in my head. I hope it makes sense to you, too…

Let’s get into a concrete example to see what this all means to you.

Example: Using Javascript in SQLX

To show how Dataform handles SQLX and Javascript, I made up a silly example that uses data from Google Analytics 4 export. Here’s a SQLX file that would return a database view containing data about all the pageview events since a given date.


-- File: definitions/pageviews.sqlx

config { 
    type: "view" 
}

js {
    const event_type = 'page_view'
}

select
    event_timestamp,
    user_pseudo_id,
    ${ utils.getEventParam('page_location', 'string') },
    ${ utils.getEventParam('page_referrer', 'string') },
    ${ utils.getEventParam('ga_session_id', 'int') },
from ${ ref('events_*') } pv
where event_name = '${ event_type }'
    and pv.event_date >= '${ constants.analysis_start_date }'

It’s a pretty simple SQLX file, but I’ve added some elements to it to illustrate how Javascript interacts with SQLX. There are a few things to take notice of in the file:

  • The javascript block (js) sets the event_type variable that gets referenced in the query’s WHERE clause.
  • Every reference to a Javascript function is enclosed in ${...}
  • The query uses the utils.getEventParam() function three times to write the SQL for repeated data transformations
  • The FROM clauses uses the ref() function to reference the definition’s one dependency
  • The WHERE clause also references a variable called constants.analysis_start_date

Each of these elements illustrates something interesting about the use of Javascript in Dataform. But before we get into that, let’s look at the compiled SQL after Dataform reads the file.

--  Compiled version of definitions/pageviews.sqlx

select
    event_timestamp,
    user_pseudo_id,
    (
      select ep.value.string_value AS page_location 
      from unnest(event_params) ep 
      where ep.key = 'page_location'
  ) AS page_location,
    (
      select ep.value.string_value AS page_referrer 
      from unnest(event_params) ep 
      where ep.key = 'page_referrer'
  ) AS page_referrer,
    (
      select ep.value.int_value AS ga_session_id 
      from unnest(event_params) ep 
      where ep.key = 'ga_session_id'
  ) AS ga_session_id,
from `example_database.analytics_12345678.events_*` pv
where event_name = 'page_view'
    and pv.event_date >= '20231001'

What if I told you SQLX is just a Javascript template literals?

Yep. That’s right. Once you understand that, everything starts to make a lot more sense.

Javascript template literals are string “literals delimited with backtick (`) characters, allowing for multi-line strings and string interpolation with embedded expressions.” Those two features (multi-line strings and string interpolation) are what make SQLX… SQLX.

That’s why variables are enclosed in ${}

If you’re unfamiliar with string interpolation, now would be a good time to read up. The short version is, “string interpolation … is the process of evaluating a string literal containing one or more placeholders, yielding a result in which the placeholders are replaced with their corresponding values.” In Javascript, template literals use the ${} syntax to embed Javascript expressions that are evaluated and return a result. In other words, that’s why select ${ 1 + 1 } will evaluate to select 2 in Dataform.

Now, it’s a lot easier to make sense of the first example above. There are several embedded expressions, including event_type, utils.getEventParam(), ref(), and constants.analysis_start_date . Note that two of the expressions, event_type and constants.analysis_start_date are just plain Javascript variables, and the other two are function calls that print out different results depending on their arguments.

Translating SQLX to Javascript

To clarify what Dataform is doing behind the scenes when it processes a SQLX file, let’s translate the example definition above to a Javascript implementation. It uses the publish() function to create the same example definition.

-- File: definitions/pageviews.js

const event_type = 'page_view'

publish('pageviews_js', { type: "view" }).query( ctx =>  
`
select
  event_timestamp,
  user_pseudo_id,
  ${utils.getEventParam('page_location', 'string')},
  ${utils.getEventParam('page_referrer', 'string')},
  ${utils.getEventParam('ga_session_id', 'int')}
from ${ctx.ref('events_*')} pv
where event_name = '${ event_type }'
  and pv.event_date >= '${ constants.analysis_start_date }'
`
);

Now, we can see that the query argument is just a big multi-line template literal enclosed in backticks. On top of that, all the variables and function calls in the embedded expressions look and behave just as they should in Javascript.

Note that publishing definitions isn’t the only thing that you can do in Javascript. You can also run operations, perform tests with assertions, and, of course, declare tables.

Referencing Javascript files in the /includes directory

Now that the relationship between SQLX and Javascript is (hopefully) clear, two last elements in the example above need further explanation. Those are the utils.getEventParam() function and the constants.analysis_start_date variable.

These two variables share something that makes them different than the event_type variable: the two variables are properties of objects which act as namespaces for the variables. The reason for this is that they are imported into the pageviews.sqlx Javascript scope from other files in the includes directory. To illustrate that, here’s the directory structure of this example project.

|—definitions/
    |—pageviews.sqlx
    |—pageviews.js
|—includes/
    |—constants.js
    |—utils.js

Dataform, by default, imports all the modules in the /includes directory. If it didn’t, in order to import those variables in, I’d have to require them at the top of the Javascript file (or in the js block) using syntax like this: const utils = require('../includes/utils.js'); Instead, Dataform provides this guidance and makes importing a little bit easier because of it. Similarly, Dataform brings in functions like ref(), assert(), and declare(). This is a framework called common.js. It might be worth a read if this is all new to you.

Now, let’s step back and take a look at those two files that are imported in.

This is the constants.js file that is typically a place to manage variables that would be shared across multiple files. You’ll notice that this file only exports one variable, but a more extensive project would probably have several.

-- File: includes/constants.sqlx

const analysis_start_date = '20231001'

module.exports = { analysis_start_date };

This is the utils.js file that stores Javascript functions that will be used several times in a file (like the example) or shared across multiple files. It also only exports one function. Credit to Artem Korneev for this really handy function.


const getEventParam = ( eventParamName, eventParamType = "string", columnName = false ) => {
  let eventParamTypeName = "";
  switch (eventParamType) {
    case "string":
      eventParamTypeName = "string_value";
      break;
    case "int":
      eventParamTypeName = "int_value";
      break;
    case "double":
      eventParamTypeName = "double_value";
      break;
    case "float":
      eventParamTypeName = "float_value";
      break;
    default:
      throw "eventType is not valid";
  }
  return `(
      SELECT ep.value.${eventParamTypeName} AS ${eventParamName} 
      FROM UNNEST(event_params) ep 
      WHERE ep.key = '${eventParamName}'
  ) AS ${ columnName ? columnName : eventParamName}`;
};

module.exports = { getEventParam };

Note that these files (modules) could export more than one variable if they were listed, comma-separated, in the module.exports{...} line at the end.

A few more notes on Javascript and SQLX

One thing I also wanted to bring up but didn’t find the place above is that since there is so much use of template strings, you might find yourself needing to use a backtick (`) to enclose a BigQuery table name (though you should probably use ref() instead). In that case, you’ll have to escape the backtick like this: const tableName = \`my-db.my-dataset.my-table\`.

Other than that, I think I covered everything I wanted to. Now, the only thing left to do is make a dbt-esque meme that says something about Dataform just being Javascript template literals and SQL. (If you know, you know.)

Have questions about Dataform functionality? Let me know in the comments, or send me a connection on Linked.

Leave a Comment

Your email address will not be published. Required fields are marked *