Monday, November 28, 2022

Bring Me the Severed Head of my Data

[This article is primarily intended for DataWeave developers. It deals with code and development strategies.] 

A while back, I wrote this article suggesting an approach to generating sample data with a simple DataWeave function. I left the conversation open-ended to give readers time to suggest their own solutions. I provided a sample (the severed head of my data) and a shell that suggested an approach to generating sample data at any scale.

In this article, I offer one possible solution. If you have not read the OP (as it were) then give it a read and then see if you can hack out a solution that you like. If you've already given it a try, or if you simply want to see a solution dissected, then by all means, turn the page.

The premise is that you sometimes need to synthesize data for a project, and although there are tools that can readily help you do this, sometimes the characteristics of the problem domain call for a customized solution. Here are some reasons you might turn to DataWeave to help you.

Consider the case that your Mule app will ingest a stream of objects that arrive at a variable rate. You might simulate this by feeding objects into a VM queue or a JMS queue using a Scheduler to regulate the rate of object generation.

Or think about how you can deal with the condition that your API will be presented with a collection of objects and must process the collection efficiently at scale. After assembling the processing logic, you might want to spin up a mock data source that can present a scale replica of your expected data.

So in my previous installment of this conversation, I gave you this to begin:

%dw 2.0

output application/json

/*
* Create a function that accepts a parameter N.
* it should create N records with elements chosen
* randomly from the arrays below
*/
var sampleRecord = {
    "name": "General Robotics",
    "account_id": "1001699305",
    "created": "2022-04-10 01:47:53",
    "city": "Farmington",
    "state": "IL",
    "postal": "79068"
}

var companies = ["Giant","Greed","Value","Pros","Family","General","Empire"]
var industry = ["Hardware","Media","Foods","Medical","Automotive","Sports Wear"]
var cities = ["Austin","Boston","Detroit","Chicago",
        "Phoenix","Dublin","Paris","Dimebox"]
var states = ["TX","IL","MI","AZ","TN","WI","CA","RI"]

---
sampleRecord

It actually turns out, when you begin to consider the issue seriously, that you will need several functions. The suggested function will need to synthesize "name," "city" and "state" from the values in the "seed arrays." The "account_id" and "postal" field values can be synthesized arithmetically using the randomInt() function and a little thought.

So, let's begin with a function that will generate the company name. A simple and casual function might look like this:

fun createCompany() =
    companies[randomInt(sizeOf(companies))]
        ++ " "
        ++ industry[randomInt(sizeOf(industry))]

Okay, I ain't gonna lie. This code makes my teeth hurt!

For one thing, it fails at being a pure function. And secondly, it applies the most primitive of DataWeave operations, string concatenation. All too often, I find bad DataWeave transformations that contain what I call the Endless Graveyard of Concatenations. It's not the end of the world (although, I must confess that I've seldom traipsed to the end of the "Graveyard" so there may be an apocalypse along the way that I missed somewhere), but there is a better way to do this.

A simple fix would be to use string interpolation. But that solves only part of the problem (and the way this code is written, it would have us create a pretty lengthy line of code for the function).
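For comparison, here is a minimal sketch of what the interpolated version might look like. It assumes the same global "seed arrays" from the starting script and only swaps the concatenation for interpolation, so it illustrates the long line rather than a recommended form.

```dataweave
%dw 2.0
output application/json

var companies = ["Giant","Greed","Value","Pros","Family","General","Empire"]
var industry = ["Hardware","Media","Foods","Medical","Automotive","Sports Wear"]

// Same logic as the concatenated version, with $( ) interpolation instead of ++.
// The function still reaches out to the global arrays, and the line is long.
fun createCompany() =
    "$(companies[randomInt(sizeOf(companies))]) $(industry[randomInt(sizeOf(industry))])"
---
createCompany()
```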

The larger problem is the pure function thing. Now most of the "awe" that functional programming mavens hold for the concept of pure functions is relevant mostly in the badlands of other development languages and platforms that define mechanisms to govern variable storage and scope.

In DataWeave, there are no "static" or "transient" or "stack perpetually ephemeral" variables. By default, all variables declared in the header of your transformation are global to that transformation, and if you use the do{} enclosure to create an "inner transformation" then you may consider variables created there as global to that enclosure.

Our createCompany() function matures a lot when we use this:

fun createCompany() = do {
    var companies = ["Giant","Greed","Value","Pros","Family","General","Empire"]
    var industry = ["Hardware","Media","Foods","Medical","Automotive","Sports Wear"]
    var cname = companies[randomInt(sizeOf(companies))]
    var iname = industry[randomInt(sizeOf(industry))]
    ---
    "$(cname) $(iname)"
}

The body of this function is now easy to read and to maintain. The "seed arrays" are localized to this function because they are not needed elsewhere, and this is now much closer to a pure function because it does not depend upon external references to data. (Strictly speaking, the call to randomInt() still makes the result vary from one call to the next.) Although much of the value from being a pure function is not relevant in DataWeave, this consideration is a big deal. The emphasis on generalization of functions, and the reuse of working logic, has us always asking ourselves how overspecialized our functions might be.

Here's how some of the other utility functions will look then:

fun createCity() = do {
    var cities = ["Austin","Boston","Detroit","Chicago",
        "Phoenix","Dublin","Paris","Dimebox"]
    ---
    cities[randomInt(sizeOf(cities))]
}

fun createState() = do {
    var states = ["TX","IL","MI","AZ","TN","WI","CA","RI"]
    ---
    states[randomInt(sizeOf(states))]
}

I will not do it here, but the similarity in these functions suggests that they could be collapsed into one if we are willing to pass the "seed array" as a parameter. The localization of the seed arrays would not be necessary in such a case.
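For the curious, a collapsed version might look something like the sketch below. The name pickOne is hypothetical (it does not appear in the original solution), and the seed arrays go back to being header variables because they are now passed in as arguments.

```dataweave
%dw 2.0
output application/json

var cities = ["Austin","Boston","Detroit","Chicago",
    "Phoenix","Dublin","Paris","Dimebox"]
var states = ["TX","IL","MI","AZ","TN","WI","CA","RI"]

// Hypothetical helper: returns one randomly chosen element of the array it is given.
// randomInt(n) yields a whole number from 0 up to n - 1, which is always a valid index
// as long as the array is not empty.
fun pickOne(seeds: Array) = seeds[randomInt(sizeOf(seeds))]
---
{
    city: pickOne(cities),
    state: pickOne(states)
}
```

The trade-off is that the function no longer carries its own data, so each caller decides which seed array to supply.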

To get the "account_id" and the "postal" code, we require a pair of functions that simply construct a value suitable for our output record.

fun createPostal() = (randomInt(70000) + 30000) as String {format: "00000"}
fun makeAccountID() = now() as String {format: "ddHHmmss"}
    ++ randomInt(100) as String {format: "00"}

To get a postal code (presuming the simple US pattern), we get a number between 0 and 69,999 and boost it just a little bit with addition so that our final range of values is between 30,000 and 99,999.

(randomInt(70000) + 30000)

Then we convert it to a String with a pattern that pads the value to five digits. That's a perfectly apt range for our simulated postal code.

as String {format: "00000"}

For account ID, we work a little harder. First we take a date stamp from the time of execution. We arrange that as a string that contains "day," "hour" (on the 24-hour clock, hence "HH" rather than "hh," which would repeat every twelve hours), "minute," and "second."

now() as String {format: "ddHHmmss"}

We then enhance it (using concatenation; yes, I'm aware of the harsh things I've said in the past) using a two-digit salt.

randomInt(100) as String {format: "00"}

So now we are ready to assemble the function suggested by the problem description.

/*
* Create a function that accepts a parameter N.
* it should create N records with elements chosen
* randomly from the arrays below
*/

Are you ready? Let's try this:

fun createAccounts(c:Number) =
    (1 to c) as Array map (d,i) ->
        {
            name: createCompany(),
            account_id: makeAccountID(),
            created: now() as String {format: "yyyy-MM-dd HH:mm:ss"},
            city: createCity(),
            state: createState(),
            postal: createPostal()
        }

Our function accepts a number c and uses it as a repetition count. It will produce that many records. We call each of our functions to create the elements of the record. We probably should have written a function to set the "created" timestamp. It's simple enough to express inline, but ideally, we would isolate it for ease of maintenance and update.
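Isolating the timestamp could be as small as the sketch below. The name createTimestamp is hypothetical, and the "HH" (24-hour clock) pattern is an assumption; swap in "hh" if a 12-hour clock is what you want.

```dataweave
%dw 2.0
output application/json

// Hypothetical helper: keeps the "created" format in one place.
// "HH" is the 24-hour clock (an assumption here); "hh" would be the 12-hour clock.
fun createTimestamp() =
    now() as String {format: "yyyy-MM-dd HH:mm:ss"}
---
{
    created: createTimestamp()
}
```

With this in place, the record in createAccounts() would simply read created: createTimestamp(), and any later change to the format happens in one spot.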

When seen in the DataWeave Playground, the outcome looks like this:


Give this a try yourself.

You can download the jar file from this project and try it out yourself in Anypoint Studio.


To learn more about DataWeave, check out the DataWeave Tutorial on the DataWeave Playground (find the button at the upper right-hand side of the screen). The MuleSoft Blog also provides a number of HowTo articles that may be helpful to you. The best way, of course, is to visit the MuleSoft Training website to discover all your options.

Vincent Lowe is a Senior Technical Instructor for Salesforce Trailhead Academy. He has trained developers in C, Java, Perl, Python, JavaScript, and DataWeave. The views expressed here are his, and not necessarily those of MuleSoft or Salesforce.
