Tutorial
Realistic Data
The data generated for a stream is guaranteed to match its JSON Schema. So we can improve the realism of the simulated data by specifying standard validation keywords:
streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
        full_name:
          type: string
        created_at:
          type: string
          format: date-time
      required:
        - id
        - full_name
        - created_at
Before re-running the simulation, move config.yml under the .rngo directory; the rngo CLI looks for the config there by default.
Now run rngo sim. The new data will be under a new simulation directory, and the lines in data.json will look something like this:
{"id":69266575659,"full_name":"CY$,tDUXHd[qc/!gbg4/tS6AEZ{)(5","created_at":"6469-04-24T03:24:12.914Z"}
{"id":167604,"full_name":"mZf?6Q@mCC*y4cKxv-SX:}eW*Pp_ur","created_at":"651-09-18T11:12:13.159Z"}
It's a modest improvement: the IDs are not negative, and the timestamps are ISO-8601. But the IDs are non-sequential, the names are still random, and one of the created_at values predates the invention of the computer by well over a millennium.
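The gap between "valid" and "realistic" can be made concrete with a small check. The following is a minimal Python sketch, not part of rngo: it hand-checks only the type, minimum, and required constraints from the schema above (it does not validate the date-time format), and the realism thresholds (IDs counting up from 1, years between 1970 and 2100) are arbitrary assumptions chosen for illustration.

```python
import json

# The two sample lines shown above, copied verbatim.
lines = [
    '{"id":69266575659,"full_name":"CY$,tDUXHd[qc/!gbg4/tS6AEZ{)(5","created_at":"6469-04-24T03:24:12.914Z"}',
    '{"id":167604,"full_name":"mZf?6Q@mCC*y4cKxv-SX:}eW*Pp_ur","created_at":"651-09-18T11:12:13.159Z"}',
]
records = [json.loads(line) for line in lines]


def satisfies_schema(record):
    """Hand-rolled check of type, minimum, and required (not a full JSON Schema validator)."""
    return (
        isinstance(record.get("id"), int)
        and record["id"] >= 1
        and isinstance(record.get("full_name"), str)
        and isinstance(record.get("created_at"), str)
    )


def year_of(record):
    """Read the year as the digits before the first '-' in created_at."""
    return int(record["created_at"].split("-", 1)[0])


def looks_realistic(records, earliest_year=1970, latest_year=2100):
    """Assumed realism criteria: IDs count up from 1 and years fall in a plausible range."""
    ids = [r["id"] for r in records]
    sequential = ids == list(range(1, len(ids) + 1))
    plausible_years = all(earliest_year <= year_of(r) <= latest_year for r in records)
    return sequential and plausible_years


print(all(satisfies_schema(r) for r in records))  # every record passes the schema checks
print(looks_realistic(records))                   # but the set fails the realism checks
print([year_of(r) for r in records])
```

Both records pass the schema checks while failing the realism checks, which is the problem the rest of this page addresses.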
To make the data actually realistic, update the config to use the rngo.value keyword defined by the rngo custom JSON Schema vocabulary:
streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
          rngo:
            value: (streams.users.last.id ?? 0) + 1
        full_name:
          type: string
          rngo:
            value: enums.fullNames
        created_at:
          type: string
          format: date-time
          rngo:
            value: sim.now
      required:
        - id
        - full_name
        - created_at
Run the simulation again. The first two lines will be semantically equivalent to this:
{"id":1,"full_name":"Sophia Rodriguez","created_at":"2023-07-08T04:17:01.331Z"}
{"id":2,"full_name":"Olivia Thompson","created_at":"2023-07-08T04:17:16.504Z"}
Which is pretty realistic! So what exactly did the rngo.value keyword do?
Similar to the standard enum and const keywords, rngo.value restricts the generated result to a desired value or set of values. But it does so by defining a dynamic expression that is evaluated prior to generating each event.
So, prior to generating the first event, you can think of the schema as:
streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
          const: 1
        full_name:
          type: string
          enum:
            - Isabella Mitchell
            - Olivia Thompson
            - Ethan Taylor
            - Benjamin Davis
            # many, many other names
        created_at:
          type: string
          format: date-time
          const: 2023-07-08T04:17:01.331Z
Then, prior to the second event, it is effectively:
streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
          const: 2
        full_name:
          type: string
          enum:
            - Isabella Mitchell
            # etc
        created_at:
          type: string
          format: date-time
          const: 2023-07-08T04:17:16.504Z
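The per-event evaluation described above can be mimicked in a few lines of Python. This is only a conceptual sketch and not how rngo is implemented: the function names, the short name list, and the fixed 15-second clock step are all hypothetical, and how sim.now actually advances between events is not specified on this page.

```python
import json
import random
from datetime import datetime, timedelta, timezone

# Hypothetical stand-in for enums.fullNames; the real list is much longer.
FULL_NAMES = ["Isabella Mitchell", "Olivia Thompson", "Ethan Taylor", "Benjamin Davis"]


def resolve_schema(last_event, now):
    """Evaluate each expression into a const or enum constraint for the next event."""
    last_id = last_event["id"] if last_event is not None else None
    timestamp = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return {
        # (streams.users.last.id ?? 0) + 1: fall back to 0 when there is no previous event
        "id": {"const": (last_id if last_id is not None else 0) + 1},
        # enums.fullNames: any one of the listed names
        "full_name": {"enum": FULL_NAMES},
        # sim.now: the simulated clock at the time of the event
        "created_at": {"const": timestamp},
    }


def generate(constraints, rng):
    """Produce an event that satisfies every const or enum constraint."""
    return {
        key: rule["const"] if "const" in rule else rng.choice(rule["enum"])
        for key, rule in constraints.items()
    }


def simulate(count, start, step, seed=0):
    """Re-resolve the schema before each event, then generate from it."""
    rng = random.Random(seed)
    events, last, now = [], None, start
    for _ in range(count):
        last = generate(resolve_schema(last, now), rng)
        events.append(last)
        now += step  # assumed fixed step between events
    return events


start = datetime(2023, 7, 8, 4, 17, 1, 331000, tzinfo=timezone.utc)
events = simulate(2, start, timedelta(seconds=15))
for event in events:
    print(json.dumps(event))
```

The first event gets id 1 because there is no previous event to read from, and the second gets id 2 because the constraint is recomputed from the first event before generation.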
The rngo.value keyword and schema expressions are fundamental to the realism of rngo's simulations. To learn more, see Expressions.