Realistic Data

The data generated for a stream is guaranteed to match its JSON Schema. So we can improve the realism of the simulated data by specifying standard validation keywords:

streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
        full_name:
          type: string
        created_at:
          type: string
          format: date-time
      required:
        - id
        - full_name
        - created_at

Before re-running the simulation, move config.yml into the .rngo directory. The rngo CLI looks for the config there by default.

Now run rngo sim. The new data will be under a new simulation directory, and the lines in data.json will look something like this:

{"id":69266575659,"full_name":"CY$,tDUXHd[qc/!gbg4/tS6AEZ{)(5","created_at":"6469-04-24T03:24:12.914Z"}
{"id":167604,"full_name":"mZf?6Q@mCC*y4cKxv-SX:}eW*Pp_ur","created_at":"651-09-18T11:12:13.159Z"}

It's a modest improvement: the IDs are positive, and the timestamps are ISO 8601. But the IDs are non-sequential, the names are still random, and one of the created_at values predates the invention of the computer by well over a millennium.
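To make the point concrete, here is a minimal Python sketch that checks the two sample lines against the keywords in the schema above. The check_user function is a hypothetical, hand-rolled helper, not part of rngo and not a full JSON Schema validator; it covers only type, minimum and required, and does not check the date-time format.

```python
import json

# Hand-rolled checks mirroring the keywords in the users schema above.
# This is an illustration, not a full JSON Schema validator, and it does
# not check the date-time format.
REQUIRED = ("id", "full_name", "created_at")


def check_user(event: dict) -> list[str]:
    """Return a list of violations; an empty list means the event passes."""
    problems = []
    for key in REQUIRED:
        if key not in event:
            problems.append(f"missing required property: {key}")
    user_id = event.get("id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if "id" in event and (isinstance(user_id, bool) or not isinstance(user_id, int)):
        problems.append("id is not an integer")
    elif isinstance(user_id, int) and user_id < 1:
        problems.append("id is below minimum 1")
    for key in ("full_name", "created_at"):
        if key in event and not isinstance(event[key], str):
            problems.append(f"{key} is not a string")
    return problems


lines = [
    '{"id":69266575659,"full_name":"CY$,tDUXHd[qc/!gbg4/tS6AEZ{)(5","created_at":"6469-04-24T03:24:12.914Z"}',
    '{"id":167604,"full_name":"mZf?6Q@mCC*y4cKxv-SX:}eW*Pp_ur","created_at":"651-09-18T11:12:13.159Z"}',
]
for line in lines:
    print(check_user(json.loads(line)))
```

Both lines pass these checks, which is exactly the problem: passing validation says nothing about whether the values are plausible.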

To make the data actually realistic, update the config to use the rngo.value keyword defined by the rngo custom JSON Schema vocabulary:

streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
          rngo:
            value: (streams.users.last.id ?? 0) + 1
        full_name:
          type: string
          rngo:
            value: enums.fullNames
        created_at:
          type: string
          format: date-time
          rngo:
            value: sim.now
      required:
        - id
        - full_name
        - created_at

Run the simulation again. The first two lines will be semantically equivalent to this:

{"id":1,"full_name":"Sophia Rodriguez","created_at":"2023-07-08T04:17:01.331Z"}
{"id":2,"full_name":"Olivia Thompson","created_at":"2023-07-08T04:17:16.504Z"}

Which is pretty realistic! So what exactly did the rngo.value keyword do?

Similar to the standard enum and const keywords, rngo.value restricts the generated result to a set of allowed values or to a single value. But it does so by defining a dynamic expression, which is evaluated before each event is generated.

So, prior to generating the first event, you can think of the schema as:

streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
          const: 1
        full_name:
          type: string
          enum:
            - Isabella Mitchell
            - Olivia Thompson
            - Ethan Taylor
            - Benjamin Davis
            # many, many other names
        created_at:
          type: string
          format: date-time
          const: "2023-07-08T04:17:01.331Z"

Then, prior to the second event, it is effectively:

streams:
  users:
    schema:
      type: object
      properties:
        id:
          type: integer
          minimum: 1
          const: 2
        full_name:
          type: string
          enum:
            - Isabella Mitchell
            # etc
        created_at:
          type: string
          format: date-time
          const: "2023-07-08T04:17:16.504Z"
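The per-event resolution shown in the two schemas above can be modelled with a short Python sketch. This is a mental model only, not rngo's implementation: effective_constraints, generate and simulate are hypothetical names, FULL_NAMES stands in for enums.fullNames, and the way the clock advances between events is an assumption.

```python
import random
from datetime import datetime, timedelta, timezone

# Stand-in for enums.fullNames; the real list is much longer.
FULL_NAMES = ["Isabella Mitchell", "Olivia Thompson", "Ethan Taylor", "Benjamin Davis"]


def effective_constraints(last_event, now):
    """Resolve each rngo.value expression into a const or enum for one event."""
    last_id = last_event["id"] if last_event is not None else None
    timestamp = now.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    return {
        # (streams.users.last.id ?? 0) + 1
        "id": {"const": (last_id if last_id is not None else 0) + 1},
        # enums.fullNames
        "full_name": {"enum": FULL_NAMES},
        # sim.now
        "created_at": {"const": timestamp},
    }


def generate(constraints, rng):
    """Pick a value that satisfies each resolved constraint."""
    event = {}
    for name, rule in constraints.items():
        event[name] = rule["const"] if "const" in rule else rng.choice(rule["enum"])
    return event


def simulate(count, seed=0):
    """Generate `count` events, re-resolving the constraints before each one."""
    rng = random.Random(seed)
    now = datetime(2023, 7, 8, 4, 17, 1, 331000, tzinfo=timezone.utc)
    last_event = None
    events = []
    for _ in range(count):
        last_event = generate(effective_constraints(last_event, now), rng)
        events.append(last_event)
        # Hypothetical clock step; how rngo advances sim.now is not covered here.
        now += timedelta(seconds=rng.uniform(1, 30))
    return events


for event in simulate(3):
    print(event)
```

The key point is that the constraints are recomputed from the simulation state before every event, so id and created_at behave like a const that changes over time, while full_name behaves like a fixed enum.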

The rngo.value keyword and schema expressions are fundamental to the realism of rngo's simulations. To learn more, see Expressions.
