Nullable's Dev Blog #2 - Creating a Custom Digital Store



It's been a few months since the release of the custom Tribot store, and several people have been interested in how we built it, what technologies we used, and why. These topics range from intermediate to fairly complex in terms of development knowledge, and I probably won't explain every one of them in full depth. If you have any questions on any component, please ask in the comments and I'll respond!

To give some context, we wanted to essentially rewrite the entire Tribot backend: script uploading, instance management, the store, the payment system, the order system... everything. We estimated that a rewrite would not only let us improve many systems but also reduce costs by a staggering amount.

We wanted to rewrite everything and release it with zero downtime.

The system we had at the beginning consisted of:

  • Auth0 for account management (keep)
  • Cloudflare for DNS, analytics, and security toggles (keep)
  • Shopify for the entire store/payment system (remove)
  • 3 AWS-managed MySQL databases for the forums and old repo (remove)
  • A PHP webserver for the old repo backend and frontend (remove)
  • A java webserver for issuing session tokens in exchange for Auth0 JWTs (remove)

So our task was to migrate two MySQL databases and all of the Shopify data while also fundamentally changing how some of our systems worked. For example, we wanted the Tribot client to use stateless auth with the Auth0 JWT directly, rather than exchanging it for a session.
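To illustrate the stateless-auth idea: the client sends its Auth0 JWT with every request, and any server replica can decode the claims without a session lookup. The sketch below only shows the decode-and-check-expiry part and is not Tribot's real code; a production service must also verify the RS256 signature against Auth0's JWKS, which a proper JWT library handles.

```kotlin
import java.util.Base64

// Hypothetical sketch: pull the "exp" claim out of a JWT so any stateless
// replica can reject an expired token without consulting shared session state.
fun jwtExpiry(token: String): Long? {
    val parts = token.split(".")
    if (parts.size != 3) return null // not a well-formed header.payload.signature JWT
    val payload = String(Base64.getUrlDecoder().decode(parts[1]))
    // Naive claim extraction for illustration; a real service uses a JWT library.
    return Regex("\"exp\"\\s*:\\s*(\\d+)").find(payload)?.groupValues?.get(1)?.toLong()
}

fun isExpired(token: String, nowSeconds: Long): Boolean =
    jwtExpiry(token)?.let { it <= nowSeconds } ?: true // unparseable tokens are rejected
```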

Tribot's Tech Stack

Infrastructure

We chose to move everything to DigitalOcean. They have a fantastic value-to-cost ratio along with being reliable even for critical systems.

Database

We chose to consolidate all of Tribot's main system data into a single DigitalOcean-managed PostgreSQL database. I work with distributed data a lot and knew that 1) it's really, really complicated and time-consuming to get right, and 2) very few businesses actually need distributed data storage, and Tribot is not an exception.

While I debated back and forth a lot between a NoSQL database like MongoDB and a relational SQL database, I ended up choosing PostgreSQL due to its performance and flexibility. SQL can be cumbersome to do right, but I decided it was worth it for the benefits.

To handle migrations, we use Flyway. Database migrations are a must-have in my opinion. They make it easy to set up test/local databases and provide a standard, safe way to modify the tables.
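As an illustration, a Flyway migration is just a versioned SQL file that Flyway discovers by its filename convention and applies in order. This is a hypothetical example (the table and column are made up, not Tribot's real schema); it might live at src/main/resources/db/migration/V7__add_order_notes.sql:

```sql
-- V7__add_order_notes.sql (hypothetical): Flyway applies this exactly once,
-- in version order, and records the run in its schema history table.
ALTER TABLE orders ADD COLUMN note TEXT;
```

The same files run against local, test, and production databases, which is what makes spinning up a fresh environment so painless.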

Docker

We technically don't have Docker anywhere in the production architecture, but we use Dockerfiles to build our container images and Docker for local dev. All of our server code runs in containers. Containers are nice because we can control the whole webserver environment from source code. If you're unfamiliar with containers, I highly recommend looking into them if you're interested in web development or backend development.
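For a feel of what "controlling the environment from source code" means, here's a minimal sketch of a Dockerfile for a JVM service. The base image and jar path are assumptions for illustration, not our actual build:

```dockerfile
# Hypothetical Dockerfile for a JVM service; the jar path and base image
# are illustrative. Everything the server needs is declared right here.
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY build/libs/app-all.jar app.jar
EXPOSE 8080
ENTRYPOINT ["java", "-jar", "app.jar"]
```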

Kubernetes

Yes, we run Tribot on a DigitalOcean-managed Kubernetes cluster. Why? After all, it's not like we're doing microservices (that would be extremely unnecessary). The thing is, Kubernetes is extremely useful even without microservices. I like to consider Tribot's architecture a "distributed monolith", meaning our server code is written such that it can run in parallel with itself. So we can run any number of copies of our webserver, which helps with availability.

Take this k8s config for example:

spec:
  replicas: 3
  selector:
    matchLabels:
      app: rune
  minReadySeconds: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1

This very small piece of config tells the Kubernetes cluster that the "Rune" service (which is what we call our main application) should run 3 replicas of itself in parallel. "maxUnavailable: 1" means that during an update, at least 2 replicas stay running. When we deploy a new version, Kubernetes spins up the new pods, waits until each one has reported "ready" for 10 seconds (that's minReadySeconds), then kills the old ones, which means zero downtime.

And k8s automatically handles traffic that comes to the Rune service with a load balancer it manages on DigitalOcean.

All of that for only 11 lines of YAML config? Yup, Kubernetes is pretty great. If I wanted to run another service or tweak the scaling on anything, it's a simple config change.
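For the "report ready" part to work, each pod needs a readiness probe so Kubernetes knows when it can receive traffic. A sketch of what that can look like (the endpoint path and port here are assumptions, not our real config):

```yaml
# Hypothetical readiness probe: Kubernetes only routes traffic to a pod once
# this HTTP check passes, and minReadySeconds counts from that point.
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```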

File Storage

Since all of our code runs in containers and has to be stateless in order to run in parallel, it means we can't really use the local file system much. There's always block storage or remote file systems, but for our needs, the easiest thing is using "Object Storage" (like Amazon S3). We chose DigitalOcean Spaces for this.

Components

Frontend

We chose Next.js for the frontend and have it as its own dedicated server. While this could be a full stack service, we wanted most of the logic to be in a different technology and not so coupled to the frontend. We tend to only run 1 Next.js service at a time. Deployments will briefly have 2 running so that there is zero downtime.

We wanted to write the UI using React and Material UI, and Next.js seemed like a very attractive framework for those technologies. We really liked its ability to combine different forms of HTML generation: client-side rendering, server-side rendering, and especially Incremental Static Regeneration. We wanted something with strong search engine optimization while still benefiting from the productivity of a JavaScript framework.

Backend ("Rune" Service)

This is our distributed monolith web application. It's coded entirely in Kotlin using Vert.x as the webserver. We chose this combination because of Kotlin's fantastic syntax and coroutines, and Vert.x because it is the most mature non-blocking JVM webserver that supports Kotlin coroutines. A non-blocking webserver performs really well under lots of bursty small requests, which is exactly how Tribot tends to scale up when you guys are botting a lot.

We also added jOOQ as our database access library. It generates a lot of the boilerplate for us and lets us avoid writing raw SQL strings, which are prone to bugs as the database changes.

Hasura

We host a couple of replicated Hasura instances in our k8s cluster. This was added near the end of development and basically acts as a catch-all for interacting with our database. It's extremely productive for advanced queries.

It was getting very time consuming manually implementing everything the frontend needs in the Rune service. Hasura sped up the end of development significantly. Had I found it sooner, I might have used it for many more database interactions.
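To give a flavor of why it's so productive: Hasura exposes the database as a GraphQL API, so a query like the one below needs no hand-written endpoint in Rune. The table and column names here are hypothetical, not our real schema:

```graphql
# Hypothetical query: a user's ten most recent orders with their line items,
# served by Hasura directly from Postgres with no custom backend code.
query RecentOrders($userId: bigint!) {
  orders(
    where: { user_id: { _eq: $userId } }
    order_by: { created_at: desc }
    limit: 10
  ) {
    id
    total
    order_items {
      script_id
      quantity
    }
  }
}
```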

Cron Trigger

We need to run code periodically for things like cleaning up unpaid orders, setting forum ranks, etc. On a normal single webserver, you might just create a cron job. Well, in a distributed system, that's actually not an easy task. There is no central "server", and the web services can arbitrarily get killed and started. At any given time there can be many instances of each service running, and we don't want the job running on all of them, but rather just one of them.

I developed a tiny separate app using the Rust language and have it scheduled to execute with Kubernetes' built-in CronJob scheduler. All it does is take in a string argument representing an HTTP endpoint and then call it (with some retry logic).
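The core of that trigger is tiny. Here's a sketch of the retry idea in Kotlin (the real app is Rust; Kotlin is used here for consistency with the rest of this post's examples, and the attempt count and backoff are assumptions):

```kotlin
// Sketch of the cron trigger's retry logic: try the call up to maxAttempts
// times, with a short linear backoff between failures, rethrowing at the end.
fun <T> withRetry(maxAttempts: Int, call: () -> T): T {
    var last: Exception? = null
    repeat(maxAttempts) { attempt ->
        try {
            return call()
        } catch (e: Exception) {
            last = e
            Thread.sleep(100L * (attempt + 1)) // back off a little more each time
        }
    }
    throw last ?: IllegalStateException("no attempts made")
}
```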

So now we can just run "cron jobs" that call our Rune service for functionality using a simple config like:

spec:
  schedule: "*/5 * * * *" # every 5 minutes
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 2
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: cron-trigger
              image: ...
              imagePullPolicy: IfNotPresent
              args: ["http://[rune-service]/endpoint-to-call"]
          restartPolicy: OnFailure

And the logic of the cron can live in "Rune".

Traefik (Reverse Proxy / Load Balancer)

Whenever the Tribot client or website frontend code calls Tribot's servers, it actually hits this component first. Traefik is a nice little webserver that lets us describe which service a request should go to. So if you're calling a specific URL, we can fulfill the request with any service we want. It also lets us run many replicas of the same service, because it will distribute the requests evenly to each of them.

A reverse proxy / load balancer is a core component of any distributed system.
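As an illustration, here's a sketch of Traefik dynamic configuration in its file-provider style. The hostnames, ports, and service names are made up, not our real routing rules:

```yaml
# Hypothetical Traefik routing: /api traffic goes to the Rune replicas
# (load-balanced), everything else to the Next.js frontend.
http:
  routers:
    rune:
      rule: "Host(`example.com`) && PathPrefix(`/api`)"
      service: rune
    frontend:
      rule: "Host(`example.com`)"
      service: frontend
  services:
    rune:
      loadBalancer:
        servers:
          - url: "http://rune-1:8080"
          - url: "http://rune-2:8080"
    frontend:
      loadBalancer:
        servers:
          - url: "http://frontend:3000"
```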

Repo Compiler

When scripters upload scripts, they upload source code. We like to compile scripts on our side to ensure consistency, compatibility, and safety. This application is pretty simple. It calls Rune to ask for a submitted upload to compile. When Rune responds with the download, it also marks the job as "pending".

The compiler will spin up a Gradle process to compile the script using its own Gradle template. Once the compiled artifact is done, it uploads it back to Rune, where it gets stored in object storage, and Rune marks the job as complete. If the compile failed, the compiler tells Rune why, and Rune marks the job as failed and lets the scripter know.

Rune handles silent failures by treating any jobs pending over 15 minutes as available to recompile.
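That claimable-job rule can be sketched like this. The state names, fields, and types are hypothetical, not Rune's real schema; the point is just that "pending for over 15 minutes" makes a job available again:

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical compile-job lifecycle: SUBMITTED jobs are always claimable,
// and PENDING jobs become claimable again after a silent-failure timeout.
enum class JobState { SUBMITTED, PENDING, COMPLETE, FAILED }

data class CompileJob(val id: Long, val state: JobState, val claimedAt: Instant? = null)

val PENDING_TIMEOUT: Duration = Duration.ofMinutes(15)

fun isClaimable(job: CompileJob, now: Instant): Boolean = when (job.state) {
    JobState.SUBMITTED -> true
    JobState.PENDING ->
        // Claimed long ago with no result: assume the compiler died silently.
        job.claimedAt?.let { Duration.between(it, now) > PENDING_TIMEOUT } ?: true
    else -> false
}
```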

We actually don't host this in our Kubernetes cluster. Script compilation is a resource hog (Dentist's AIO Account Builder is massive, for example). I self-host this component at home, though if I need to, I can easily throw it onto any decently-specced VPS without issue.

Putting it all together

[Architecture diagram]

This architecture gives us a highly reliable system: self-healing, highly available, with zero-downtime deployments, all while being efficient enough that it doesn't cost much at all. And while we don't really need horizontal scalability, this system gives it to us with a simple config change.

This definitely isn't everything there is to Tribot's backend. We have metric servers, Cloudflare configurations, Auth0 rules/hooks, and there's also the forums which run on an entirely different system with integrations to this one.

If anyone is interested in this kind of thing, let me know in the comments if you have feedback, questions, or requests for more topics on our work in this area, such as how we actually managed to migrate all of our data with zero downtime, or how we coded any particular system.

Really interesting read. Loving the blog posts!

