The Architecture of Instances

A technical deep dive into how we rebuilt instances.vantage.sh with Go and Next.js.

Author: Astrid Gealer

In July last year, we shipped the rewrite of instances.vantage.sh (known to many as EC2Instances.info). It has been a huge success. The old architecture served the site well for over a decade, but it was really showing its age by the time we tore it down. The HTML files that contained all the content had ballooned to the point where they were hard to cache and slow to load initially, and updating the site required so much reasoning around legacy code that adding new features had become more painful than rewriting the entire codebase.

That said, the codebase has gone through massive changes even post-rewrite: GCP support, currency conversion, a KV solution for storing configurations, a brand-new high-performance data collector that lets us ship faster, and much more. In this blog post, we want to touch on all of the moving parts in their current state and explain why we made the architectural decisions we did.

Collecting the Data

To show you the data on the site, we first need to download significant portions of data from the cloud providers! Before the rewrite, and for a while after it, we used a data collector written in Python. This code was ancient (for context, it mostly predated the bulk pricing API and still carried artifacts of the Python 2 to 3 migration). It was also extremely slow, generally making builds take upwards of half an hour, for a number of reasons:

  • There was no threading in the old data collector, which was written before Python even had a concept of async I/O in the standard library: The old codebase was written in a different time, when there were far fewer instance types. Now there are over 1,000 instance types, which means we need a solution that can churn through multiple regions at the same time.
  • It used AWS iterators for all pricing data: We still do this for spot pricing, where it is required. For everything else, though, the old data collector would run an iterator in each specific region. This was fragile as-is; running a staging and a production build at the same time had a strong likelihood of causing one of the builds to fail, which made iteration a much more tedious process, and it was obvious something needed to be done here. We had tried threading in the past, but factoring in iterator rate limits made it much more complex, and it would fail randomly.
  • The output was hard to reason about: Different types of data collection did not go through the same pipeline. In something like Go, the output is far clearer. Here, some outputs were plain objects, and some were generated from classes with complex data structures inside them.
  • The codebase was really showing its age: The codebase had been patched in small increments for a decade, so the technical debt that built up was significant, and adding features was not simple. Adding GCP to this codebase would have slowed things down to the point where GitHub Actions would likely have timed us out.

It was obvious we needed to change something here, and after some discussion and thought, we decided on Go. Whilst we love TypeScript, and we use it in the rest of the stack, Go was the perfect choice for this: we can build data types that act as wrappers to shape the JSON output, goroutines give us powerful parallel data fetching, and Go's memory efficiency hits a nice middle ground (we don't need all the savings of a language like Rust, but we do need good parallelization primitives without loads of complexity).

The Big Rewrite

We started rewriting the data collector in Go not long after the UI rewrite shipped. We didn't do both at once because we felt the most important priority was fixing the long-standing bugs that the old codebase's technical debt had made unsolvable; once the rewrite was out and a few early bugs were squashed, we had time to do this.

The data collector is a single Go binary that runs in the root of the repository and invokes each cloud provider module. Each cloud provider has its own module, and the modules run in parallel with each other. For both GCP and Azure, we use their APIs. Azure has a technically paginated API that seems to just return a huge JSON array of everything, which we consume (along with some legacy files that our users expect us to download), and GCP has its own APIs for pricing across regions.

When a module in this tool creates a JSON file, we compress it to .json.br and .json.gz for legacy-support reasons. In the previous tool, this was a process that took several minutes at the end of a run, but now, leveraging goroutines and the great Go implementations of both compression standards, the impact on build time is negligible.
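As a rough sketch of that compression step (file names are illustrative; the standard library has no brotli writer, so the real .json.br path would go through a third-party package in the same way):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"sync"
)

// compressGzip gzips one JSON payload in memory. The .json.br path is
// analogous, using a third-party brotli writer instead of compress/gzip.
func compressGzip(data []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, gzip.BestCompression)
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(data); err != nil {
		return nil, err
	}
	// Close flushes the remaining compressed bytes into buf.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	// Illustrative outputs from two provider modules.
	files := map[string][]byte{
		"aws.json": []byte(`{"instances":[]}`),
		"gcp.json": []byte(`{"instances":[]}`),
	}
	var wg sync.WaitGroup
	for name, data := range files {
		wg.Add(1)
		// One goroutine per file keeps the end-of-build compression cheap.
		go func(name string, data []byte) {
			defer wg.Done()
			out, err := compressGzip(data)
			if err != nil {
				panic(err)
			}
			fmt.Printf("%s.gz: %d bytes\n", name, len(out))
		}(name, data)
	}
	wg.Wait()
}
```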

Getting AWS Working Fast

AWS is by far the most complex cloud provider we handle in this tool. For all our needs, we churn through about 20GB of JSON to get all the information! Additionally, we need to pull our pricing information from 2 sources: AWS global and AWS China. AWS China, whilst offering a very similar product range and set of APIs, is a distinct service with pricing in CNY rather than USD, meaning it needs its own pathway. Luckily, AWS now offers the bulk pricing API, which lets us pull a set of static files rather than use an iterator. This is better for all parties and means we do not need to handle rate limiting as much.

Interestingly, AWS was initially complex from a "this is too fast" standpoint: Go was downloading so much data that it was OOM'ing my system just trying to run the application. Because so much of the data is functionally similar, we built a pipeline that all the services go through. It consists of the following:

  • Each consumer of the data (a service within AWS) has a channel created for it. Every channel carries a consistent data type with the region slug, the raw AWS data, and a Savings Plan function (see the next point). At this point, we create a China and a non-China channel (the main differences being currency and save path).
  • The consumer keeps receiving on the channel. It checks whether the Savings Plan function is set (not the result, just the presence of the function). If it is, we break out of the receive loop, store the Savings Plan function for the next step, and close the channel. If not, we process the data, taking care to store the mapping from instance to SKU so that we can use the SKU later.

After we have broken out of the channel loop, we call the Savings Plan functions in a for loop. Each function is created as we go through the indexes obtained via the bulk pricing API, and it is one of two things. It is either a function that quickly returns a nil map (meaning "we have iterated all regions, but there's no pricing plan"), or what we call a "blocking function": a helper that, when created, immediately spawns a goroutine to do the work and returns a function that waits for that goroutine to finish before handing back a map with Savings Plans. Since the goroutine is spawned all the way at the start, it is likely already done by the time we ask. And since we stored the SKU-to-instance mapping earlier, we can use it to apply the Savings Plans to the individual instances.
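The blocking-function half of this can be sketched as follows (the names, the SKU-to-rate map shape, and the single-use result channel are illustrative assumptions, not the real implementation):

```go
package main

import "fmt"

// savingsPlanFn returns a map of SKU -> hourly Savings Plan rate.
// A nil map means "we iterated all regions, but there's no pricing plan".
type savingsPlanFn func() map[string]float64

// blocking spawns the fetch immediately and returns a function that waits
// for it to finish before handing back the result. Note this sketch's
// returned function may only be called once (single-use channel).
func blocking(fetch func() map[string]float64) savingsPlanFn {
	done := make(chan map[string]float64, 1)
	go func() { done <- fetch() }() // starts right away, likely finished before the caller asks
	return func() map[string]float64 { return <-done }
}

// none represents the "no pricing plan" case.
func none() map[string]float64 { return nil }

func main() {
	fns := []savingsPlanFn{
		blocking(func() map[string]float64 {
			// Stand-in for the real bulk-pricing fetch.
			return map[string]float64{"ABCD1234": 0.05}
		}),
		none,
	}
	for _, fn := range fns {
		if plans := fn(); plans != nil {
			fmt.Println("got", len(plans), "savings plans")
		} else {
			fmt.Println("no savings plans")
		}
	}
}
```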

We also use this blocking-function pattern for ElastiCache parameters, because they are created fairly quickly, needed only after most of the iteration is done, absent from the bulk pricing data, and used by both global and China. For EC2 instances, we have to grab some further descriptions from the API. We use function groups to do this across multiple goroutines for different regions and then use a "slow-building map", which works as follows:

  • The slow-building map builder takes in a method with a pushChunk parameter. A goroutine is spawned to handle this; done defaults to false.
  • When Get is called on this structure, it takes a read lock and looks in the map; if the key has been set, the result is returned immediately. If not, it releases the read lock and sleeps for 10 milliseconds in a loop.
  • When the builder method returns, the map is considered done. Anything that called Get and was unable to find a key after this point (or is still looping and waiting) gets an ok of false.

This works brilliantly for EC2 since many regions only have a subset of instances, so this allows the obvious instance types to be processed rapidly.
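The steps above can be sketched as a small structure (names and the string value type are illustrative; the real implementation differs):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// slowMap sketches the "slow-building map": a builder goroutine pushes in
// chunks of results as they arrive, while readers poll Get until their key
// shows up or the builder finishes.
type slowMap struct {
	mu   sync.RWMutex
	data map[string]string
	done bool
}

func newSlowMap(build func(pushChunk func(map[string]string))) *slowMap {
	m := &slowMap{data: map[string]string{}}
	push := func(chunk map[string]string) {
		m.mu.Lock()
		for k, v := range chunk {
			m.data[k] = v
		}
		m.mu.Unlock()
	}
	go func() {
		build(push)
		m.mu.Lock()
		m.done = true // once the builder returns, the map is considered done
		m.mu.Unlock()
	}()
	return m
}

// Get returns immediately if the key is set; otherwise it sleeps 10ms and
// retries until the builder is done, at which point a missing key is ok=false.
func (m *slowMap) Get(key string) (string, bool) {
	for {
		m.mu.RLock()
		v, ok := m.data[key]
		done := m.done
		m.mu.RUnlock()
		if ok {
			return v, true
		}
		if done {
			return "", false
		}
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	m := newSlowMap(func(push func(map[string]string)) {
		push(map[string]string{"m5.large": "General purpose"})
		time.Sleep(25 * time.Millisecond)
		push(map[string]string{"p4d.24xlarge": "GPU instance"})
	})
	desc, ok := m.Get("p4d.24xlarge") // blocks briefly until the second chunk lands
	fmt.Println(desc, ok)
}
```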

This channel pattern allowed us to centralize the pipeline changes that limit how many items are processed at once. We now process everything in batches of 5 regions per service, with AWS global and China in parallel. This acts as a limiter, ensuring we aren't overloading the system with too much data at once!

The End Result

On average, the new data collector took the time to collect all the data from over 30 minutes down to 3-4 minutes (and that is with AWS China, Savings Plans, and GCP!). We also no longer get rate limited by AWS, so we can happily ship to both staging and production. For local development, if you aren't touching the data collector, we also offer a mode where you can download an archive of staging's last data collection. For our production needs, though, this has significantly changed how quickly we can ship fixes to our users.

Building the UI

The original UI was powered by injecting a ton of HTML elements into the content from the Python data collection script and then using jQuery with the DataTables plugin to manage the webpage. For a while, this worked fine. However, as AWS grew to hundreds of instances, it developed a number of major problems:

  • No modern framework doesn't mean no state; it means your state gets scattered across many different places: There is a notion that not having to manage React hooks is a good reason to use pure JS without a modern framework, but the reality in this codebase was that our state was scattered everywhere, with ad hoc hooks all over. There wasn't one component for each thing; there were fragments of what might be considered components everywhere.
  • The libraries we were using were no longer supported: DataTables and the other jQuery plugins are essentially finished software. They did what they set out to do, and now they are no longer updated. However, our needs had outgrown what these libraries were designed for. What about loading only a chunk of rows to optimize performance? What about Node support for things like newsletters? These were things we simply couldn't explore with the old libraries. For example, filters in Firefox hung the whole browser for several seconds.
  • We couldn't optimize the old website's data loading pattern: As we'll get to later, the new site has a complex loading pattern to bring data in fast without requiring too much bandwidth. This was not something we could do easily on the old site because of the technical debt built up over the years and the lack of any compile step. Importing WebAssembly to help with this would have been a huge technical burden!
  • The DX of the old site was extremely poor: Website generation was tied to the data collection Python script, so you needed the 30+ minute data build first; then, presuming your Python version lined up properly, you could build the website. You also needed the correct web server for the content, and building the SCSS was a separate step. There were many attempts to fix this over the years, but it was clearly far worse than the instant, clear development servers that engineers expect.

Due to all of this, we decided to rewrite the website. We wanted to keep the static nature of everything, because that made it far easier to reason about from an infrastructure perspective, but we wanted to move to something far newer. There are a lot of options when it comes to JavaScript frameworks, but we went with Next.js plus Tanstack libraries to manage the complex things we need (like the table structure, and virtually loading rows so we do not overwhelm the DOM). We did this for a number of reasons:

  • Next is very well supported: Next.js is used by many major websites, and App Router is now the default way to use it. We wanted a framework that could last us as long as possible, and Next.js serves that goal. The team supporting Next.js has also been fairly fast at fixing a bug we found in the framework, leading to an overall safer LRU system within it.
  • RSC's fit our needs very well even in a static context: The RSC format fits most of our needs really well and even in its static form allows us to split out the layout from the content, leading to bandwidth savings.
  • We can use React and the modern TS tooling that comes with it: We still wanted a framework to manage our table, since we had a fairly predictable but complex set of needs for this website. Luckily, Tanstack Table and Tanstack Virtual fit the bill really well.

For the OpenGraph images, we manage them outside of Next because we want a static entry point for each thing that outputs an image file. For this, we use the fantastic sharp library with a lot of logic based on the instances data to generate many images at once. The joy of modern TypeScript is that we can reuse many parts.

This is also amazing for our llms.txt generation. Since launch, instances has maintained an llms.txt file generated from the tables and datasets built for the main website. We also offer indexes that agents can use to rapidly find the data they need. The content we build here is what powers our Instances MCP Server. One limitation of Next is that you can't generate [slug].txt files statically like you can in Astro, so we do have to pollute public with these, but once everything is built, it doesn't make much of a difference.

Porting the Legacy Code to React

Porting the legacy code was a complex task. We wanted to add some new functionality when it came to table resizing and cleaning up the way the table is structured. To do this, we moved over to Tanstack Table. This allowed us to have a clean way to structure the table schema, which we can then put our renderer into. This means that we can simply pass Tanstack Table all of our content, and it will render it with the custom renderers we created. This is much cleaner than the hacks that previously had to happen to handle the HTML that was inserted.

To generate most of the columns, we used Cursor with access to the new and old codebases. This did not generate a perfect result by any means (this was a while ago; newer Claude models may well do this better now), but it did give us a good starting point, which we could add upon.

Additionally, because we are using React, we can easily pass in a state that can change. This means that when the page loads, we can load some of the content and then pull in the rest using our data loader.

We did, however, notice that this did not solve all our problems. When you opened the page, it would still jump around while the content loaded, and filters were still slow in Firefox. This is because, by default, Tanstack Table pulls every record into the DOM, which was unacceptably slow for us as the number of instances grew. Therefore, we adopted Tanstack Virtual to handle this. Tanstack Virtual is a powerful library that gives us the best of all worlds: we can sort across all records, but they do not all need to be in the DOM. Additionally, because we specify the total size of the table, we can render a mostly accurate scroll bar while the rest of the content loads.

For the state, we migrated away from DataTables. The migration was a bit complicated and required reverse engineering the state generated by the library, but once that was done, we had a migration path into the new format. In this format, we store column visibility by column key, which gives us a clean representation of how the data looks within the framework.

We used static mode in Next to handle building the content. This works well for us because we grab all the content at once (so dynamic caching isn't super useful for us), and we can handle building Next in one step, which we can then deploy anywhere (meaning we do not need something like OpenNext in the project). This has worked great for us.

Overall, porting this website to React was the correct call. It allowed us to take all our business logic and put it into encapsulated components that can be tested and updated separately, which has given us a strong track record when new issues crop up: many are solved in under an hour. The whole rewrite took a couple of months (including a lot of testing to ensure we had most legacy expectations covered) and set the website up for a good future.

The Data Loading Process

In the previous website, we loaded all the content into the HTML. For a while, this worked fine; however, it was really showing its age, and we had many HTML files well over 10MB. The TTFB was extremely poor, which isn't a good experience for our users.

Data loading was an interesting problem. We had way too much data to just put in the RSC, and serving it uncompressed was very wasteful. However, we did not want to serve it as one chunk since we wanted bite-sized files that could easily fit in any cache. Additionally, if the user was using a reverse proxy that waited for all the data to load first, we didn't want the site to be slow while waiting for the content. However, we also did not want it to have a SPA-like feel when loading the site. Therefore, for our largest tables, we do the following:

  • We split the data into 11 chunks. The first chunk is a small number of instances included inline so the page does not feel slow; the other 10 chunks hold the far larger remainder, all of which needs to be compressed. We compress the pricing data with a domain-specific compression algorithm that works well for pricing.
  • For the other 10 chunks, we encode them with msgpack (chosen for its small size and its ability to stream decode) and then compress each encode with libxz. This gets us 10 .msgpack.xz files.
  • On the client, we use useSyncExternalStore to manage the data loading. We have a React hook that caches the data loading and spawns the workers for each chunk. Loading is split across 2 workers: a deserialization worker and a decompression worker. The decompression worker sends down the data (using xz-decompress, which wraps the libxz decompression code in a simple WASM binary that the worker can run), and this is then consumed as a reader by the deserialization worker. The reason for the split is that both sides block their respective thread for a while, so doing it all in one worker meant each side slowed down the other.

For the smaller tables, we generally just push the data up as-is since this would be more of a latency hit than it’s worth. We may sometimes compress the pricing data. This all results in blazing-fast load times for all data. Our 20GB of AWS data turns into at most 10MB across 10 chunks.

The End Result

In the end, this rewrite allowed us to massively slash our time to first byte for all content on the site, vastly improving our Lighthouse score to between 90 and 100 on most runs. We no longer had to worry about the HTML going over cache limits, and it laid the groundwork for new ideas (such as GCP and currency support). Additionally, it makes updating the codebase with LLMs far easier: because everything in the codebase is more type safe, LLMs get instant feedback on whether they made mistakes. This means we can delegate bug fixes to LLMs, review them when they are done, and spend our time on other products and features that are useful to our customers.

Storing User Configurations

For a long time, your instances configuration was stored in the URL parameters. This only went so far, though: the URLs got unwieldy, and they could only store a limited amount of information. Part of your state was in local storage and part of it was in the URL parameters, which made sharing the exact state of your configuration impossible in a lot of situations. As we added more functionality, like multi-currency support, it quickly became clear we could not continue down this path.

Migrating away from this was a complex task, but the rewrite had laid a lot of the groundwork. As mentioned above, we had already migrated away from the buggy DataTables local storage entries, though we did need another migration from that format to the new way we store state in local storage. Now, whenever you make a change, a timer is started. When it runs out, a POST request is made to instanceskv, a small Cloudflare Worker that does some magic with Cloudflare's cache and KV APIs and uses a valibot schema to validate the input and turn it into a hash representing the configuration state. This hash is then stored in the URL as ?id=<id>. The state fully represents everything you see on the page and doesn't contain the weird legacy quirks of the old storage structure (comparisons did weird things with the global search in the old structure). This makes additions like currency support trivial to add to the state management.

Handling Tooling That Uses Instances Headlessly

After we shipped the instances rewrite, we also shipped newsletters.vantage.sh. This tool uses a virtual copy of instances with a configuration based on the specified KV hash (if you click a preset, it uses one we made; if you specify a URL, it pulls the configuration from the KV). It then checks for differences every hour using that configuration, and if it notices any, it mails out. React is fantastic for this too, since it runs well on the server!

To get this working, we effectively have a library that includes the ec2instances.info repository within it. When we compile the virtual instances library, it builds out an entrypoint that uses Tanstack Table outside of React to manage the table contents. Since Tanstack Table does not require a DOM, this works brilliantly. When we want to check a React element, we just render it and diff the HTML. To do this, we have a server running within Bun that handles everything required. For some of the components, we also bundle in a polyfill for Next.js behaviour that is missing in the Bun context. This lets us handle many users' configurations rapidly.

Benchmarking Instances

We recently added the ability to view and sort by instance benchmark results. This was a large undertaking, and due to the complexity of configuring it, not something we intend to support outside of Vantage, so it lives in another repository. The way we run benchmarks is the following:

  • We have a Bash script that defines how the instance image for both arm64/amd64 should be configured, ready to be benchmarked. It also copies over a Go server that sits on an HTTP port waiting for a request to run its command set (this means we do not need the overhead of an SSH connection). When a benchmark is run, this server does the following:
    • It runs the command cd /home/ec2-user/coremark && make XCFLAGS="-O2 -DMULTITHREAD=$(nproc) -DUSE_FORK" compile. This is because, to solve some previous issues, CoreMark needs to be compiled on the specific instance so that it has all of the core information it needs.
    • We run nvidia-smi --query. If this errors, we presume there are no NVIDIA GPUs. If this succeeds, we store the result for data processing later. This gives us in-depth information on the NVIDIA GPUs in the system, if there are any.
    • We run coremark in a loop until the output does not contain ERROR!. This prevents a situation that sometimes happens where it exits with code 0 but does not actually succeed.
    • We call the brilliant ghw and sysinfo libraries to get in-depth information about the system the benchmark is running on. This often includes interesting details that are not exposed by the AWS API.
    • We then run the ffmpeg benchmark. To do this, we encode the first minute of Big Buck Bunny in 1080p from the full 4K download (this is downloaded by the benchmark script) with the following flags:
      • If we detected a NVIDIA GPU: ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i /home/ec2-user/input.mp4 -ss 00:00:00 -t 00:01:00 -c:v h264_cuvid -vf scale_cuda=1920:1080 -r 30 -c:v h264_nvenc -preset p5 -cq 23 -c:a aac -b:a 192k -y /home/ec2-user/output.mp4 (this encodes the first minute, forcing ffmpeg to use CUDA acceleration)
      • If we have over 4GB RAM: ffmpeg -i /home/ec2-user/input.mp4 -ss 00:00:00 -t 00:01:00 -vf scale=1920:1080 -r 30 -c:v libx264 -preset fast -crf 23 -c:a aac -b:a 192k -y /home/ec2-user/output.mp4
      • If neither applies: This benchmark is not run because it would take so long to run that this instance is not suited for video encoding.
    • We then serialize all this data and send it back to the client.
  • We then have a Bash script that sends the provisioning script up to an instance running the server, sets up systemd, and has AWS generate an AMI from the image for both arm64/amd64. This is a fairly long process, but it only needs to be done when something changes.
  • From here, we use a script that provisions instances from the applicable AMI (amd64 or arm64). The script defaults to spot instances and will find the cheapest region (all of this matters when you benchmark ~1k instances), but it will fall back to on-demand if that is the only strategy available. For each instance, it waits for the machine to boot, waits for the HTTP server to be alive and returning a 404, then calls the benchmarking endpoint. The script writes the result to a JSON file and kills the instance as fast as possible.
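The coremark retry loop from the server steps above can be sketched like this (runUntilClean, the runner callback, and the bounded attempt count are illustrative assumptions; the real server simply shells out to coremark):

```go
package main

import (
	"fmt"
	"strings"
)

// runUntilClean retries a benchmark until its output no longer contains
// "ERROR!", guarding against runs that exit 0 without actually succeeding.
// runner is a stand-in for exec'ing coremark on the instance.
func runUntilClean(runner func() (string, error), maxAttempts int) (string, error) {
	var out string
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		out, err = runner()
		if err != nil {
			continue // transient exec failure; try again
		}
		if !strings.Contains(out, "ERROR!") {
			return out, nil
		}
	}
	return out, fmt.Errorf("benchmark still failing after %d attempts", maxAttempts)
}

func main() {
	attempts := 0
	// Simulate a benchmark that fails twice before producing clean output.
	out, err := runUntilClean(func() (string, error) {
		attempts++
		if attempts < 3 {
			return "ERROR! transient failure", nil
		}
		return "CoreMark 1.0 : 123456.78", nil
	}, 10)
	fmt.Println(out, err)
}
```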

We then have tooling that takes the output and converts it into something instances can use. After this, we commit it to the large JSON file of manually fetched data, adding to the dataset of unique data that is only available on instances.

How We Serve the Website

We previously served the website from EC2 instances running nginx. This worked well for a while, but we could certainly do better. The website was made before the days of serverless, and these days, hosting a static site on your own servers means patching them, handling scaling, and accepting a single point of failure that is far more likely to fail than a popular CDN. We picked Cloudflare to host our content because they have a fantastic CDN that reaches all parts of the world and good mechanisms for deploying our content worldwide with good caching. No matter where you are, you should get a fast response from instances.

Workers Static Assets is brilliant, and we use it extensively elsewhere, but it wasn't going to cut it here. With how many files we generate (between HTML, JSON, LLMs data, and images), we go over the 100,000-file limit, and several of our files exceed the maximum size. Because of this, we have a custom script that stores the state of everything and puts the smaller files into Workers KV for better replication, whilst ensuring the larger files go to R2. If files are deleted outside of _next, it also purges them. The CI also purges the Cloudflare cache when done. We then have a Cloudflare Worker that handles routing between the different storage solutions, plus all the Next-specific routing we need. Whilst it would be nice to use Static Assets here, this solution works pretty great, and a high cache hit rate on Cloudflare means our Workers barely get invoked!

We care deeply about the community and are immensely thankful for the trust you put in our product! We are committed to supporting this tool, and we have even more exciting sister websites coming up. Stay tuned!
