ORCID at Scale: Improving our own Infrastructure

By Will Simpson

Are you interested in learning how we host the ORCID Registry and APIs? Would you like to know how we handle high availability, scalability, and recovery in the event of a disaster? If so, then this post is for you!

We handle eight million page views each month on the ORCID Registry, but the bulk of our traffic is on the APIs, which currently receive over 100 million hits per month. One of our core strategies is to invest in developing a robust information infrastructure, so we need to be confident that the technology we use to support this usage is reliable and secure.

The Registry and the rest of the website on orcid.org are routed through a Content Delivery Network (CDN) -- a cloud service provider that has 150+ datacenters around the world. When your browser connects to orcid.org, the static parts of the site are served from a local datacenter near you, to enable faster load times.

The CDN has some other useful features, such as protection against distributed denial of service (DDoS) attacks, and real-time security scanning against hacking threats.

The Registry pages are hosted at our main datacenter, where traffic is load-balanced across a cluster of app servers, while the Registry data are stored in a cluster of three powerful database servers, on encrypted file systems. One is a master database, where updates are made, and two are replica servers, which receive a copy of the data in real time. The replica servers handle most of the “read” operations of the Registry and APIs, but they are also hot standby servers, meaning they can be promoted to master in the event of a failure.
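The post doesn't name our database engine, but as a rough sketch, a master/replica setup with hot standbys typically looks like the following PostgreSQL-style streaming replication configuration (the hostnames, user, and values here are illustrative assumptions, not our actual settings):

```ini
# Illustrative primary (master) settings -- postgresql.conf
wal_level = replica            # ship the write-ahead log to replicas
max_wal_senders = 5            # allow replica connections

# Illustrative replica settings: a standby.signal file is present, plus
primary_conninfo = 'host=primary.example.org user=replicator'
hot_standby = on               # serve read-only queries while replicating
```

With `hot_standby = on`, the replicas can absorb read traffic while staying current enough to be promoted if the primary fails.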

We have an assortment of other servers supporting the production system, which shuffle data around to build search indexes, keep an up-to-date dump of the public data in a different datacenter, and run scheduled tasks such as email reminders.

We automatically back up the database twice daily, encrypt the dump, and push it to another cloud service provider on a different continent, so that in the event of a disaster at our main datacenter we can use the backup to restore the system. We regularly test that this process works using a temporary offline server.
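As a sketch of what a twice-daily dump-encrypt-push pipeline can look like, here is an illustrative cron entry (the database name, GPG recipient, bucket, and tooling are all hypothetical assumptions; our actual scripts differ):

```cron
# Illustrative: dump at 02:00 and 14:00, encrypt, stream to offsite storage
# (database, key, and bucket names are hypothetical)
0 2,14 * * *  pg_dump registry | gpg --encrypt -r backup@example.org | aws s3 cp - s3://offsite-backups/registry-$(date +\%F-\%H).sql.gpg
```

Streaming the dump through the encryption step means the plaintext never touches disk before it leaves the datacenter.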

This is a solid base. However, ORCID keeps growing, and we are increasingly relied upon as part of the research information infrastructure, so we need to do more to ensure the community can continue to depend on us.

What would we like to improve?

We’d like to have app servers and database replicas in multiple locations, so that we don’t have to rely on the somewhat lengthy database restore process, or lose data since the last backup. We’d like to be able to provision new servers in a matter of minutes, rather than hours, in case of a sudden increase in demand.

We are considering separating the most critical parts of the system, such as registration, sign in, and authorization, into an isolated system, and we would also like to ensure that public API traffic problems do not impact the Registry and Member APIs.

And we’d like a more flexible architecture using industry standard technologies such as Docker containers and Kubernetes, which will help us to make the improvements mentioned above.
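To give a flavor of how Kubernetes supports the minutes-not-hours provisioning goal, here is an illustrative Deployment manifest for a containerized app server (the image name, labels, and replica count are hypothetical, not our actual configuration):

```yaml
# Illustrative Kubernetes Deployment for a Registry app server
apiVersion: apps/v1
kind: Deployment
metadata:
  name: registry-app
spec:
  replicas: 3                  # scale out with: kubectl scale deployment registry-app --replicas=10
  selector:
    matchLabels:
      app: registry
  template:
    metadata:
      labels:
        app: registry
    spec:
      containers:
        - name: registry
          image: example.org/orcid/registry:latest   # hypothetical image
          ports:
            - containerPort: 8080
```

Because each app server is a container built from the same image, adding capacity becomes a one-line change to the replica count rather than hours of manual server setup.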

Let us know what you think about our plans! How do we compare with your own organization and other services you rely on? Is there more we could or should be doing? Do you have any advice for us based on your own experience? Contact us with your input and feedback!