D-Day: Events from the final day of migrating from GCP to AWS

Safepay’s decision to move its infrastructure from Google Cloud Platform to Amazon Web Services marks a pivotal moment in its technological evolution. As part of a series of posts detailing this migration, this article focuses on the final day of the transition: the challenges that emerged, and the resolutions that got us to a successful cutover. Join us as we walk through this crucial juncture and share what we learned about navigating complex cloud transitions.

Why choose AWS over GCP?

Over the past 3 years, we’ve been making the most of GCP to meet our operational needs. However, as we’re on the path to securing a commercial license from the State Bank of Pakistan, we’re stepping into a new phase of expansion. This move prompted us to rethink our existing infrastructure to ensure it’s ready for the demands of scalability and security. With demand on the rise and a growing array of threats, we understand the importance of building a solid foundation.

To tackle this, we formulated a strategic plan. Transitioning to AWS plays a central role, offering us a range of managed services like RDS, EKS, SQS, SNS, and IAM. This choice isn’t just about scalability; it’s also about gaining better control over our setup, giving us the flexibility to grow. And to handle varying levels of incoming traffic, AWS’s load balancing capabilities are on our side, ensuring that users have a smooth experience.

Yet we did not stop at scalability; security was also top of mind. By fronting our services with Cloudflare’s proxy and Web Application Firewall (WAF), we added an extra layer of defense against evolving threats. Combining AWS’s managed services with Cloudflare’s security mechanisms forms the backbone of our protection strategy, letting us harden our infrastructure against potential vulnerabilities as we approach the milestone of securing our commercial license.

Preparation

In preparation for the migration of our production environment, we conducted a successful trial run by migrating our development and sandbox environments. During this preliminary phase, migrating sandbox took approximately 2 hours, providing us with valuable insights and lessons.

For the actual migration of our production environment, we meticulously planned and executed each step to ensure a smooth transition. Here’s an overview of our to-do list, along with the allocated time for each critical task:

  1. Disable all merchants from receiving payments — 15 minutes
  2. Switch DNS from GCP Cluster to a maintenance page — 5 minutes
  3. Take a database dump and upload it to a Cloud Storage bucket — 5 minutes
  4. Download database dump — 30 minutes
  5. Restore dump on AWS RDS — 50 minutes
  6. Switch DNS to AWS Cluster — 5 minutes
  7. Test all payment intents and other services — 20 minutes
  8. Enable all merchants — 15 minutes

With a well-executed plan, we anticipated completing the production migration with minimal disruption: an expected downtime of 4 hours, which included a 1.5-hour buffer.
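The 4-hour window follows directly from the checklist: the allocated task times sum to roughly 2.5 hours of planned work, and the 1.5-hour buffer brings the total to about 4 hours. A quick sanity check of that arithmetic:

```python
# Allocated minutes per task, taken from the checklist above
tasks = {
    "disable merchants": 15,
    "switch DNS to maintenance page": 5,
    "dump database and upload": 5,
    "download dump": 30,
    "restore on RDS": 50,
    "switch DNS to AWS": 5,
    "test payment intents and services": 20,
    "enable merchants": 15,
}

planned = sum(tasks.values())          # 145 minutes of planned work
buffer = 90                            # the 1.5-hour buffer
total_hours = (planned + buffer) / 60  # ~3.9 hours, rounded up to 4

print(planned, round(total_hours, 1))  # 145 3.9
```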

The day of migration

On 31st July 2023 at 11 am, we began the activity and disabled all our merchants from accepting live payments. Once we switched the DNS to the maintenance page, all our services were down. For the first 2 hours everything went smoothly, and we were able to restore the database dump. This is when we started running into a number of issues.

Google Domains does not allow CNAME records on the zone apex

As our services now ran on AWS behind a load balancer, which exposes a DNS name rather than a static IP, an A record on Google Domains wasn’t viable; we needed a CNAME record. However, a roadblock emerged: Google Domains doesn’t permit CNAMEs on the root (apex) domain.

Source: Google Support Thread
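In zone-file terms, the restriction looks like this (a BIND-style sketch; the load balancer hostname is made up for illustration). The DNS specification requires that a name carrying a CNAME have no other records, and the apex must carry SOA and NS records, which is why providers like Google Domains reject the first line:

```
getsafepay.com.      300  IN  CNAME  lb.example-aws.com.   ; rejected: CNAME at the apex
www.getsafepay.com.  300  IN  CNAME  lb.example-aws.com.   ; fine: CNAME on a subdomain
```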

To address this challenge, we transferred all DNS records from Google Domains to our pre-existing enterprise Cloudflare account. This transition took just around 10 minutes: we imported the records and established a CNAME on Cloudflare for our cluster. (The initial plan had been to migrate DNS separately, after the cluster migration.)

What sets Cloudflare apart is its CNAME flattening feature, which enabled us to create the necessary CNAME record for our root domain. This not only bypassed the limitations encountered on Google Domains but also introduced a performance enhancement. Cloudflare’s CNAME flattening optimizes DNS resolution, potentially resulting in speeds up to 30% faster.

For detailed information, refer to Cloudflare’s documentation on CNAME flattening: Cloudflare CNAME Flattening.
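Conceptually, flattening means the resolver follows the CNAME chain internally and answers with the final A record, so the apex looks like a plain A record to clients. A toy sketch of that behavior (hostnames and IP are hypothetical, and real flattening happens inside Cloudflare’s resolver, not in client code):

```python
# A miniature record store: name -> (record type, value)
records = {
    "getsafepay.com":     ("CNAME", "lb.example-aws.com"),  # apex CNAME, normally forbidden
    "lb.example-aws.com": ("A", "203.0.113.10"),
}

def flatten(name: str, table: dict) -> str:
    """Follow CNAMEs until an A record is reached, as a flattening resolver does."""
    rtype, value = table[name]
    while rtype == "CNAME":
        rtype, value = table[value]
    return value

# The client querying the apex receives an IP directly
print(flatten("getsafepay.com", records))  # 203.0.113.10
```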

Mixed content error

Amid our migration efforts, a notable challenge arose from changing the base URL from https://www.getsafepay.com to https://getsafepay.com, which triggered a series of mixed content warnings. We pointed both the root domain and the www subdomain at our cluster, and added the following annotations to our NGINX ingress controller:

```yaml
nginx.ingress.kubernetes.io/from-to-www-redirect: "true"
ingress.kubernetes.io/force-ssl-redirect: "true"
```
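For context, these annotations sit in the metadata of the Ingress object served by the NGINX ingress controller. A minimal sketch of where they live (the resource, service, and secret names here are illustrative, not our actual configuration):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: safepay-frontend            # illustrative name
  annotations:
    nginx.ingress.kubernetes.io/from-to-www-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: getsafepay.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend      # illustrative service
                port:
                  number: 80
  tls:
    - hosts:
        - getsafepay.com
        - www.getsafepay.com
      secretName: getsafepay-tls    # illustrative secret
```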

While this redirection streamlined our migration process, a complication arose. Certain third-party and internal services attempted to access the URL https://www.getsafepay.com, which was being redirected, thereby generating the mixed content issue at hand.

To resolve this, we updated our internal services to support https://getsafepay.com. For external services like Shopify, we updated our URL from the payment provider dashboard and then contacted them to approve the change. This cost us significant time, as all changes to Shopify must undergo a rigorous review process before they are approved.

DNS propagation takes up to 24 hours

While DNS propagation can take up to 24 hours to complete globally, we saw wide variation across our migrations. Our dev and sandbox environments propagated within just an hour, but during the production cutover an extended delay emerged, blocking access to our frontend services. Notably, DNS checker tools indicated successful propagation across most regions, except certain areas in Pakistan and the USA that still pointed to our old GCP nameservers.

In response, we adopted a localized workaround. To quickly restore access for our team and merchants, we pointed our machines at Google’s public DNS resolver, 8.8.8.8. This restored access to our services, yet a subsequent issue surfaced: our frontend services couldn’t communicate with one another. Through log analysis, we pinpointed the root cause: the DNS configuration our EKS cluster was using struggled to resolve our URLs.
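On a Linux workstation, switching to Google’s resolver is a small change to the resolver configuration (sketch below; on many systems this file is managed by systemd-resolved or NetworkManager, and on macOS the equivalent setting lives in Network preferences):

```
# /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
```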

Searching for a fix led us to Anushka Arora’s article on this exact problem. Following its guidance, we reconfigured CoreDNS in our EKS cluster to forward queries to Google’s DNS, which restored communication between our services.
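The CoreDNS change amounts to editing the coredns ConfigMap in kube-system so that the forward plugin points at Google’s resolvers instead of the node’s /etc/resolv.conf. A sketch of what that looks like (a typical default EKS Corefile with the one changed line; this is a generic example, not our exact configuration):

```yaml
# kubectl -n kube-system edit configmap coredns
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . 8.8.8.8 8.8.4.4   # changed: was "forward . /etc/resolv.conf"
        cache 30
        loop
        reload
        loadbalance
    }
```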

We operated with limited customer access while awaiting completion of DNS propagation within the 24-hour timeframe. For real-time updates on the availability of our services, kindly refer to our dedicated uptime page.

Conclusion

After dedicating 12 hours to monitoring and debugging, our team received a reminder of Murphy’s Law: “Anything that can go wrong will go wrong.” This lesson holds profound meaning for anyone setting out on a journey similar to ours. Challenges and unexpected glitches are an inherent part of the process, underlining the importance of having a backup plan in place. In our case, our preparedness with a readily available enterprise Cloudflare account enabled us to swiftly address the DNS situation.

In endeavors of this nature, the impact can extend to third-party integrations. Maintaining open lines of communication and keeping these parties updated about your plans can prove invaluable. We did not anticipate this, which caused us some delay.

Lastly, it’s crucial to keep in mind that certain issues will inevitably fall beyond your control, as we saw with the unpredictable DNS propagation. Even in such instances, ingenuity can come to the rescue: a workaround might not eradicate the issue, but it can go a long way toward alleviating it.

To learn more about Safepay, visit our website or follow us on LinkedIn and Instagram.

