Common issues while using Azure’s next-generation firewall


Recently I had to stand up a Next-Generation Firewall (NGF) in an Azure subscription as part of a Minimum Viable Product (MVP). This was a Palo Alto NGF, which comes with a number of templates that can help with the implementation.

I had to alter the template so the Application Gateway was not deployed. The client had decided on a standard External Load Balancer (ELB), so the additional features of an Application Gateway were not required. I then updated the parameters in the JSON file and deployed via an Azure DevOps pipeline, and after a few run-throughs in my test subscription, everything deployed successfully.

That’s fine, but after going through the configuration I realized the public IPs (PIPs) deployed as part of the template were “Basic” rather than “Standard.” When you deploy a Standard Azure Load Balancer, the SKU of the PIPs on the devices you are balancing against must match. So the PIPs were deleted and recreated as “Standard.” The Internal Load Balancer (ILB) needed the same treatment.
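
For reference, recreating a public IP with the Standard SKU is a one-liner. This is a minimal Azure CLI sketch with made-up resource names (the actual deployment was driven from PowerShell), and Standard PIPs must use static allocation:

  # Hypothetical names; Standard SKU public IPs require static allocation
  az network public-ip create \
    --resource-group rg-ngf-weu \
    --name pip-ngf-untrust-weu \
    --sku Standard \
    --allocation-method Static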

I had a PowerShell script from when I had stood up load balancers in the past, and I modified it to keep everything repeatable. There would be two NGFs in each of two regions (four NGFs in total), plus an external and an internal load balancer in each region.
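
As a rough Azure CLI equivalent of that PowerShell, one region's pair of Standard load balancers might be stood up like this. The names, VNet, subnet and private IP are assumptions for illustration only:

  # External (public) load balancer fronted by the Standard public IP created earlier
  az network lb create \
    --resource-group rg-ngf-weu \
    --name elb-ngf-weu \
    --sku Standard \
    --frontend-ip-name fe-public \
    --backend-pool-name be-ngf \
    --public-ip-address pip-ngf-untrust-weu

  # Internal load balancer with a private frontend on the trusted subnet
  az network lb create \
    --resource-group rg-ngf-weu \
    --name ilb-ngf-weu \
    --sku Standard \
    --frontend-ip-name fe-private \
    --backend-pool-name be-ngf \
    --vnet-name vnet-hub-weu \
    --subnet snet-trust \
    --private-ip-address 10.0.1.100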

A diagram from one region is shown below:

Firewall and Application Gateway for virtual networks - Azure Example Scenarios | Microsoft Docs

With all the load balancers in place, we should be able to pass traffic, right? Actually, no. Traffic didn’t seem to be passing.  An investigation revealed several gotchas.

Gotcha 1.  This wasn’t really a gotcha because I knew some Route Tables with User Defined Routing (UDR) would need to be set up. An example UDR on an internal subnet might be:


0.0.0.0/0 to Virtual Appliance, pointing at the private ILB IP address. On the DMZ In subnet, where the Palo Alto untrusted NIC sits, a UDR might be 0.0.0.0/0 to “Internet.” You should also have routes coming back the other way to the vNets. Internally you can continue to allow route propagation if ExpressRoute is in the mix, but on the firewall subnets this should be disabled; keep things tight and secure on those subnets.
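
As a sketch of those two route tables in Azure CLI, assuming the ILB's private frontend sits at 10.0.1.100 and using made-up names (each table still needs to be associated with its subnet afterwards):

  # Route table for internal spoke subnets: send everything to the ILB in front of the firewalls
  az network route-table create \
    --resource-group rg-ngf-weu \
    --name rt-spoke-internal

  az network route-table route create \
    --resource-group rg-ngf-weu \
    --route-table-name rt-spoke-internal \
    --name default-via-ngf \
    --address-prefix 0.0.0.0/0 \
    --next-hop-type VirtualAppliance \
    --next-hop-ip-address 10.0.1.100

  # Route table for the DMZ In (untrusted) subnet: default route straight to the internet,
  # with BGP/ExpressRoute route propagation disabled to keep the firewall subnets tight
  az network route-table create \
    --resource-group rg-ngf-weu \
    --name rt-dmz-in \
    --disable-bgp-route-propagation true

  az network route-table route create \
    --resource-group rg-ngf-weu \
    --route-table-name rt-dmz-in \
    --name default-to-internet \
    --address-prefix 0.0.0.0/0 \
    --next-hop-type Internet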

But still no traffic after the Route Tables were configured.

Gotcha 2. The Palo Alto firewalls have a ping utility in the GUI. Unfortunately, in the most current version of the Palo Alto firewall OS (PAN-OS 9 at the time of writing), ping doesn’t work properly. This is because the firewall interfaces are set to Dynamic Host Configuration Protocol (DHCP). I believe that, since Azure controls and hands out the IPs to the interfaces (they are effectively static), DHCP is not really required.

The way I decided to test things with this MVP, which is using a hub-and-spoke architecture, was to stand up a VM on a Non-Production Internal Spoke vNet.
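
One way to sanity-check routing from that test VM, before suspecting the firewalls, is to dump the effective routes on its NIC. The resource group and NIC name here are hypothetical:

  # Show the routes actually applied to the test VM's NIC (system routes, UDRs, propagated routes)
  az network nic show-effective-route-table \
    --resource-group rg-spoke-nonprod \
    --name nic-testvm01 \
    --output table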

Gotcha 3.  With all my UDRs set up with the load balancers and an internal VM trying to browse the internet, things were still not working. I called a Palo Alto architect for input and learned that the configuration on the firewalls was fine, but something wasn’t right with the load balancers.

At this point I was tempted to go down the Outbound Rules configuration route at the Azure CLI. I had used this before when splitting UDP and TCP Traffic to different PIPs on a Standard Load Balancer.

But I decided to take a step back and start going through the load balancer configuration. I noticed that my Health Probe was set to HTTP on port 80, carried over from that previous use.


I changed the Protocol from HTTP to TCP, still on port 80, to see if it made a difference. I did this on both the internal and external load balancers.

Hey presto, web traffic started passing. The Health Probe hadn’t liked HTTP as the protocol because an HTTP probe looks for a file and path.

Ok, well and good. I revisited the Azure Architecture Guide from Palo Alto and also discussed it with a Palo Alto architect.

They mentioned SSH – Port 22 for health probes. I changed that accordingly to see if things still worked – and they did.
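
Updating an existing probe is a single Azure CLI call per load balancer. A sketch with assumed load balancer and probe names; the same command with --port 80 reflects the change that first got traffic flowing:

  # Switch the health probe from HTTP to a plain TCP check against the firewall's SSH port
  az network lb probe update \
    --resource-group rg-ngf-weu \
    --lb-name elb-ngf-weu \
    --name hp-ngf \
    --protocol Tcp \
    --port 22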


Finding the culprit

So, the health probe was the culprit, as was I for re-using PowerShell from a previous configuration. Even then, I’m not sure my eye would have picked up HTTP 80 vs. TCP 80 the first time round. The health probe couldn’t get a response on HTTP 80 at path /, so it marked the firewalls unhealthy and effectively stopped all traffic, whereas a TCP probe doesn’t look for a path. Now we are ready to switch the Route Table UDRs to point the Production Spoke vNets to the NGF.

To sum up the three gotchas:

  1. Configure your Route Tables and UDRs.
  2. Don’t use Ping to test with Azure Load Balancers.
  3. Don’t use HTTP 80 for your Health Probe to NGFs.

Hopefully this will help circumvent some problems configuring load balancers with your NGFs when you are standing up an MVP – whatever flavour of NGF is used.

Developing for Azure autoscaling


The public cloud (i.e. AWS, Azure, etc.) is often portrayed as a panacea for all that ails on-premises solutions. And along with this “cure-all” impression are a few misconceptions about the benefits of using the public cloud.

One common misconception pertains to autoscaling, the ability to automatically scale up or down the number of compute resources being allocated to an application based on its needs at any given time. While Azure makes autoscaling much easier in certain configurations, other parts of Azure don’t support it as readily.

For example, if you look at the different App Service plan tiers, you will see the lower three tiers (Free, Shared and Basic) do not include support for autoscaling, while the higher tiers (Standard and above) do. Even then, you need to design and architect your solution to take advantage of autoscaling. The point being, just because your application is running in Azure does not necessarily mean you automatically get autoscaling.
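
If your plan is sitting in one of the lower tiers, the first step is simply moving it to Standard or above so the autoscale options become available. A quick Azure CLI sketch with a hypothetical plan name:

  # Move an App Service plan to the Standard S1 tier, which supports autoscale rules
  az appservice plan update \
    --resource-group rg-web \
    --name plan-web \
    --sku S1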

Scale out or scale in


In Azure, you can scale up vertically by changing the size of a VM, but the more popular approach is to scale out horizontally by adding more instances. Azure provides horizontal autoscaling via numerous technologies. For example, Azure Cloud Services, the legacy technology, provides autoscaling automatically at the role level. Azure Service Fabric and virtual machines implement autoscaling via virtual machine scale sets. And, as mentioned, Azure App Service has built-in autoscaling for certain tiers.

When you know ahead of time that a certain date or period (such as Black Friday) will warrant scaling out horizontally to meet anticipated peak demand, you can create a static, scheduled scale-out. That is not “auto” scaling in the true sense. True dynamic, reactive autoscaling is typically based on runtime metrics that reflect a sudden increase in demand: monitoring metrics and adjusting the instance count when a metric reaches a certain value is the traditional way to autoscale dynamically.
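
Scheduled scaling is configured as a profile on an autoscale setting. As a sketch, assuming an autoscale setting named web-autoscale already exists and using example dates, a fixed-date profile for a busy retail weekend might look like this:

  # Pin the instance count to 10 for a known busy window
  az monitor autoscale profile create \
    --resource-group rg-web \
    --autoscale-name web-autoscale \
    --name black-friday \
    --count 10 \
    --timezone "Pacific Standard Time" \
    --start 2020-11-27 \
    --end 2020-11-30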

Tools for autoscaling


Azure Monitor provides that metric monitoring along with autoscale capabilities. Azure Cloud Services, VMs, Service Fabric, and VM scale sets can all leverage Azure Monitor to trigger and manage autoscaling via rules. Typically, these scaling rules are based on memory, disk and CPU metrics.
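
For example, a simple CPU-based rule pair on a virtual machine scale set can be created with a few Azure CLI calls. The resource names and thresholds below are illustrative, not prescriptive:

  # Create the autoscale setting against a scale set, allowing 2 to 10 instances
  az monitor autoscale create \
    --resource-group rg-app \
    --resource vmss-app \
    --resource-type Microsoft.Compute/virtualMachineScaleSets \
    --name vmss-app-autoscale \
    --min-count 2 \
    --max-count 10 \
    --count 2

  # Scale out by 1 when average CPU stays above 70% for 5 minutes
  az monitor autoscale rule create \
    --resource-group rg-app \
    --autoscale-name vmss-app-autoscale \
    --condition "Percentage CPU > 70 avg 5m" \
    --scale out 1

  # Scale back in by 1 when average CPU drops below 30%
  az monitor autoscale rule create \
    --resource-group rg-app \
    --autoscale-name vmss-app-autoscale \
    --condition "Percentage CPU < 30 avg 5m" \
    --scale in 1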

For applications that require custom autoscaling, you can base it on metrics from Application Insights. When you create an Azure application that you plan to scale this way, make sure Application Insights is enabled. You can then emit a custom metric from your code and set up an autoscale rule that uses that custom metric, selecting Application Insights as the metric source in the portal.

Design considerations for autoscaling


When writing an application that you know will be auto-scaled at some point, there are a few base implementation concepts you might want to consider:

  • Use durable storage to store your shared data across instances. That way any instance can access the storage location and you don’t have instance affinity to a storage entity.
  • Seek to use only stateless services. That way you don’t have to make any assumptions on which service instance will access data or handle a message.
  • Realize that different parts of the system have different scaling requirements (which is one of the main motivators behind microservices). You should separate them into smaller discrete and independent units so they can be scaled independently.
  • Avoid any operations or tasks that are long-running. This can be facilitated by decomposing a long-running task into a group of smaller units that can be scaled as needed. You can use what’s called a Pipes and Filters pattern to convert a complex process into units that can be scaled independently.

Scaling/throttling considerations

Autoscaling can be used to keep the provisioned resources matched to user needs at any given time. But while autoscaling can trigger the provisioning of additional resources as needs dictate, this provisioning isn’t immediate. If demand increases quickly and unexpectedly, there can be a window with a resource deficit because resources cannot be provisioned fast enough.

An alternative strategy to auto-scaling is to allow applications to use resources only up to a limit and then “throttle” them when this limit is reached. Throttling may need to occur when scaling up or down since that’s the period when resources are being allocated (scale up) and released (scale down).

The system should monitor how it’s using resources so that, when usage exceeds the threshold, it can throttle requests from one or more users. This will enable the system to continue functioning and meet any service level agreements (SLAs). You need to consider throttling and scaling together when figuring out your auto-scaling architecture.

Singleton instances


Of course, autoscaling won’t do you much good if the problem you are trying to address stems from the fact that your application is built around a single cloud instance. A traditional singleton object runs counter to the multi-instance, high-scalability approach of the cloud: every client uses that same single shared instance, and a bottleneck will typically occur. Scalability suffers in this case, so try to avoid a traditional singleton instance if possible.

But if you do need a singleton object, instead create a stateful object using Service Fabric, with its state shared across all the different instances. A singleton object is defined by its single state, so we can have many instances of the object sharing that state between them. Service Fabric maintains the state automatically, so we don’t have to worry about it.

The Service Fabric object type to create is either a stateless web service or a worker service. This works like a worker role in an Azure Cloud Service.

Knowing how to use Azure Databricks and resource groupings


Azure Databricks, an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud, is a highly effective tool built on the open-source Spark engine, but it automatically creates resource groups and workspaces and protects them with a system-level lock, all of which can be confusing and frustrating unless you understand how and why.

The Databricks platform provides an interactive workspace that streamlines collaboration between data scientists, data engineers and business analysts. The Spark analytics engine supports machine learning and large-scale distributed data processing, combining many aspects of big data analysis all in one process.

Spark works on large volumes of data in either batch (data at rest) or streaming (live) processing mode. The live processing capability is how Databricks/Spark differs from Hadoop, which uses MapReduce algorithms to process only batch data.

Resource groups are key to managing the resources bound to Databricks. Typically, you specify the resource group in which your resources are created. This changes slightly when you create an Azure Databricks service instance and specify a new or existing resource group. If, for example, you create a new resource group, Azure will create the group and place a workspace within it. That workspace is an instance of the Azure Databricks service.

Along with the directly specified resource group, it will also create a second resource group. This is called a “Managed resource group” and it starts with the word “databricks.” This Azure-managed group of resources allows Azure to provide Databricks as a managed service. Initially this managed resource group will contain only a few workspace resources (a virtual network, a security group and a storage account). Later, when you create a cluster, the associated resources for that cluster will be linked to this managed resource group.
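
You can watch this happen from the command line. A sketch with hypothetical names, assuming the Azure CLI databricks extension is installed:

  # The databricks commands live in a CLI extension
  az extension add --name databricks

  # Create the workspace in a resource group you control...
  az databricks workspace create \
    --resource-group rg-analytics \
    --name dbx-workspace \
    --location westeurope \
    --sku standard

  # ...then ask Azure which managed "databricks-..." resource group was created alongside it
  az databricks workspace show \
    --resource-group rg-analytics \
    --name dbx-workspace \
    --query managedResourceGroupId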

The “databricks-xxx” resource group is locked when it is created since the resources in this group provide the Databricks service to the user. You are not able to directly delete the locked group nor directly delete the system-owned lock for that group. The only option is to delete the service, which in turn deletes the infrastructure lock.
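
You can confirm the system-owned lock from the CLI as well; the managed resource group name below is hypothetical:

  # List the lock Azure placed on the managed resource group
  az lock list \
    --resource-group databricks-rg-dbx-workspace-abc123 \
    --output table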


With respect to Azure tagging, the lock placed on that Databricks managed resource group prevents you from adding any custom tags, deleting any of the resources, or performing any write operations on a resource in the managed resource group.

Example Deployment

Let’s take a look at what happens when you create an instance of the Azure Databricks service with respect to resources and resource groups:

Steps

  1. Create an instance of the Azure Databricks service
  2. Specify the name of the workspace (here we used nwoekcmdbworkspace)
  3. Specify to create a new resource group (here we used nwoekcmdbrg) or choose an existing one
  4. Hit Create

Results

  1. Creates nwoekcmdbrg resource group
  2. Automatically creates nwoekcmdbworkspace, which is the Azure Databricks Service. This is contained within the nwoekcmdbrg resource group.
  3. Automatically creates databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km resource group. This contains a single storage account, a network security group and a virtual network.

 

Click on the workspace (Azure Databricks service), and it brings up the workspace with a “Launch Workspace” button.


Launching the workspace uses Azure Active Directory (AAD) to sign you into the Azure Databricks service. From there you can create a Databricks cluster, run queries, import data, create a table, or create a notebook to start querying, visualizing and modifying your data. I decided to create a new cluster to demonstrate where the resources for the appliance end up.


After the cluster was created, a number of resources appeared in the Azure Databricks managed resource group databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km. Instead of merely containing a single VNet, NSG and storage account as it did initially, it now contained multiple VMs, disks, network interfaces, and public IP addresses.
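
Listing the contents of the managed resource group after the cluster comes up makes this easy to see, for example:

  # Everything the cluster needs (VMs, disks, NICs, public IPs) lands in the managed resource group
  az resource list \
    --resource-group databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km \
    --output table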


The workspace nwoekcmdbworkspace and the original resource group nwoekcmdbrg both remain unchanged, as all changes are made in the managed resource group databricks-rg-nwoekcmdbworkspace-c3krtklkhw7km. If you click on “Locks,” you can see there is a read-only lock placed on it to prevent deletion. Clicking the “Delete” button yields an error saying the lock could not be deleted. If you make changes to the tags on the original resource group, they will be reflected in the “databricks-xxx” resource group. But you cannot change tag values in the databricks-xxx resource group directly.
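
So, if you need a tag on the managed resources, apply it to the parent resource group and let Azure carry it across, as described above. A small sketch using the example names from this deployment:

  # Tag the parent resource group; per the behaviour above, the tag is reflected on the databricks-xxx group
  az group update \
    --name nwoekcmdbrg \
    --set tags.CostCentre=Analytics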


Summary

When using Azure Databricks, it can be confusing when a new workspace and managed resource group just appear. Azure automatically creates a Databricks workspace, as well as a managed resource group containing all the resources needed to run the cluster. This group is protected by a system-level lock to prevent deletions and modifications. The only way to directly remove the lock is to delete the service. This can be a tremendous limitation if changes need to be made to tags in the managed resource group. However, changes made to the parent resource group are correspondingly reflected in the managed resource group.
