Containerization
Last updated
I have seen application containerization solutions mentioned for a while now, and I wanted to look into a couple of them for breach infrastructure, especially for breach data indexing (Apache Solr). From an outside perspective, it seems like a good way to have lightweight systems that can scale up if required. The title, however, alludes to containerizing solutions in general, which I will use as an umbrella term that includes virtual machines as well. I will be basing this blog on a Linux OS, but the tools work on Windows and macOS as well.
This blog will discuss 3 different solutions:
Virtual Machines
Podman (similar to Docker)
MicroK8s (Kubernetes, which is similar to Docker Swarm)
When it comes to virtual machines, there are two main types of hypervisors: Type 1 (VMware ESXi, Microsoft Hyper-V, KVM) and Type 2 (Oracle VM VirtualBox, VMware Workstation, Microsoft Virtual PC). I would highly recommend planning your breach data environment prior to choosing a type of hypervisor because as your data and environment changes, it becomes a bit more tedious to move from one type to another. As such, I would recommend Type 2 for hobbyists, while for organizations, I would recommend Type 1. There are a lot of online tutorials that discuss the benefits of Type 1 in comparison to Type 2 and vice versa, so I will not discuss those here. In this section, I will discuss the following:
Operating System (OS) Recommendations
Scripting with Bash and Ansible
VM Escape
Virtual machines are great for having a separate, virtual system that segments your project from the host OS. As you download different types of breaches, there is always a possibility that a file is infected with malware. Having a separate system allows us to mitigate this situation with a high success rate. The steps for setting up a VM have been discussed a lot online, so I chose not to include them here. However, the steps should be similar to what is listed here: . The steps to download Solr are in the docs: . These are the steps I use when downloading Solr from source.
When it comes to OSes to put on a virtual machine, I always recommend Linux (Debian-based) as it is the most flexible for most things (excluding video games). You have a package manager at your fingertips that allows for easy installation of your applications. The OS itself is lighter weight compared to macOS and Windows. You do have to apply security and privacy hardening on Linux for it to be secure and private, but its less bloated system makes that easier to do. There is minimal OS telemetry, and it is highly customizable. All of this comes with a learning curve, which is not steep (but can definitely get steep for things like OS hardening, compilation flags, etc.). With that said, for breach data environments, as of writing this blog (July 2024), I would recommend the following Linux OSes at what I consider different Linux learning tiers:
Debian (Novice)
Kicksecure
Fedora Silverblue (Advanced Beginner)
Once the OS is set up, there are automated and repeatable ways to configure your environment. I have had great success with both Bash and Ansible, so I will mention them here. In terms of automating breach data infrastructure, I would recommend having a script/playbook that does the following:
Updates and upgrades the system
Updates and upgrades the packages/applications
Installs applications from package managers (Apt packages, Flatpaks, Snaps, etc.)
Installs apps from source (GitHub, GitLab, etc.)
Downloads and/or edits any configuration files
An example of this can be the following:
Updating and upgrading the system using apt
Installing ripgrep using apt
Downloading and Installing Apache Solr from source
Editing Solr configuration files based on your requirements
I, unfortunately, do not have any Bash scripts or Ansible playbooks that are specifically for breach data. If I end up working on those and if I remember to, I will update this blog with that information.
Bash is a command interpreter and a programming language for Linux [2]. Scripts can be written to automate command execution on your system. For example, if you want to have the same configuration on multiple machines, you would use Bash. See the following as an example of what a script can look like:
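Here is a minimal sketch of such a script, covering the steps listed earlier (update the system, install a package, grab Solr, start it). The Solr version, download URL, and package names are assumptions; check the Solr docs for the current release. The heredoc writes the script to `setup.sh`, and the last line only checks its syntax without running it:

```shell
cat > setup.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail   # stop on the first error instead of carrying on

# Update and upgrade the system and its packages
sudo apt update && sudo apt upgrade -y

# Install applications from the package manager
sudo apt install -y ripgrep default-jre

# Download and extract Apache Solr (version is an assumption)
SOLR_VERSION="9.6.1"
wget "https://archive.apache.org/dist/solr/solr/${SOLR_VERSION}/solr-${SOLR_VERSION}.tgz"
tar xzf "solr-${SOLR_VERSION}.tgz"

# Edit configuration files to taste here, then start Solr
"solr-${SOLR_VERSION}/bin/solr" start
EOF

# Sanity-check the script without executing it
bash -n setup.sh && echo "setup.sh OK"
```

In practice you would run `bash setup.sh` on the fresh VM; every machine you run it on ends up with the same configuration.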
Scripting like this allows for repeated customization of your system. Simply have the script on the machine and run it. It will go line by line and execute the commands on your system. This is great for having a running document (the script) that you can rely on to set up your system. Although Bash is good for basic tasks, I would recommend Ansible more, as it has validation built in and notifies you when errors are found.
TLDR: Think of a Bash script as UDP and an Ansible playbook as TCP: Ansible checks that commands are executed properly and that the output is as expected, while Bash natively does not.
Check the file extension to make sure it isn't anything suspicious. No breach file should have a ".exe" extension. It is breach data, not a cracked video game
Run `head filename`. This allows you to see the first 10 lines of the file. You can then verify if this is an executable file or breach data
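Those checks can be done in a few commands. The filename below is just an example standing in for whatever you downloaded; `od` is added here as an extra (an assumption, not a step from the list above) to show the file's leading bytes:

```shell
# Create a small example file to demonstrate on (in practice, run these
# against the breach file you actually downloaded)
printf 'email,password\nuser@example.com,hunter2\n' > breach_dump.txt

head breach_dump.txt              # first 10 lines: readable text or binary junk?
od -An -tx1 -N4 breach_dump.txt   # leading bytes: 4d 5a (MZ) or 7f 45 4c 46 (ELF) would mean an executable
sha256sum breach_dump.txt         # hash you can look up on VirusTotal
```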
The following is a list of commands you can use to get Apache Solr up and running. (This is not a script, but it could easily be modified into one, provided your file paths match.)
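A sketch of what those commands can look like (the image tag, host folder, and core name are assumptions; adjust them to your setup). They are collected into a file here so the block can be pasted and syntax-checked; in practice, run the commands one at a time:

```shell
cat > solr-podman.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Pull the official Solr image from Docker Hub
podman pull docker.io/library/solr:9

# Folder on the host where Solr keeps its data (and your indexed breaches)
mkdir -p ~/breach-solr-data

# Run Solr detached, publishing the admin UI on port 8983
# (:Z relabels the volume for SELinux systems; drop it if not applicable)
podman run -d --name breach-solr -p 8983:8983 \
  -v ~/breach-solr-data:/var/solr:Z docker.io/library/solr:9

# Create a core (index) called "breach"
podman exec breach-solr solr create_core -c breach
EOF

# Sanity-check without executing
bash -n solr-podman.sh && echo "solr-podman.sh OK"
```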
Just like that, you have Apache Solr set up and ready to go in a containerized environment. It should be available at http://localhost:8983
The following commands are just for maintenance:
Kubernetes is similar to Docker Swarm in that both solutions allow you to scale your containers to meet demand. This works well for high-demand systems where resource-intensive tasks are conducted. Think of Kubernetes as a way to replicate an image across multiple Podman containers and scale it up to balance your workload.
So why MicroK8s? I can only test on the hardware/software that I have. Kubernetes (and Docker Swarm) relies on having multiple machines where you can create multiple nodes that work simultaneously. Can I set up a network of VMs running simultaneously to create this network of machines? I can. Do I have the resources for that on my machine? Not at this moment. The good thing is that, while Kubernetes is meant for multiple machines, MicroK8s allows you to do all that work at a small scale on one machine [5][6]. Since the commands are similar for both, this should not be a problem when setting up a bigger system.
Create the following files first, so that you can run the commands back to back. On Linux, you can install gedit (`sudo apt install gedit -y`) or use a command-line editor like `nano` or `vim` to edit files. I would recommend creating a folder (`mkdir breach-YAML`) where you can store the following files. This makes deployment much easier, since we can point Kubernetes to read a folder instead of 4 different files.
We will start off by creating a persistent volume. This will allow us to create a volume location on our local machine that Kubernetes via MicroK8s will recognize.
`breach-pv.yaml`:
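A minimal sketch of what this file can look like. The resource names are assumptions; the 10Gi capacity, `/mnt/test/` path, and `ReadOnlyMany` access mode follow the discussion in this blog, so adjust them to your environment:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: breach-pv
spec:
  storageClassName: manual   # the claim must request the same class
  capacity:
    storage: 10Gi            # total capacity available to claims
  accessModes:
    - ReadOnlyMany           # many nodes may mount this read-only
  hostPath:
    path: /mnt/test/         # folder on the local machine
```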
In my testing, instead of `/mnt/test/`, Kubernetes mounted to a sub-folder where snap applications store their files. To mitigate this, I went into a pod (after the infrastructure was up and running; after deployment) and created a file called "hello.txt" in `/opt/solr/server/solr/mydata`. I then ran `find / -iname "hello.txt"` on my local machine to see where this file was created. That is the location you will need to upload your breach data to for it to be readable by Kubernetes.
We then create a persistent volume claim. This allows each pod to "request" capacity from the persistent volume. NOTE: The persistent volume claim cannot exceed the storage size of the persistent volume. Ex: If you have 10Gi of storage capacity in the volume, keep the storage <= 10Gi in the volume claim.
`breach-pvc.yaml`:
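A minimal sketch of the claim (names are assumptions; the access mode and 10Gi request follow the discussion in this blog):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: breach-pvc
spec:
  storageClassName: manual   # must match the persistent volume's class
  accessModes:
    - ReadOnlyMany           # mounted read-only by many nodes
  resources:
    requests:
      storage: 10Gi          # must not exceed the volume's capacity
```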
I initially chose the access mode `ReadWriteMany`. This means "the volume can be mounted as read-write by many nodes" [16]. However, on reading this portion of the blog again and reviewing my work, I would recommend `ReadOnlyMany` - "the volume can be mounted as read-only by many nodes" [16]. As of July 10, the code above reflects `ReadOnlyMany`. I believe the nodes should only be able to read the breach data and not write to the volume. If you believe otherwise, feel free to use `ReadWriteMany`.
This is the main file where we mention the image we plan to use: Apache Solr. In this file, you also connect the persistent volume claim to the image. This way you can connect your host to the pods without having to run additional commands.
`breach-deployment.yaml`:
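A minimal sketch of the deployment. The replica count, labels, and image tag are assumptions; the mount path `/opt/solr/server/solr/mydata` is the one used elsewhere in this blog:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: breach-deployment
spec:
  replicas: 3                  # number of Solr pods; an assumption
  selector:
    matchLabels:
      app: breach-solr
  template:
    metadata:
      labels:
        app: breach-solr       # the service selects pods by this label
    spec:
      containers:
        - name: solr
          image: solr:9        # Apache Solr image; tag is an assumption
          ports:
            - containerPort: 8983
          volumeMounts:
            - name: breach-data
              mountPath: /opt/solr/server/solr/mydata
      volumes:
        - name: breach-data
          persistentVolumeClaim:
            claimName: breach-pvc   # connects the claim to the pods
```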
The last part we need is to load balance our setup. If this ends up getting big enough that multiple teams need to access breach data quickly, we need a way to serve all of those users without having them visit another endpoint.
`breach-service.yaml`:
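A minimal sketch of the service (names and labels are assumptions; 8983 is Solr's default port):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: breach-service
spec:
  type: LoadBalancer       # exposes one address that balances across pods
  selector:
    app: breach-solr       # must match the deployment's pod labels
  ports:
    - port: 8983           # port exposed by the service
      targetPort: 8983     # Solr's port inside the pods
```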
I tried combining all of the YAML information into one file, and it did not work. Kubernetes only ran one of the four YAML configs in the large file, so it needs to be broken down into separate files. However, if you just want all the configs in one location, see the following:
The IP address is for the load balancer, but since only one pod has the information and the others don't, we start getting errors when it queries the other pods. To mitigate this, I used the following shell script.
`script.sh`:
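A minimal sketch of what this script can look like: it runs the given command inside every Solr pod so they all end up with the same change. The `app=breach-solr` label selector is an assumption; match it to your deployment's pod labels. The heredoc writes the script to disk, and the last line only checks its syntax:

```shell
cat > script.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Run the command given as $1 inside every pod matching the label
for pod in $(microk8s kubectl get pods -l app=breach-solr -o name); do
  echo "Running on ${pod}..."
  microk8s kubectl exec "${pod}" -- sh -c "$1"
done
EOF

# Sanity-check without executing
bash -n script.sh && echo "script.sh OK"
```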
To run commands, you would run the script with the command right after it. For example: `bash script.sh "./bin/solr create_core -c breach"`. If you were using Kubernetes instead of MicroK8s, you would remove `microk8s` from the script.
The following are the commands that I used to set up the Solr system.
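A sketch of what that setup can look like (the snap-based install and the `breach-YAML` folder from earlier are assumptions; addons such as DNS may differ per MicroK8s version). Collected into a file so the block can be pasted and syntax-checked; run the commands one at a time in practice:

```shell
cat > microk8s-setup.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Install MicroK8s and wait until it reports ready
sudo snap install microk8s --classic
microk8s status --wait-ready

# Enable cluster DNS
microk8s enable dns

# Apply all four YAML files from the breach-YAML folder at once
microk8s kubectl apply -f breach-YAML/

# Watch the pods come up, then find the service's address and port
microk8s kubectl get pods
microk8s kubectl get services
EOF

# Sanity-check without executing
bash -n microk8s-setup.sh && echo "microk8s-setup.sh OK"
```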
I have recorded a video where I run these commands and show the output to some extent. I do understand this blog is a bit more technical and more text-forward than my other blogs.
My recommendation for when to use which "container" depends on the situation. If you are a hobbyist, I would recommend a virtual machine. It is straightforward to set up, and you don't have to worry about learning a whole new technology or mitigating container issues. Virtual machines (if you do not connect the VM to the host) allow you to segment the files in the virtual machine, preventing potential malware (to a high extent) from getting onto your host machine. If you want something more lightweight, then I can recommend Podman or Docker. These are lighter than virtual machines and take less work to set up when it comes to applications. However, with containers you have to understand how ports, volumes, etc. work in order to get your application into the state you want. Finally, Kubernetes (by itself, and at a MicroK8s level - including minikube, Kind, etc.) is what I would recommend for a team of users and for corporate environments. It is great for load balancing and high availability for your applications, and a hobbyist will most likely not deal with load or availability issues. If your team is leveraging Docker already, then Docker Swarm might be a better fit for you.
I ran into a lot of errors trying to set up minikube and Kind. MicroK8s ended up working for me, but I still wanted to share the errors I encountered:
Docker rootless and Docker rootful both led to the same error: when I SSH into the pod, it asks me if the Docker daemon is running...which minikube should have taken care of when initializing
With Kind I couldn't add Docker images from Docker Hub. I was limited to local images only
minikube with Docker ran into misconfiguration issues without me tweaking any settings. It worked once, but then it just stopped working after that
I used this blog as an excuse to play around with Kubernetes, as I wanted to see how container orchestration can be used for bigger breach data environments. I spent a lot of time trying out Kubernetes at a small scale with minikube and then Kind (for a little bit). MicroK8s just seemed to have an easier setup and a friendlier tutorial to go with it. No driver configuration or anything needed. All that to say, I did learn a lot of Kubernetes while writing this blog. I would definitely recommend MicroK8s and Kubernetes to those who have higher loads and need high availability for their breach environments. I do hope my setup is beneficial to others in a similar situation.
https://aws.amazon.com/compare/the-difference-between-type-1-and-type-2-hypervisors/
https://www.gnu.org/software/bash/manual/bash.html#What-is-a-shell_003f
https://www.redhat.com/en/topics/containers/what-is-podman
https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md
https://stackoverflow.com/a/52027977 or https://stackoverflow.com/questions/52024961/difference-between-minikube-kubernetes-docker-compose-docker-swarm-etc
https://stackshare.io/stackups/kubernetes-vs-minikube
https://askubuntu.com/questions/40011/how-to-let-dpkg-i-install-dependencies-for-me
https://kubernetes.io/docs/tutorials/hello-minikube/
https://dev.to/donhadley22/deploying-a-simple-application-in-a-container-with-minikube-in-a-docker-runtime-3en8
https://adamtheautomator.com/apache-solr-tutorials/
https://stackoverflow.com/questions/61144060/minikube-driver-podman-has-anyone-been-able-to-get-it-to-work
https://minikube.sigs.k8s.io/docs/drivers/podman/
https://kubernetes.io/docs/tasks/tools/install-kubectl-linux/
https://iximiuz.com/en/posts/kubernetes-kind-load-docker-image/
https://kubernetes.io/docs/tasks/configure-pod-container/configure-persistent-volume-storage/
https://kubernetes.io/docs/concepts/storage/persistent-volumes/
is an entry-level operating system in the Linux ecosystem, although some consider Linux Mint to be that. Linux Mint (and Ubuntu) is based on Debian. I recommend this OS because installing it and using it out of the box is really easy. You simply follow the installation prompts and the OS is good to go. For installing packages, the `apt` package manager gives you most tools, and the Flatpak and Snap managers are one command away from being installed. There are a lot of tutorials and documentation for Debian in case one gets stuck, which is why I have it listed at the Novice level. If you get comfortable with Debian by itself, I highly recommend moving on to , which is hardened Debian.
is a semi-rolling-release, Fedora-based atomic/immutable distribution. I mention this in my blog due to my strong belief in having backups, which is what the atomic/immutable aspect allows you to have. When updating your system, if the update fails and your OS gets misconfigured, you can roll the OS back to a previous snapshot. This way you will always have a working OS. Working with breach data and setting up the infrastructure takes time, and in my opinion, having a backup, even at the OS level, is needed. Silverblue also uses , which allows you to segment projects on your system. This would allow you to try other configurations in another project without affecting your main working project. Silverblue uses GNOME as the desktop; however, if you prefer KDE, Sway, or Budgie, check out the other variants: .
Ansible is IT automation software that allows you to create playbooks dictating what you want the end result to look like. Think of a playbook as a script that calls multiple functions (tasks) in order to get a job done. Compared to Bash, Ansible does have a learning curve to it. I have written a that is meant to speed-run the learning process a bit. I won't mention all of Ansible's features here, as that might go beyond the scope of this blog. The feature worth mentioning is that a playbook repeats the same setup on all of your systems, keeping their configuration identical. In addition, it works off of the end result, making sure each specific change is made before it moves on to the next item. This is great for enterprise environments, but I would recommend it for personal use as well. I would recommend reading the as a starting point to get up to speed on the technology and its benefits. If you wanted a playbook template to edit and use, I have a couple as well: .
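A minimal sketch of what a playbook can look like, mirroring the apt steps from the Bash section (the `hosts` group and package name are assumptions; point them at your own inventory):

```yaml
# playbook.yml -- sketch: update the system and install packages
- name: Configure a breach data machine
  hosts: all            # replace with your inventory group
  become: true          # run tasks with sudo
  tasks:
    - name: Update the apt cache and upgrade all packages
      ansible.builtin.apt:
        update_cache: true
        upgrade: dist

    - name: Install ripgrep
      ansible.builtin.apt:
        name: ripgrep
        state: present  # Ansible verifies the end state before moving on
```

You would run it with something like `ansible-playbook -i inventory playbook.yml`; if a task's end state cannot be reached, Ansible stops and reports the error instead of silently continuing.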
Breach data comes in various formats, such as SQL dumps, text files, or archival formats (.7z, .zip, etc.). It is definitely possible that any of these could contain not just malware, but malware that exploits vulnerabilities in the VM in order to escape the environment (see: ). As such, defense in depth is an important model to keep in mind when dealing with anything security related. To stay safe when dealing with breach data, I do the following:
If you are still uncertain, you can grab the SHA256 or MD5 hashes of the file and upload them to . Of course, you can also use to check this first. NOTE: MD5 and SHA-1 are known to have hash-collisions, so be cautious of this.
I was thinking of mentioning Docker here. However, I personally use . The reasoning is simple: "Podman stands out from other container engines because it's daemonless, meaning it doesn't rely on a process with root privileges to run containers" [3]. It would be unfair for me to discuss Docker when I myself do not use it or plan to use it. Most of the commands that work for Podman should work for Docker as well (you will have to change `podman` to `docker` in your commands). If running a root-privileged daemon is a risk you are willing to accept in exchange for a gentler learning curve, feel free to use Docker, as there is more support and documentation for it. The goal isn't to push an individual toward a specific solution, but toward the technology that fits their needs.
If you wanted to go with another container image, you would just pull it. You could, of course, create your own Dockerfile (which Podman can read) if you want to set up a custom breach system. Maybe you have a custom app you built for breach data; in that case, a Dockerfile would be great for setting your environment up as a container. If you wanted to connect multiple containers, you can look into as well. The popular service `search.0t.rocks` was using exactly this: . If you did want to learn Docker, the following link has some great places to start: .
Apache Solr has a dedicated page for setting up Solr with Kubernetes (the Apache Solr Operator), leveraging Helm and using ZooKeeper for load balancing. I wanted to leverage a container image and work off of that for this blog. However, if you want to try their method, feel free to check out the dedicated page:
While working on this project, I did not find a way to run commands on multiple pods. Yes, there are ways to output YAML files of container configurations, but not of what is in the container itself. For example, if I created a core (index) in Apache Solr on one pod and then pushed data to it, the other pods would not have this change. See the following images for an example of what happens when I query that indexed data (the indexed data is dummy data from ):
I ran into an issue where minikube had trouble running with Podman as the container runtime, as mentioned on . To mitigate this error, I switched from Podman to Docker to continue working on this blog. This still did not resolve anything