40+ Linux Commands for a Better Machine Learning Workflow

Linux is the backbone of many machine learning (ML) workflows. With its powerful command-line interface, Linux gives engineers the flexibility and control needed for a smooth ML experience.

Over the past decade, I’ve come to understand the significance of mastering a variety of Linux commands to boost productivity, streamline tasks, and manage resources efficiently.

Whether you’re setting up an environment, managing files, or optimizing code, Linux provides a robust toolkit to support your machine learning journey.

This article covers the essential Linux commands that every machine learning engineer should know, with explanations designed for beginners but detailed enough for experienced users.

1. Navigating the File System

A major part of working with Linux is efficiently navigating the file system. As a machine learning engineer, you’ll be constantly dealing with data files, models, code, and results. Mastering basic navigation commands is critical.

cd (Change Directory)

The cd command is used to change the current working directory, which is fundamental when moving between directories.

cd /path/to/directory

ls (List Directory Contents)

Once you’re in a directory, you can use the ls command to see what files or subdirectories are in your current location.

ls

You can use ls -l for a detailed listing or ls -a to show hidden files.

ls -l
ls -a

pwd (Print Working Directory)

Use the pwd command to display the absolute path of the current working directory, which is helpful when you need to confirm where you are in the file system.

pwd

mkdir (Make Directory)

As a machine learning engineer, you’ll need the mkdir command to create directories for different datasets, models, or experiment results.

mkdir new_directory

rm (Remove Files and Directories)

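When cleaning up your system, you may need to delete files or directories using the rm command. Deletions made with rm are permanent and do not go through a trash folder, so double-check the path before running rm -r.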

rm filename
rm -r directory_name
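
The navigation commands above can be combined into one short session. The following is a minimal sketch that runs inside a temporary directory; the project layout (data, models, results) and the scratch file name are hypothetical and chosen only for illustration.

```shell
# Work inside a throwaway directory so no real files are touched.
workdir="$(mktemp -d)"
cd "$workdir"

# -p creates any missing parent directories and does not fail if they already exist.
mkdir -p project/data/raw project/data/processed project/models project/results

# Move into the project and confirm the location.
cd project
pwd

# List the top-level layout, including hidden entries.
ls -la

# Create a scratch directory (hypothetical name), then remove it and its contents.
mkdir scratch
touch scratch/tmp_checkpoint.bin
rm -r scratch
```

Using mkdir -p for the whole layout in one call keeps the setup repeatable, since rerunning it on an existing project does not produce an error.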

2. File Management and Searching

Working with data, code, and models requires handling large numbers of files. Linux provides powerful tools for managing, searching, and manipulating files.

find (Search for Files)

find is a powerful command to search for files and directories based on specific criteria like name, type, or modification date.

find /path/to/search -name "filename"

This command searches for a file named “filename” in the specified directory and its subdirectories.
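
In practice you will often search by extension or modification time rather than by an exact name. The sketch below builds a small sample tree in a temporary directory (the file names are hypothetical) and shows the -type, -name, and -mtime tests.

```shell
# Build a small sample tree in a temporary directory (hypothetical file names).
data_dir="$(mktemp -d)"
mkdir -p "$data_dir/raw" "$data_dir/processed"
touch "$data_dir/raw/train.csv" "$data_dir/raw/test.csv" "$data_dir/processed/notes.txt"

# -type f limits matches to regular files; quoting "*.csv" stops the shell
# from expanding the wildcard before find sees it.
find "$data_dir" -type f -name "*.csv"

# -mtime -7 matches files modified within the last 7 days.
recent_count="$(find "$data_dir" -type f -mtime -7 | wc -l | tr -d ' ')"
echo "Recently modified files: $recent_count"

# Count the CSV files by piping the matches into wc.
csv_count="$(find "$data_dir" -type f -name "*.csv" | wc -l | tr -d ' ')"
echo "CSV files found: $csv_count"
```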

grep (Search Inside Files)

The grep command lets you search for patterns inside files, which is useful when you need to find a specific term in large datasets or scripts.

grep "pattern" file.txt

To search recursively within a directory, use:

grep -r "pattern" /path/to/directory
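
A common use in ML work is pulling specific lines out of a training log. The sketch below writes a small made-up log to a temporary file (the log format is an assumption for illustration) and uses the -n, -i, and -c options.

```shell
# Create a small sample training log (the contents are made up for illustration).
log_file="$(mktemp)"
printf '%s\n' \
  "epoch 1 loss=0.92" \
  "epoch 2 loss=0.71" \
  "WARNING: learning rate reduced" \
  "epoch 3 loss=0.65" > "$log_file"

# -n prefixes each matching line with its line number.
grep -n "loss" "$log_file"

# -i ignores case and -c prints the number of matching lines instead of the lines.
warning_count="$(grep -ic "warning" "$log_file")"
echo "warnings: $warning_count"
```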

cp (Copy Files)

Use the cp command to copy files and directories, which is useful when creating backups or replicating datasets.

cp source_file destination_file
cp -r source_directory destination_directory

mv (Move or Rename Files)

The mv command allows you to move files between directories or rename them.

mv old_filename new_filename
mv file_name /path/to/destination/
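
The difference between copying, renaming, and moving is easiest to see side by side. This is a small sketch in a temporary directory; the dataset and file names are hypothetical.

```shell
# Set up a tiny sample dataset in a temporary directory.
work="$(mktemp -d)"
mkdir -p "$work/dataset"
echo "id,label" > "$work/dataset/train.csv"

# -r copies the directory and everything inside it; the original stays in place.
cp -r "$work/dataset" "$work/dataset_backup"

# Renaming: the destination is a new file name in the same directory.
mv "$work/dataset/train.csv" "$work/dataset/train_v1.csv"

# Moving: the destination is a directory, so the file keeps its name.
mkdir -p "$work/archive"
mv "$work/dataset/train_v1.csv" "$work/archive/"

# Show the resulting layout.
ls -R "$work"
```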

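tar (Archive and Compress Files)

Use the tar command to bundle files such as large datasets and models into a single archive.

tar -cvf archive.tar /path/to/directory
tar -xvf archive.tar

The -c option creates the archive, -x extracts it, -v makes the operation verbose, and -f names the archive file. On its own, tar only bundles files without compressing them; add -z (for example, tar -czvf archive.tar.gz /path/to/directory) to compress the archive with gzip.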
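
As a rough sketch of a full round trip with gzip compression, the following creates a compressed archive, lists its contents, and extracts it somewhere else. The dataset directory is a hypothetical placeholder created in a temporary location.

```shell
# Prepare a small sample directory to archive.
work="$(mktemp -d)"
mkdir -p "$work/dataset"
echo "id,label" > "$work/dataset/train.csv"

# -z compresses with gzip; -C switches to $work first so the archive
# stores the relative path dataset/ instead of the full temporary path.
tar -czf "$work/dataset.tar.gz" -C "$work" dataset

# -t lists the archive contents without extracting anything.
tar -tzf "$work/dataset.tar.gz"

# Extract into a separate directory, again selected with -C.
mkdir -p "$work/restore"
tar -xzf "$work/dataset.tar.gz" -C "$work/restore"
```

Listing with -t before extracting is a cheap way to check what an unfamiliar archive will write to disk.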

chmod (Change File Permissions)

Use the chmod command to change the read, write, and execute permissions of code or scripts.

chmod 755 script.sh

This sets read, write, and execute permissions for the owner and read and execute permissions for the group and others.
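
Besides numeric modes, chmod also accepts symbolic modes, which change one permission without restating the others. The sketch below uses a temporary file as a stand-in for a script and inspects the result with ls -l.

```shell
# Create a small placeholder script in a temporary file.
script="$(mktemp)"
printf '#!/bin/sh\necho "training started"\n' > "$script"

# Start from a known state: owner read/write, group and others read-only.
chmod 644 "$script"

# Symbolic mode: add execute (x) for the owner (u) only.
chmod u+x "$script"

# The first ten characters of ls -l show the file type and permission bits.
perms="$(ls -l "$script" | cut -c1-10)"
echo "$perms"   # prints -rwxr--r--
```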

3. Linux Process Management

Managing processes is a key part of optimizing your machine learning workflow. Linux commands provide tools to monitor, control, and manage processes running on your machine.

ps (Display Running Processes)

The ps command shows a snapshot of current processes.

ps aux

To view processes related to Python, you could use:

ps aux | grep python

top (Monitor System Resources)

The top command is a real-time task manager that displays CPU, memory, and process information, which is helpful for monitoring resource usage during long-running ML tasks.

top

You can use htop, a more user-friendly version, if it’s installed.

kill (Terminate Processes)

If a process is consuming too many resources or hanging, you can terminate it using the kill command with its process ID (PID).

kill PID

You can find the PID by using ps aux or top.
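
To illustrate the life cycle, the sketch below starts a background sleep as a stand-in for a long-running training job, checks that it is alive, and then terminates it. The use of sleep is purely an assumption to keep the example harmless.

```shell
# A stand-in for a long-running training job (hypothetical).
sleep 300 &
job_pid=$!

# kill -0 sends no signal; it only checks that the process exists.
kill -0 "$job_pid" && echo "job $job_pid is running"

# Plain kill sends SIGTERM, which asks the process to shut down.
kill "$job_pid"

# wait collects the exit status; a process ended by SIGTERM reports 128 + 15 = 143.
exit_status=0
wait "$job_pid" 2>/dev/null || exit_status=$?
echo "job exited with status $exit_status"

# If a process ignores SIGTERM, kill -9 PID sends SIGKILL, which cannot be
# caught and gives the process no chance to clean up.
```

Trying SIGTERM first gives a training script the chance to flush logs or save a checkpoint, provided it handles the signal.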

nice/renice (Manage Process Priority)

When running resource-intensive tasks, like training a machine learning model, you may want to adjust process priorities by using the nice and renice commands.

nice -n 10 python train.py
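sudo renice -n -10 -p PID

nice starts a process with a specific niceness value, while renice adjusts the value for a running process. Niceness ranges from -20 (highest priority) to 19 (lowest priority), and setting a negative value requires root privileges, hence the sudo.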
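
One way to see the effect of nice is to let it report its own value. In this small sketch, nice with no arguments prints the current niceness, and the increment of 5 is an arbitrary choice for illustration.

```shell
# With no arguments, nice prints the niceness of the current shell.
base="$(nice)"

# Run a command at a niceness 5 higher (lower priority) than the current shell.
# Here the command is nice itself, so it reports the value it was started with.
adjusted="$(nice -n 5 nice)"

echo "shell niceness: $base, child niceness: $adjusted"
```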

4. Linux Resource Monitoring

Efficient resource management is crucial for machine learning tasks, as many of them are computationally expensive. Linux provides tools for monitoring your system’s performance.

free (Check Memory Usage)

When working with large datasets and models, memory usage is a common concern. The free command gives you an overview of your system’s memory status.

free -h

The -h flag makes the output human-readable (for example, showing sizes in mebibytes or gibibytes instead of raw kibibytes).

df (Disk Space Usage)

Monitoring available disk space is essential, especially when storing large datasets. The df command gives a summary of disk space usage for your mounted file systems.

df -h

iotop (Monitor Disk I/O)

If you want to monitor disk I/O, iotop can show you which processes are using the disk the most.

sudo iotop

You’ll need to run it with sudo for full access to disk information.

nvidia-smi (Monitor GPU Usage)

For machine learning engineers using NVIDIA GPUs, the nvidia-smi command provides critical information about GPU utilization, memory usage, and active processes.

nvidia-smi

It’s essential for tracking the status of your GPU during deep learning model training.

5. Linux Package Management

Linux provides package managers that help install, update, and remove software packages. As an ML engineer, you’ll be constantly installing libraries and frameworks.

apt (Debian/Ubuntu/Mint)

If you’re using a Debian-based distribution like Ubuntu, apt is your go-to tool for installing software.

sudo apt update
sudo apt install python3-pip

yum/dnf (RHEL/Rocky/Alma Linux)

For Red Hat-based distributions (like CentOS or Fedora), yum and dnf manage software packages; newer releases use dnf.

sudo yum install python3-pip
OR
sudo dnf install python3-pip

pip (Python Package Management)

Python is the language of choice for machine learning, so you’ll often use the pip command to install libraries like TensorFlow, PyTorch, or Scikit-learn.

pip install tensorflow

conda (Managing Environments and Packages)

When working with multiple Python environments, conda is an excellent tool that helps manage dependencies, libraries, and even non-Python packages.

conda create --name ml_env python=3.8
conda activate ml_env
conda install tensorflow

6. Linux Networking Commands

Machine learning engineers often work in distributed environments, which makes networking knowledge essential for tasks like data transfer, cluster management, or cloud computing.

scp (Secure Copy)

To transfer data securely between machines, use the scp command, which is particularly useful for ML engineers working on remote servers or distributed setups.

scp local_file username@remote_host:/path/to/destination

rsync (Remote Synchronization)

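rsync is another excellent tool for copying or syncing files between machines or directories. It is often faster than scp for repeated transfers because it only sends the files, or parts of files, that have changed. In the command below, -a preserves permissions and timestamps, -v is verbose, and -z compresses data in transit.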

rsync -avz /path/to/source/ username@remote_host:/path/to/destination

ssh (Secure Shell)

Securely connect to remote servers using the ssh command, which is essential for remotely executing scripts, managing models, or running experiments on cloud infrastructure.

ssh username@remote_host

7. Git for Version Control

Git is essential for managing code versions, collaborating with teams, and keeping track of changes.

git clone (Clone a Repository)

To get started with a project from GitHub, you can clone a repository.

git clone https://github.com/user/repository.git

git status (Check Repository Status)

Before committing changes, check the status of your working directory.

git status

git commit (Commit Changes)

When you’re ready to save your changes to the repository, stage them with git add and then use git commit.

git add .
git commit -m "Commit message"

These commands stage your changes and commit them with a descriptive message that explains what was modified.

git push (Push Changes)

After committing changes locally, use git push to push them to the remote repository.

git push origin branch_name

This uploads your changes to the specified branch on the remote repository (such as GitHub).

git pull (Pull Updates)

To update your local repository with the latest changes from the remote repository, use git pull.

git pull origin branch_name

This keeps your local copy in sync with the latest codebase and reduces the chance of conflicts when collaborating with teammates.

git branch (Create or List Branches)

Git branches are useful for experimenting with different features or versions of your ML model without affecting the main codebase.

git branch
git branch new_feature_branch
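
Note that git branch only creates the branch; it does not switch to it. The sketch below walks through a local-only workflow in a temporary repository, so no remote is needed. The user name, email, and file contents are placeholders.

```shell
# Create a throwaway repository so no real project is affected.
repo="$(mktemp -d)"
cd "$repo"
git init -q

# Identity used for commits in this repository only (placeholder values).
git config user.name "Example User"
git config user.email "user@example.com"

# Stage and commit a first file.
echo "print('training')" > train.py
git add train.py
git commit -q -m "Add training script"

# Create the branch, then switch to it as a separate step.
git branch new_feature_branch
git checkout -q new_feature_branch

# Confirm which branch is now checked out.
current_branch="$(git rev-parse --abbrev-ref HEAD)"
echo "on branch: $current_branch"
```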

8. Virtual Environments and Dependency Management

Managing Python environments and dependencies is crucial when working on multiple machine learning projects, each with different versions of libraries. Here are a few commands to manage virtual environments and dependencies efficiently.

Create a Virtual Environment

To create a virtual environment, you can use the following command:

python3 -m venv env_name

This sets up an isolated Python environment, preventing conflicts between project dependencies.

Activate Virtual Environment

To activate the virtual environment and work inside it, use the following command:

source env_name/bin/activate

Once activated, you can install packages and run Python scripts specific to that environment.

Deactivate Virtual Environment

When you’re done working within a virtual environment, use deactivate to exit and return to the system’s default Python environment.

deactivate

List Installed Packages

To see all installed packages in your virtual environment or system-wide, use:

pip freeze

This shows all installed Python libraries and their versions, which is useful for creating requirements files (for example, pip freeze > requirements.txt).

Install Dependencies from a Requirements File

If you’re collaborating on a project, you’ll often share a requirements.txt file that lists all the libraries needed. You can install all dependencies from that file using:

pip install -r requirements.txt

9. Monitoring and Logging

Machine learning experiments, especially when training large models, can take a long time. Monitoring progress and logging output are critical for tracking experiments, debugging, and optimizing code.

tail (View the End of Files)

When checking logs, you often want to view the latest entries. The tail command displays the last few lines of a file (10 by default).

tail -f log_file.log

The -f option allows you to view new log entries in real-time, which is useful for monitoring live experiments or model training processes.
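
When you only need a snapshot rather than a live view, the -n option selects how many lines to show. The sketch below writes a made-up five-line log to a temporary file and reads back the last two lines.

```shell
# Write a small sample log (the line format is made up for illustration).
log_file="$(mktemp)"
for epoch in 1 2 3 4 5; do
  echo "epoch $epoch done" >> "$log_file"
done

# -n 2 prints only the last two lines instead of the default ten.
last_two="$(tail -n 2 "$log_file")"
echo "$last_two"

# For a live view, tail -f "$log_file" would keep running until interrupted
# with Ctrl+C, so it is left out of this non-interactive sketch.
```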

watch (Run Commands Repeatedly)

For real-time monitoring of system performance or model training, use the watch command to execute a command at regular intervals.

watch -n 1 nvidia-smi

This will update the GPU status every second, allowing you to monitor GPU usage during model training.

10. Disk Usage Analysis

Managing disk space effectively is essential, especially when handling large datasets or saving models. These commands help you analyze and manage disk usage.

du (Disk Usage)

To check the disk usage of a file or directory, use the du command, which is particularly useful for checking how much space large datasets or models are consuming.

du -sh /path/to/directory

The -s option provides a summary, while -h makes the output human-readable.
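
A common follow-up is ranking directories by size to find what is filling the disk. This sketch uses placeholder files in a temporary directory and reports sizes in KiB with -k so that sort -n can order them numerically.

```shell
# Create a 2 MiB placeholder "model" and a tiny config (hypothetical files).
work="$(mktemp -d)"
mkdir -p "$work/models" "$work/configs"
head -c 2097152 /dev/zero > "$work/models/model.bin"
echo "lr: 0.001" > "$work/configs/train.yaml"

# -s summarizes each argument; -k reports sizes in KiB. Sorting numerically
# puts the largest entry last.
du -sk "$work"/* | sort -n

# Extract the path of the largest entry (du separates size and path with a tab).
largest="$(du -sk "$work"/* | sort -n | tail -n 1 | cut -f2)"
echo "largest entry: $largest"
```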

ncdu (Interactive Disk Usage Analyzer)

For a more user-friendly disk usage analysis, ncdu provides an interactive interface for exploring which directories take up the most space. It may need to be installed through your package manager first.

ncdu /path/to/directory

11. Automating Tasks in Linux

Automation is essential for improving efficiency and avoiding repetitive tasks. Linux has several tools that make it easy to automate workflows in your machine learning projects.

cron (Schedule Tasks)

The cron utility allows you to schedule jobs to run at specific intervals. You can use cron to automate tasks like running model training scripts or backing up datasets.

crontab -e

This command opens your user’s crontab file in an editor. You can add entries to run scripts at specific times, for example, daily or weekly.
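
Each crontab entry consists of five schedule fields followed by the command to run. The sketch below only builds and prints an example entry rather than installing it; the interpreter path, script path, and log path are all hypothetical placeholders.

```shell
# Crontab entry format: minute hour day-of-month month day-of-week command
# This hypothetical entry runs a training script every day at 02:00 and
# appends both stdout and stderr to a log file. All paths are placeholders.
cron_line='0 2 * * * /usr/bin/python3 /home/user/project/train.py >> /home/user/project/train.log 2>&1'

# Print the entry so it can be pasted into the editor opened by crontab -e.
printf '%s\n' "$cron_line"

# Show the schedule on its own (the first five space-separated fields).
schedule="$(printf '%s\n' "$cron_line" | cut -d ' ' -f 1-5)"
echo "schedule: $schedule"
```

Using absolute paths in the entry is a deliberate choice, since cron jobs run with a minimal environment and may not find commands or files through your usual PATH or working directory.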

at (Schedule One-Time Tasks)

For one-time scheduled tasks, use the at command, which is useful when you need a task to execute once at a certain time.

echo "python train_model.py" | at 2:00 PM

12. System and Resource Optimization

Machine learning tasks can be resource-intensive, and optimizing your system’s performance can help reduce training times and improve the overall efficiency of your experiments. Linux provides several commands to optimize and manage system resources effectively.

swapon (Enable Swap Space)

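If your system runs out of RAM during memory-intensive tasks like training large models, swap space can act as overflow memory, although it is much slower than RAM.

sudo swapon /swapfile

This assumes /swapfile already exists and has been prepared with mkswap.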

sysctl (Modify Kernel Parameters)

Linux offers sysctl for tuning kernel parameters to optimize system performance, which is especially useful when running deep learning workloads.

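sudo sysctl -w vm.swappiness=10

This example sets the swappiness value, which controls how aggressively the kernel swaps data from RAM to disk; lower values make it prefer keeping data in RAM. Writing kernel parameters requires root privileges, and a value set with -w lasts only until the next reboot.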

13. Working with Containers

Containers are essential for managing machine learning environments. Whether you’re using Docker or Kubernetes, these tools help streamline the deployment of machine learning models in a reproducible and isolated environment.

docker (Manage Containers)

Docker is the most popular containerization tool. You can use it to build, manage, and run containers that package your ML models and environments.

docker build -t ml_model .
docker run -it ml_model

These commands build a Docker image for your ML model from the Dockerfile in the current directory and run it in an isolated container.

docker-compose (Manage Multi-Container Applications)

For more complex setups involving multiple containers, docker-compose lets you define and manage multi-container applications using a single configuration file. Newer Docker installations provide the same functionality as the docker compose subcommand.

docker-compose up

14. Security Best Practices

When working with sensitive data or deploying models in production, security becomes a major concern. Linux offers a variety of commands to help secure your environment and maintain data confidentiality.

chmod/chown (Change Permissions/Ownership)

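It’s important to restrict access to sensitive data files or scripts by using chmod to set file permissions and chown to change file ownership.

chmod 600 sensitive_data.csv
chown user:user sensitive_data.csv

Mode 600 gives the owner read and write access and blocks everyone else; a data file does not need the execute bit. Changing a file’s owner usually requires sudo.

Conclusion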

Linux commands are essential for every machine learning engineer. From managing files and resources to automating tasks and optimizing performance, they enable you to work more efficiently, streamline workflows, and keep your projects running smoothly.

Whether you’re a beginner or an experienced user, familiarizing yourself with these top Linux commands will help you navigate your ML projects with ease.

In addition to the basics, you’ll find that Linux’s flexibility and powerful ecosystem allow you to tailor your environment to your specific needs. The more you use these commands, the faster and more productive you’ll become, enabling you to focus on what truly matters: building better models and achieving great results.
