If you’re just starting your journey into data science, you might think it’s all about Python libraries, Jupyter notebooks, and fancy machine learning algorithms. While those are definitely important, there’s a powerful set of tools that often gets overlooked: the humble command line.

I’ve spent over a decade working with Linux systems, and I can tell you that mastering these command-line tools will make your life significantly easier. They’re fast, efficient, and often the quickest way to peek at your data, clean files, or automate repetitive tasks.

To make this tutorial practical and hands-on, we’ll use a sample e-commerce sales dataset throughout this article. Let me show you how to create it first, then we’ll explore it using all 10 tools.

Create the sample file:

cat > sales_data.csv << 'EOF'
order_id,date,customer_name,product,category,quantity,price,region,status
1001,2024-01-15,John Smith,Laptop,Electronics,1,899.99,North,completed
1002,2024-01-16,Sarah Johnson,Mouse,Electronics,2,24.99,South,completed
1003,2024-01-16,Mike Brown,Desk Chair,Furniture,1,199.99,East,completed
1004,2024-01-17,John Smith,Keyboard,Electronics,1,79.99,North,completed
1005,2024-01-18,Emily Davis,Notebook,Stationery,5,12.99,West,completed
1006,2024-01-18,Sarah Johnson,Laptop,Electronics,1,899.99,South,pending
1007,2024-01-19,Chris Wilson,Monitor,Electronics,2,299.99,North,completed
1008,2024-01-20,John Smith,USB Cable,Electronics,3,9.99,North,completed
1009,2024-01-20,Anna Martinez,Desk,Furniture,1,399.99,East,completed
1010,2024-01-21,Mike Brown,Laptop,Electronics,1,899.99,East,cancelled
1011,2024-01-22,Emily Davis,Pen Set,Stationery,10,5.99,West,completed
1012,2024-01-22,Sarah Johnson,Monitor,Electronics,1,299.99,South,completed
1013,2024-01-23,Chris Wilson,Desk Chair,Furniture,2,199.99,North,completed
1014,2024-01-24,Anna Martinez,Laptop,Electronics,1,899.99,East,completed
1015,2024-01-25,John Smith,Mouse Pad,Electronics,1,14.99,North,completed
1016,2024-01-26,Mike Brown,Bookshelf,Furniture,1,149.99,East,completed
1017,2024-01-27,Emily Davis,Highlighter,Stationery,8,3.99,West,completed
1018,2024-01-28,NULL,Laptop,Electronics,1,899.99,South,pending
1019,2024-01-29,Chris Wilson,Webcam,Electronics,1,89.99,North,completed
1020,2024-01-30,Sarah Johnson,Desk Lamp,Furniture,2,49.99,South,completed
EOF
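
To sanity-check the file you just created, you can pretty-print it with column (part of util-linux, so it ships with most Linux distributions, though not guaranteed in every minimal environment):

column -s',' -t sales_data.csv | head -5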

Now let’s explore this file using our 10 essential tools!

1. grep – Your Pattern-Matching Tool

Think of grep as your data detective: it searches through files and finds lines that match the patterns you specify, which is incredibly useful when you’re dealing with large log files or text datasets.

Example 1: Find all orders from John Smith.

grep "John Smith" sales_data.csv

Example 2: Count how many laptop orders we have.

grep -c "Laptop" sales_data.csv

Example 3: Find all orders that are NOT completed.

grep -v "completed" sales_data.csv | grep -v "order_id"

Example 4: Find orders with line numbers.

grep -n "Electronics" sales_data.csv | head -5
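
Two extra flags worth keeping in mind: -i ignores case and -E enables extended regular expressions, so you can match several patterns at once. A quick sketch against the same file:

grep -iE "laptop|monitor" sales_data.csv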

2. awk – The Swiss Army Knife for Text Processing

awk is like a mini programming language designed for text processing. It’s perfect for extracting specific columns, performing calculations, and transforming data on the fly.

Example 1: Extract just product names and prices.

awk -F',' '{print $4, $7}' sales_data.csv | head -6

Example 2: Calculate total revenue (quantity × price) from all orders.

awk -F',' 'NR>1 {sum+=$6*$7} END {print "Total Revenue: $" sum}' sales_data.csv

Example 3: Show orders where the price is greater than $100.

awk -F',' 'NR>1 && $7>100 {print $1, $4, $7}' sales_data.csv

Example 4: Calculate the average price by category.

awk -F',' 'NR>1 {
    category[$5]+=$7
    count[$5]++
} 
END {
    for (cat in category) 
        printf "%s: $%.2fn", cat, category[cat]/count[cat]
}' sales_data.csv
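
awk can also stand in for a sort | uniq -c pipeline when you want quick group counts. For example, a small sketch that counts orders per region (region is column 8):

awk -F',' 'NR>1 {orders[$8]++} END {for (r in orders) print r, orders[r]}' sales_data.csv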

3. sed – The Stream Editor for Quick Edits

sed is your go-to tool for find-and-replace operations and text transformations. It’s like doing “find and replace” in a text editor, but from the command line and much faster.

Example 1: Replace NULL values with “Unknown”.

sed 's/NULL/Unknown/g' sales_data.csv | grep "Unknown"

Example 2: Remove the header line.

sed '1d' sales_data.csv | head -3

Example 3: Change “completed” to “DONE”.

sed 's/completed/DONE/g' sales_data.csv | tail -5

Example 4: Add a dollar sign before all prices.

sed 's/,\([0-9]*\.[0-9]*\),/,$\1,/g' sales_data.csv | head -4
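
Every sed example above writes to standard output and leaves the file alone. If you actually want to edit the file, GNU sed’s -i flag works in place; adding a suffix keeps a backup copy (BSD/macOS sed expects the suffix as a separate argument). Note that later examples assume the original file, so restore from the backup if you try this:

sed -i.bak 's/NULL/Unknown/g' sales_data.csv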

4. cut – Simple Column Extraction

While awk is powerful, sometimes you just need something simple and fast. That’s where the cut command comes in: it’s specifically designed to extract columns from delimited files.

Example 1: Extract customer names and products.

cut -d',' -f3,4 sales_data.csv | head -6

Example 2: Extract only the region column.

cut -d',' -f8 sales_data.csv | head -8

Example 3: Get order ID, product, and status.

cut -d',' -f1,4,9 sales_data.csv | head -6
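
cut also accepts field ranges, which saves typing when you want a block of adjacent columns. For instance, the first five columns:

cut -d',' -f1-5 sales_data.csv | head -4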

5. sort – Organize Your Data

Sorting data is fundamental to analysis, and the sort command does this incredibly efficiently, even with files that are too large to fit in memory.

Example 1: Sort by customer name alphabetically.

sort -t',' -k3 sales_data.csv | head -6

Example 2: Sort by price (highest to lowest).

sort -t',' -k7 -rn sales_data.csv | head -6

Example 3: Sort by region, then by price.

sort -t',' -k8,8 -k7,7rn sales_data.csv | grep -v "order_id" | head -8
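
One annoyance when sorting CSVs is that the header line gets mixed into the results. A common trick (just a small sketch) is to print the header first and sort only the body:

(head -1 sales_data.csv && tail -n +2 sales_data.csv | sort -t',' -k7 -rn) | head -6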

6. uniq – Find and Count Unique Values

The uniq command helps you identify unique values, count occurrences, and find duplicates. Think of it as a lightweight version of pandas’ value_counts().

uniq only collapses adjacent duplicate lines, so you’ll almost always pipe sorted data into it.

Example 1: Count orders by region.

cut -d',' -f8 sales_data.csv | tail -n +2 | sort | uniq -c

Example 2: Count orders by product category.

cut -d',' -f5 sales_data.csv | tail -n +2 | sort | uniq -c | sort -rn

Example 3: Find which customers made multiple purchases.

cut -d',' -f3 sales_data.csv | tail -n +2 | sort | uniq -c | sort -rn

Example 4: Show unique products ordered.

cut -d',' -f4 sales_data.csv | tail -n +2 | sort | uniq
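
If you only care about values that show up more than once, uniq -d prints just the duplicates, e.g. customers with repeat orders:

cut -d',' -f3 sales_data.csv | tail -n +2 | sort | uniq -d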

7. wc – Word Count (and More)

Don’t let the name fool you: wc (word count) is useful for much more than counting words. It’s your quick statistics tool.

Example 1: Count the total number of lines in the file (subtract 1 for the header to get the number of orders).

wc -l sales_data.csv

Example 2: Count how many Electronics orders we have.

grep "Electronics" sales_data.csv | wc -l

Example 3: Count total characters in the file.

wc -c sales_data.csv

Example 4: Multiple statistics at once.

wc sales_data.csv
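
Because wc -l counts every line, header included, a common pattern is to strip the header first when you want just the number of data rows:

tail -n +2 sales_data.csv | wc -l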

8. head and tail – Preview Your Data

Instead of opening a massive file, use the head command to see the first few lines or tail to see the last few.

Example 1: View the first 5 orders (6 lines including the header).

head -6 sales_data.csv

Example 2: View just the column headers.

head -1 sales_data.csv

Example 3: View the last 5 orders.

tail -5 sales_data.csv

Example 4: Skip the header and see just the data.

tail -n +2 sales_data.csv | head -3
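
You can also combine the two to pull out a single row by position; for example, the 10th order (line 11 once you count the header):

head -11 sales_data.csv | tail -1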

9. find – Locate Files Across Directories

When working on projects, you often need to find files scattered across directories, and the find command is incredibly powerful for this.

First, let’s create a realistic directory structure:

mkdir -p data_project/{raw,processed,reports}
cp sales_data.csv data_project/raw/
cp sales_data.csv data_project/processed/sales_cleaned.csv
echo "Summary report" > data_project/reports/summary.txt

Example 1: Find all CSV files.

find data_project -name "*.csv"

Example 2: Find files modified in the last minute.

find data_project -name "*.csv" -mmin -1

Example 3: Find and count lines in all CSV files.

find data_project -name "*.csv" -exec wc -l {} ;

Example 4: Find files larger than 1KB.

find data_project -type f -size +1k
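
An alternative to -exec is piping the results into xargs; the -print0 and -0 pair keeps filenames with spaces intact:

find data_project -name "*.csv" -print0 | xargs -0 wc -l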

10. jq – JSON Processor Extraordinaire

In modern data science, a lot of information comes from APIs, which usually send data in JSON format, a structured way of organizing information.

While tools like grep, awk, and sed are great for searching and manipulating plain text, jq is built specifically for handling JSON data. Install it first if it isn’t already on your system:

sudo apt install jq    # Ubuntu/Debian
sudo yum install jq    # CentOS/RHEL

Let’s convert some of our data to JSON format first:

cat > sales_sample.json << 'EOF'
{
  "orders": [
    {
      "order_id": 1001,
      "customer": "John Smith",
      "product": "Laptop",
      "price": 899.99,
      "region": "North",
      "status": "completed"
    },
    {
      "order_id": 1002,
      "customer": "Sarah Johnson",
      "product": "Mouse",
      "price": 24.99,
      "region": "South",
      "status": "completed"
    },
    {
      "order_id": 1006,
      "customer": "Sarah Johnson",
      "product": "Laptop",
      "price": 899.99,
      "region": "South",
      "status": "pending"
    }
  ]
}
EOF

Example 1: Pretty-print JSON.

jq '.' sales_sample.json

Example 2: Extract all customer names.

jq '.orders[].customer' sales_sample.json

Example 3: Filter orders over $100.

jq '.orders[] | select(.price > 100)' sales_sample.json

Example 4: Convert to CSV format.

jq -r '.orders[] | [.order_id, .customer, .product, .price] | @csv' sales_sample.json
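
jq can also do lightweight aggregation. For example, summing the price of every order in the sample file:

jq '[.orders[].price] | add' sales_sample.json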

Bonus: Combining Tools with Pipes

Here’s where the magic really happens: you can chain these tools together using pipes (|) to create powerful data processing pipelines.

Example 1: Find the 10 most common words in a text file:

cat article.txt | tr '[:upper:]' '[:lower:]' | tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -10

Example 2: Analyze web server logs:

cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

Example 3: Quick data exploration:

cut -d',' -f3 sales.csv | tail -n +2 | sort -n | uniq -c
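
And one more sketch against the sales_data.csv we created earlier, counting completed orders per region:

grep "completed" sales_data.csv | cut -d',' -f8 | sort | uniq -c | sort -rn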

Practical Workflow Example

Let me show you how these tools work together in a real scenario. Imagine you have a large CSV file with sales data, and you want to:

  • Remove the header.
  • Extract the product name and price columns.
  • Find the top 10 most expensive products.

Here’s the one-liner:

tail -n +2 sales.csv | cut -d',' -f2,5 | sort -t',' -k2 -rn | head -10

Breaking it down:

  • tail -n +2: Skip the header row.
  • cut -d',' -f2,5: Extract columns 2 and 5.
  • sort -t',' -k2 -rn: Sort by second field, numerically, reverse order.
  • head -10: Show top 10 results.
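
Applied to the sales_data.csv from this article (product is column 4, price is column 7), the same pattern looks like this:

tail -n +2 sales_data.csv | cut -d',' -f4,7 | sort -t',' -k2 -rn | head -10
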
Conclusion

These 10 command-line tools are like having a Swiss Army knife for data. They’re fast, efficient, and once you get comfortable with them, you’ll find yourself reaching for them constantly, even when you’re working on Python projects.

Start with the basics: head, tail, wc, and grep. Once those feel natural, add cut, sort, and uniq to your arsenal. Finally, level up with awk, sed, and jq.

Remember, you don’t need to memorize everything. Keep this guide bookmarked, and refer back to it when you need a specific tool. Over time, these commands will become second nature.
