Online vs offline systems¶
- Services (Online Systems)
- Wait for client requests.
- Process each request quickly and return a response.
- Primary performance metric: Response time (latency).
- Key requirement: High availability (must be reachable when needed).
- Example: Web servers, APIs, databases serving live queries.
- Batch Processing Systems (Offline Systems)
- Process large volumes of data as jobs.
- Jobs run periodically (e.g., daily) and may take minutes to days.
- No user is waiting for immediate results.
- Primary performance metric: Throughput (how much data can be processed in a given time).
- Example: Nightly analytics, report generation, large-scale data transformations.
- Stream Processing Systems (Near-Real-Time Systems)
- Continuously consume events and produce outputs.
- Process data shortly after events occur (not waiting for full datasets).
- Lower latency than batch systems.
- Bridge between online and batch processing.
- Primary focus: Low-latency processing of continuous data streams.
- Example: Real-time fraud detection, live metrics dashboards.
Core Differences at a Glance
| System Type | Input Model | User Waiting? | Main Metric | Latency |
|---|---|---|---|---|
| Services | Request-driven | Yes | Response time | Very low |
| Batch | Fixed dataset | No | Throughput | High |
| Stream | Continuous events | Usually no | Low latency + throughput | Low |
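The batch vs stream distinction in the table can be sketched with plain Python generators (a hypothetical example, not from the notebook): a batch job materializes the whole dataset before producing any output, while a stream processor emits an updated result after every event.

```python
from typing import Iterable, Iterator

def events() -> Iterator[int]:
    """Hypothetical event source: yields purchase amounts one at a time."""
    for amount in [5, 20, 3, 50, 7]:
        yield amount

def batch_total(dataset: Iterable[int]) -> int:
    # Batch: the whole input is collected up front, then processed in one job.
    data = list(dataset)
    return sum(data)

def stream_totals(source: Iterable[int]) -> Iterator[int]:
    # Stream: a running result is updated and emitted as each event arrives.
    total = 0
    for event in source:
        total += event
        yield total

print(batch_total(events()))          # one answer, after all data is read
print(list(stream_totals(events())))  # an answer after every event
```

Both compute the same final total; the stream version simply makes intermediate results available with low latency instead of waiting for the full dataset.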
In [1]:
import time
In [2]:
! wc -l resources/mahabharat_gutenberg_lemmatized_sents.txt
305796 resources/mahabharat_gutenberg_lemmatized_sents.txt
Unix style word count¶
In [4]:
! time cat resources/mahabharat_gutenberg_lemmatized_sents.txt | tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | awk '{print $2 ", " $1}' | head -10
o, 47316
thou, 43002
say, 32130
king, 29268
great, 21822
man, 21144
thee, 20832
son, 20676
art, 18084
thy, 17856
awk: write failure (Broken pipe)
awk: close failed on file "/dev/stdout" (Broken pipe)
real	0m2.067s
user	0m2.073s
sys	0m0.106s
In [10]:
# - cat resources/mahabharat_gutenberg_lemmatized_sents.txt: Reads the file content.
# - tr -cs '[:alnum:]' '\n': Replaces runs of non-alphanumeric characters (punctuation, spaces) with newlines (\n); the -s (squeeze) flag collapses repeats so no blank lines are emitted. This effectively puts each word on its own line.
# - sort: Sorts the words alphabetically (required because uniq only detects adjacent duplicates).
# - uniq -c: Counts adjacent duplicate lines. The output format is "count word".
# - sort -nr: Sorts the result numerically (-n) in descending order (-r), so the most frequent words appear first.
# - awk '{print $2 ", " $1}': Swaps the columns to print "word, count" — uniq outputs count as the first field ($1) and the word as the second ($2).
# - head -10: Keeps only the ten most frequent words.
Custom program¶
In [6]:
import re
from collections import Counter

def count_words(filename):
    with open(filename, 'r') as f:
        text = f.read().lower()
    # Find all words using regex (ignores punctuation)
    words = re.findall(r'\b\w+\b', text)
    # Count and sort by most common
    word_counts = Counter(words).most_common(10)
    for word, count in word_counts:
        print(f"{word}, {count}")

# Usage
s_time = time.time()
count_words('resources/mahabharat_gutenberg_lemmatized_sents.txt')
print(f"took: {time.time() - s_time}s")
o, 51042
thou, 45402
say, 32130
king, 30018
great, 21906
man, 21216
thee, 20910
son, 20682
thy, 18942
art, 18084
took: 0.8784844875335693s
In [9]:
import re
from collections import Counter

def count_words_stream(filename):
    word_counts = Counter()
    # Compile regex once for better performance
    pattern = re.compile(r'\b\w+\b')
    with open(filename, 'r') as f:
        for line in f:
            # Find words in the current line only
            words_in_line = pattern.findall(line.lower())
            # Update the existing counter
            word_counts.update(words_in_line)
    # Print results
    for word, count in word_counts.most_common(10):
        print(f"{word}, {count}")

# Usage
s_time = time.time()
count_words_stream('resources/mahabharat_gutenberg_lemmatized_sents.txt')
print(f"took: {time.time() - s_time}s")
o, 51042
thou, 45402
say, 32130
king, 30018
great, 21906
man, 21216
thee, 20910
son, 20682
thy, 18942
art, 18084
took: 0.8076303005218506s
Comparison: Unix Pipeline vs Custom Program¶
| Feature | Unix Pipeline | Custom Program (Python) |
|---|---|---|
| Development Speed | Very fast for simple tasks | Slower (requires writing boilerplate) |
| Performance | Extremely fast (optimized C binaries) | Slower (interpreted), though optimized libraries help |
| Memory Usage | Efficient (streams data between processes) | Depends on implementation (easy to accidentally load into memory) |
| Readability | Low (cryptic syntax for complex logic) | High (readable code structure) |
| Extensibility | Limited (hard to add complex business logic) | Unlimited (access to full ecosystem) |
| Debugging | Difficult (opaque data flow) | Easy (debuggers, logging, print statements) |
Summary¶
- Use Unix Pipelines for quick, ad-hoc data exploration and simple transformations on text logs.
- Use Custom Programs when requirements become complex, logic needs to be tested/maintained, or performance tuning requires specific algorithmic control.
Breakdown: Unix Pipeline as MapReduce¶
The Unix pipeline we used is conceptually very similar to the MapReduce programming model. Here is how the components map to each other:
| Unix Command | MapReduce Phase | Description |
|---|---|---|
| `cat` | Input Reader | Reads raw data from the source. |
| `tr -cs '[:alnum:]' '\n'` | Map | Tokenizes the input: takes a stream of text and emits a stream of words (one per line). In MapReduce, the map() function would emit (word, 1) pairs; here the word itself is emitted. |
| `sort` | Shuffle & Sort | Groups identical keys (words) together, ensuring all instances of "king" arrive at the reducer together. |
| `uniq -c` | Reduce | Aggregates the grouped data by counting the occurrences of each unique word. This corresponds to the reduce() function summing up counts. |
| `sort -nr` | Post-Processing | Sorts the final output by frequency (value) rather than word (key). |
This simple pipeline demonstrates the core philosophy of batch processing: decompose a large task into a sequence of simple data transformations.
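The same four phases can be sketched in pure Python on a toy input (a hypothetical illustration, mirroring the table above rather than reimplementing the shell pipeline):

```python
from itertools import groupby

lines = ["the king say", "the king great"]

# Map: tokenize, emitting (word, 1) pairs — mirrors tr -cs '[:alnum:]' '\n'
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & Sort: bring identical keys together — mirrors sort
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each grouped key — mirrors uniq -c
reduced = {word: sum(count for _, count in pairs)
           for word, pairs in groupby(mapped, key=lambda kv: kv[0])}

# Post-processing: order by frequency, descending — mirrors sort -nr
result = sorted(reduced.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

The key structural point is that the reduce step only needs to scan adjacent records, which is exactly why the sort (shuffle) phase must come first — the same reason `uniq -c` requires sorted input.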
References¶
- Chapter 10, Designing Data-Intensive Applications (Martin Kleppmann)