Online vs offline systems¶
- Services (Online Systems)
- Wait for client requests.
- Process each request quickly and return a response.
- Primary performance metric: Response time (latency).
- Key requirement: High availability (must be reachable when needed).
- Example: Web servers, APIs, databases serving live queries.
- Batch Processing Systems (Offline Systems)
- Process large volumes of data as jobs.
- Jobs run periodically (e.g., daily) and may take minutes to days.
- No user is waiting for immediate results.
- Primary performance metric: Throughput (how much data can be processed in a given time).
- Example: Nightly analytics, report generation, large-scale data transformations.
- Stream Processing Systems (Near-Real-Time Systems)
- Continuously consume events and produce outputs.
- Process data shortly after events occur (not waiting for full datasets).
- Lower latency than batch systems.
- Bridge between online and batch processing.
- Primary focus: Low-latency processing of continuous data streams.
- Example: Real-time fraud detection, live metrics dashboards.
Core Differences at a Glance
| System Type | Input Model | User Waiting? | Main Metric | Latency |
|---|---|---|---|---|
| Services | Request-driven | Yes | Response time | Very low |
| Batch | Fixed dataset | No | Throughput | High |
| Stream | Continuous events | Usually no | Low latency + throughput | Low |
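The batch vs stream distinction in the table can be sketched with plain Python generators (a hypothetical example, not from the notebook): a batch job materializes the whole dataset before producing any output, while a stream processor emits an updated result after every event.

```python
from typing import Iterable, Iterator

def events() -> Iterator[int]:
    """Hypothetical event source: yields purchase amounts one at a time."""
    for amount in [5, 20, 3, 50, 7]:
        yield amount

def batch_total(dataset: Iterable[int]) -> int:
    # Batch: the whole input is collected up front, then processed in one job.
    data = list(dataset)
    return sum(data)

def stream_totals(source: Iterable[int]) -> Iterator[int]:
    # Stream: a running result is updated and emitted as each event arrives.
    total = 0
    for event in source:
        total += event
        yield total

print(batch_total(events()))          # one answer, after all data is read
print(list(stream_totals(events())))  # an answer after every event
```

Both compute the same final total; the stream version simply makes intermediate results available with low latency instead of waiting for the full dataset.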
In [1]:
import time
In [2]:
! wc -l resources/mahabharat_gutenberg_lemmatized_sents.txt
305796 resources/mahabharat_gutenberg_lemmatized_sents.txt
Unix style word count¶
In [4]:
! time cat resources/mahabharat_gutenberg_lemmatized_sents.txt | tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | awk '{print $2 ", " $1}' | head -10
o, 47316
thou, 43002
say, 32130
king, 29268
great, 21822
man, 21144
thee, 20832
son, 20676
art, 18084
thy, 17856
awk: write failure (Broken pipe)
awk: close failed on file "/dev/stdout" (Broken pipe)
real	0m2.067s
user	0m2.073s
sys	0m0.106s
In [10]:
# - cat resources/mahabharat_gutenberg_lemmatized_sents.txt: Reads the file content.
# - tr -cs '[:alnum:]' '\n': Replaces runs of non-alphanumeric characters (punctuation, spaces) with newlines (\n); the -s (squeeze) flag collapses repeats so no blank lines are emitted. This effectively puts each word on its own line.
# - sort: Sorts the words alphabetically (required because uniq only detects adjacent duplicates).
# - uniq -c: Counts adjacent duplicate lines. The output format is "count word".
# - sort -nr: Sorts the result numerically (-n) in descending order (-r), so the most frequent words appear first.
# - awk '{print $2 ", " $1}': Swaps the columns to print "word, count" — uniq outputs count as the first field ($1) and the word as the second ($2).
# - head -10: Keeps only the ten most frequent words.
Custom program¶
In [6]:
import re
from collections import Counter

def count_words(filename):
    with open(filename, 'r') as f:
        text = f.read().lower()
    # Find all words using regex (ignores punctuation)
    words = re.findall(r'\b\w+\b', text)
    # Count and sort by most common
    word_counts = Counter(words).most_common(10)
    for word, count in word_counts:
        print(f"{word}, {count}")

# Usage
s_time = time.time()
count_words('resources/mahabharat_gutenberg_lemmatized_sents.txt')
print(f"took: {time.time() - s_time}s")
o, 51042
thou, 45402
say, 32130
king, 30018
great, 21906
man, 21216
thee, 20910
son, 20682
thy, 18942
art, 18084
took: 0.8784844875335693s
In [9]:
import re
from collections import Counter

def count_words_stream(filename):
    word_counts = Counter()
    # Compile regex once for better performance
    pattern = re.compile(r'\b\w+\b')
    with open(filename, 'r') as f:
        for line in f:
            # Find words in the current line only
            words_in_line = pattern.findall(line.lower())
            # Update the existing counter
            word_counts.update(words_in_line)
    # Print results
    for word, count in word_counts.most_common(10):
        print(f"{word}, {count}")

# Usage
s_time = time.time()
count_words_stream('resources/mahabharat_gutenberg_lemmatized_sents.txt')
print(f"took: {time.time() - s_time}s")
o, 51042
thou, 45402
say, 32130
king, 30018
great, 21906
man, 21216
thee, 20910
son, 20682
thy, 18942
art, 18084
took: 0.8076303005218506s
Comparison: Unix Pipeline vs Custom Program¶
| Feature | Unix Pipeline | Custom Program (Python) |
|---|---|---|
| Development Speed | Very fast for simple tasks | Slower (requires writing boilerplate) |
| Performance | Extremely fast (optimized C binaries) | Slower (interpreted), though optimized libraries help |
| Memory Usage | Efficient (streams data between processes) | Depends on implementation (easy to accidentally load into memory) |
| Readability | Low (cryptic syntax for complex logic) | High (readable code structure) |
| Extensibility | Limited (hard to add complex business logic) | Unlimited (access to full ecosystem) |
| Debugging | Difficult (opaque data flow) | Easy (debuggers, logging, print statements) |
Summary¶
- Use Unix Pipelines for quick, ad-hoc data exploration and simple transformations on text logs.
- Use Custom Programs when requirements become complex, logic needs to be tested/maintained, or performance tuning requires specific algorithmic control.
Breakdown: Unix Pipeline as MapReduce¶
The Unix pipeline we used is conceptually very similar to the MapReduce programming model. Here is how the components map to each other:
| Unix Command | MapReduce Phase | Description |
|---|---|---|
| `cat` | Input Reader | Reads raw data from the source. |
| `tr -cs '[:alnum:]' '\n'` | Map | Tokenizes the input: takes a stream of text and emits a stream of words (one per line). In MapReduce, the map() function would emit (word, 1) pairs; here the word itself is emitted. |
| `sort` | Shuffle & Sort | Groups identical keys (words) together, ensuring all instances of "king" arrive at the reducer together. |
| `uniq -c` | Reduce | Aggregates the grouped data by counting the occurrences of each unique word. This corresponds to the reduce() function summing up counts. |
| `sort -nr` | Post-Processing | Sorts the final output by frequency (value) rather than word (key). |
This simple pipeline demonstrates the core philosophy of batch processing: decompose a large task into a sequence of simple data transformations.
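The same four phases can be sketched in pure Python on a toy input (a hypothetical illustration, mirroring the table above rather than reimplementing the shell pipeline):

```python
from itertools import groupby

lines = ["the king say", "the king great"]

# Map: tokenize, emitting (word, 1) pairs — mirrors tr -cs '[:alnum:]' '\n'
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle & Sort: bring identical keys together — mirrors sort
mapped.sort(key=lambda kv: kv[0])

# Reduce: sum the counts for each grouped key — mirrors uniq -c
reduced = {word: sum(count for _, count in pairs)
           for word, pairs in groupby(mapped, key=lambda kv: kv[0])}

# Post-processing: order by frequency, descending — mirrors sort -nr
result = sorted(reduced.items(), key=lambda kv: kv[1], reverse=True)
print(result)
```

The key structural point is that the reduce step only needs to scan adjacent records, which is exactly why the sort (shuffle) phase must come first — the same reason `uniq -c` requires sorted input.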
References¶
- Chapter 10, Designing Data-Intensive Applications (Martin Kleppmann)