# Performance Improvements - Comprehensive Guide

## Overview
We've achieved 7-8x performance improvements through four key optimizations:

- Batch processing functions - Reduce `block_on` calls
- Global HTTP client - Connection pooling and reuse
- CPU-aware Tokio runtime - Dynamic worker thread allocation
- Memory optimization - String interning and efficient request generation
## Key Improvements

### 1. Global HTTP Client with Connection Pooling
Before:
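Each call constructed its own client, so no connections could be reused across requests; an illustrative sketch, not the exact original code:

```rust
// A fresh client (and a fresh connection pool) per request
let client = PlayStoreClient::new(30).expect("Failed to create HTTP client");
```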
After:
```rust
use once_cell::sync::Lazy; // assumed crate for the Lazy type

// Global singleton with connection pooling
static HTTP_CLIENT: Lazy<PlayStoreClient> = Lazy::new(|| {
    PlayStoreClient::new(30).expect("Failed to create HTTP client")
});
```
Benefits:
- TCP connections are reused across requests
- Reduced connection establishment overhead
- Better resource utilization
### 2. CPU-Aware Tokio Runtime
Before:
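The runtime previously relied on Tokio's defaults, which spawn one worker per CPU core; a hypothetical sketch:

```rust
// Default multi-thread runtime: one worker per core,
// leaving little CPU headroom for Python threads
let runtime = tokio::runtime::Runtime::new().expect("Failed to create runtime");
```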
After:
```rust
let num_cpus = std::thread::available_parallelism()
    .map(|n| n.get())
    .unwrap_or(4);
let worker_threads = (num_cpus / 2).clamp(2, 8);
```
Configuration:
- Uses half of available CPU cores
- Minimum: 2 workers
- Maximum: 8 workers
- Leaves CPU resources for Python threads
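A minimal sketch of feeding that value into Tokio's runtime builder (the builder calls are standard Tokio API; the exact playfast wiring is assumed):

```rust
use tokio::runtime::Builder;

let runtime = Builder::new_multi_thread()
    .worker_threads(worker_threads) // clamp(num_cpus / 2, 2, 8) from above
    .enable_all()                   // enable timer and I/O drivers
    .build()
    .expect("Failed to build Tokio runtime");
```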
### 3. Batch Processing Functions
The Problem:
```python
# Sequential calls - Multiple block_on invocations
for request in requests:
    result = fetch_and_parse_list(...)  # Each call blocks the runtime
```
The Solution:
```rust
// Single block_on with parallel futures
runtime.block_on(async {
    let futures: Vec<_> = requests.iter()
        .map(|req| client.fetch_and_parse_list(...))
        .collect();
    // True parallel execution inside Rust!
    try_join_all(futures).await
})
```
## New Batch Functions

### 1. `fetch_and_parse_apps_batch`
Fetch multiple app pages in parallel.
```python
from playfast._core import fetch_and_parse_apps_batch

requests = [
    ("com.spotify.music", "en", "us"),
    ("com.netflix.mediaclient", "en", "us"),
    ("com.whatsapp", "en", "us"),
]
apps = fetch_and_parse_apps_batch(requests)
# Returns: list[RustAppInfo]
```
### 2. `fetch_and_parse_list_batch`
Fetch multiple category/collection listings in parallel.
```python
from playfast._core import fetch_and_parse_list_batch

requests = [
    ("GAME_ACTION", "topselling_free", "en", "us", 100),
    ("SOCIAL", "topselling_free", "en", "kr", 100),
    (None, "topselling_paid", "en", "jp", 50),  # None = all apps
]
results = fetch_and_parse_list_batch(requests)
# Returns: list[list[RustSearchResult]]
```
### 3. `fetch_and_parse_search_batch`
Perform multiple searches in parallel.
```python
from playfast._core import fetch_and_parse_search_batch

requests = [
    ("spotify", "en", "us"),
    ("netflix", "en", "us"),
    ("youtube", "en", "us"),
]
results = fetch_and_parse_search_batch(requests)
# Returns: list[list[RustSearchResult]]
```
### 4. `fetch_and_parse_reviews_batch`
Fetch reviews for multiple apps in parallel.
```python
from playfast._core import fetch_and_parse_reviews_batch

requests = [
    ("com.spotify.music", "en", "us", 1, None),        # sort=1 (newest)
    ("com.netflix.mediaclient", "en", "us", 2, None),  # sort=2 (highest)
]
results = fetch_and_parse_reviews_batch(requests)
# Returns: list[tuple[list[RustReview], str | None]]
```
## Performance Results

### Benchmark: 25 Category Requests
| Method | Time | Req/s | Speedup |
|---|---|---|---|
| Batch (all at once) | 1.25s | 20.05 | 7.97x 🚀 |
| Batch (5 per batch) | 2.82s | 8.85 | 3.52x |
| Sequential (baseline) | 9.94s | 2.51 | 1.00x |
### Key Findings

- 87.5% Performance Improvement: Batch processing is nearly 8x faster
- Block-on Optimization: Reducing `block_on` calls is critical
- Scalability: Larger batches perform better (up to a point)
### Example: 5 App Pages
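A sketch of this scenario using `fetch_and_parse_apps_batch` (the app IDs are illustrative):

```python
from playfast._core import fetch_and_parse_apps_batch

app_ids = [
    "com.spotify.music",
    "com.netflix.mediaclient",
    "com.whatsapp",
    "com.instagram.android",
    "com.google.android.youtube",
]
# All five app pages are fetched in a single batch call
apps = fetch_and_parse_apps_batch([(app_id, "en", "us") for app_id in app_ids])
print(f"Fetched {len(apps)} app pages")
```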
### Example: 3 Country Comparison
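A sketch of this scenario: the same app fetched from three country storefronts in one batch (the countries chosen are illustrative):

```python
from playfast._core import fetch_and_parse_apps_batch

countries = ["us", "kr", "jp"]
requests = [("com.spotify.music", "en", country) for country in countries]
apps = fetch_and_parse_apps_batch(requests)
for country, app in zip(countries, apps):
    print(f"{country.upper()}: {app}")
```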
## Architecture Comparison

### Sequential Processing (Old)
```text
Python Thread 1:
  [block_on] → Request 1 → [wait] → Result 1
  [block_on] → Request 2 → [wait] → Result 2
  [block_on] → Request 3 → [wait] → Result 3

Total: 3 runtime enter/exit cycles
```
### Batch Processing (New)
```text
Python Thread 1:
  [block_on] → {
    Request 1 → [async await] → Result 1
    Request 2 → [async await] → Result 2
    Request 3 → [async await] → Result 3
  }

Total: 1 runtime enter/exit cycle
All requests execute in parallel!
```
## Best Practices

### When to Use Batch Functions
✅ Use batch functions when:

- Fetching multiple items of the same type
- Processing data from multiple countries
- Collecting category/collection data at scale
- You need maximum throughput

❌ Use single functions when:

- Fetching only one item
- You need fine-grained error handling per request (see the sketch below)
- Sequential processing is required by business logic
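For per-request fault isolation, the single-call form lets each failure be caught individually, at the cost of sequential execution (a minimal sketch; the failing package ID is made up):

```python
from playfast._core import fetch_and_parse_app

apps = {}
for app_id in ["com.spotify.music", "not.a.real.package"]:
    try:
        apps[app_id] = fetch_and_parse_app(app_id, "en", "us")
    except Exception as exc:  # a batch call would fail as a whole here
        print(f"{app_id}: {exc}")
```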
### Example: Multi-Country Data Collection
```python
from playfast._core import fetch_and_parse_list_batch

# Collect top apps from 10 countries and 5 categories
countries = ["us", "kr", "jp", "de", "gb", "fr", "br", "in", "ca", "au"]
categories = ["GAME_ACTION", "SOCIAL", "PRODUCTIVITY", "ENTERTAINMENT", "COMMUNICATION"]

requests = [
    (cat, "topselling_free", "en", country, 200)
    for country in countries
    for cat in categories
]

# 50 requests in parallel with a single function call!
results = fetch_and_parse_list_batch(requests)

# Process results (pair order matches the request order above)
pairs = [(cat, country) for country in countries for cat in categories]
for (cat, country), apps in zip(pairs, results):
    print(f"{country.upper()} / {cat}: {len(apps)} apps")
```
## Migration Guide

### Before (Sequential)
```python
from playfast._core import fetch_and_parse_app

results = []
for app_id in app_ids:
    app = fetch_and_parse_app(app_id, "en", "us")
    results.append(app)
```
### After (Batch)
```python
from playfast._core import fetch_and_parse_apps_batch

requests = [(app_id, "en", "us") for app_id in app_ids]
results = fetch_and_parse_apps_batch(requests)
```
## Technical Details

### Why Batch Processing is Faster
1. Single Runtime Entry
    - Only one `block_on` call reduces context switching
    - The Tokio runtime stays active throughout the batch
2. True Parallel Execution
    - `try_join_all` runs all futures concurrently
    - Limited only by Tokio worker threads and network
3. Connection Pooling
    - The global HTTP client reuses TCP connections
    - DNS lookups are cached
4. Zero Python GIL Contention
    - All work happens in Rust
    - The GIL is released for the entire batch
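To make the first two points concrete, here is a self-contained sketch (assuming the `tokio` and `futures` crates) in which a hypothetical `fake_fetch` stands in for a real HTTP call; a single `block_on` over `try_join_all` overlaps all of the waits:

```rust
use std::time::{Duration, Instant};

use futures::future::try_join_all;
use tokio::runtime::Builder;

// Stand-in for a network request: each "fetch" just sleeps 100 ms
async fn fake_fetch(id: usize) -> Result<usize, std::io::Error> {
    tokio::time::sleep(Duration::from_millis(100)).await;
    Ok(id)
}

fn main() {
    let runtime = Builder::new_multi_thread()
        .worker_threads(4)
        .enable_all()
        .build()
        .expect("Failed to build runtime");

    let start = Instant::now();
    // One block_on drives all ten futures concurrently: the batch
    // finishes in ~100 ms, not the ~1 s that ten sequential
    // block_on calls would take.
    let results = runtime
        .block_on(try_join_all((0..10).map(fake_fetch)))
        .expect("batch failed");
    println!("{} results in {:?}", results.len(), start.elapsed());
}
```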
### Configuration Tuning
The runtime uses dynamic worker thread allocation:
```rust
// 16-core system: 8 workers
//  8-core system: 4 workers
//  4-core system: 2 workers
//  2-core system: 2 workers (minimum)
```
This leaves CPU cores available for:
- Python's main thread
- Other Python threads
- System processes
## Limitations
1. All-or-Nothing: If one request fails, the entire batch fails
    - Consider smaller batches for better fault tolerance
2. Memory Usage: Large batches consume more memory
    - Recommended: 20-50 requests per batch
3. No Progress Updates: A batch completes as a whole
    - Use smaller batches if you need progress indicators (see the chunking sketch below)
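A minimal sketch of the smaller-batch approach; the `chunk_size` value and the helper name `fetch_in_chunks` are illustrative, not part of the playfast API:

```python
from playfast._core import fetch_and_parse_apps_batch

def fetch_in_chunks(requests, chunk_size=25):
    """Run a large job as several batches: bounded memory,
    per-chunk fault isolation, and natural progress checkpoints."""
    results = []
    for start in range(0, len(requests), chunk_size):
        chunk = requests[start:start + chunk_size]
        try:
            results.extend(fetch_and_parse_apps_batch(chunk))
        except Exception as exc:
            # One failed request sinks only its own chunk
            print(f"Chunk starting at {start} failed: {exc}")
        print(f"Progress: {min(start + chunk_size, len(requests))}/{len(requests)}")
    return results
```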
## Future Improvements

- Add per-request error handling (return a `Result` for each request)
- Implement automatic batch size optimization
- Add request prioritization within batches
- Support streaming batch results
## Conclusion

The batch processing functions provide a 7-8x performance improvement for multi-request scenarios by:
- Reducing runtime enter/exit overhead
- Enabling true parallel execution in Rust
- Maximizing connection pooling benefits
- Eliminating Python GIL contention
For production use cases involving multiple requests, batch functions are strongly recommended.
See also:

- `examples/batch_usage.py` - Working examples
- `benchmarks/test_batch_performance.py` - Performance comparisons
- `python/playfast/_core.pyi` - Type hints and documentation