Building a Large-Scale Automotive Data Scraper: Part 2
Technical Implementation

Introduction
In part 1 I covered the project in general: how I discussed it with the client and how I planned to structure it. In today's post I'll walk through the technical implementation in more detail.
The 5-Step SNAP Strategy
For every scraping project I follow a strategy I developed myself, and it has worked on every one of them.
Here's how I applied it to this website:
1. Need – Defining the Targets
In our case the data was straightforward: we needed information about every car listed for sale on the website, along with information about every dealer.
Data Models Overview
[Diagrams: the Dealership model, the Car model, and the database relationship between Dealership and Car]
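In code, the two models and their one-to-many relationship might look like the following SQLAlchemy sketch. The specific columns (VIN, price, phone, and so on) are my illustrative assumptions, not the project's exact schema:

```python
from sqlalchemy import Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Dealership(Base):
    __tablename__ = "dealerships"
    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    phone = Column(String)     # assumed field
    address = Column(String)   # assumed field
    # One dealership lists many cars
    cars = relationship("Car", back_populates="dealership")

class Car(Base):
    __tablename__ = "cars"
    id = Column(Integer, primary_key=True)
    vin = Column(String, unique=True)  # assumed field
    make = Column(String)
    model = Column(String)
    year = Column(Integer)
    price = Column(Integer)
    dealership_id = Column(Integer, ForeignKey("dealerships.id"))
    dealership = relationship("Dealership", back_populates="cars")
```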
2. Place – Understanding the Website Structure
While examining the website, I discovered that most of the data was embedded in plain HTML tags. However, there was a catch: certain important fields were only accessible through JavaScript, loaded from an internal public API:
https://api.carsforsale.com/api/vehicle/profile/retrieve
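As a rough illustration, hitting that endpoint directly might look like the sketch below. The HTTP method, the `vehicleId` payload, and the bearer-token header are all assumptions on my part; in practice the exact call is reverse-engineered from the browser's network tab:

```python
import requests

API_URL = "https://api.carsforsale.com/api/vehicle/profile/retrieve"

def fetch_vehicle_profile(session: requests.Session, vehicle_id: str, jwt_token: str) -> dict:
    """Fetch the JavaScript-loaded portion of a listing straight from the API."""
    response = session.post(
        API_URL,
        json={"vehicleId": vehicle_id},                     # assumed payload shape
        headers={"Authorization": f"Bearer {jwt_token}"},   # assumed auth scheme
        timeout=15,
    )
    response.raise_for_status()
    return response.json()
```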
3. Access – Getting to the Data
For this step I think about only one question: how can I reach the data starting from a blank page? In our case it worked like this: to access all the cars, I could run a search without filters and page through the results. I verified this approach by manually jumping to the last page (the 170,000th), which loaded successfully.
Three-Step Approach:
- Generate all search page URLs
- Extract the 15 vehicle listing URLs from each search page
- For each car, request the individual listing page plus one additional call to the public API
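A minimal sketch of the first two steps, assuming a `page` query parameter on the unfiltered search and a hypothetical CSS selector for the listing links:

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://www.carsforsale.com"
TOTAL_PAGES = 170_000  # roughly the last page reachable via an unfiltered search

def search_page_urls():
    # Assumed pagination scheme: an unfiltered search with a page parameter
    for page in range(1, TOTAL_PAGES + 1):
        yield f"{BASE}/search?page={page}"

def listing_urls(session: requests.Session, search_url: str) -> list[str]:
    """Pull the ~15 vehicle listing links out of one search results page."""
    html = session.get(search_url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")
    # Hypothetical selector; the real one depends on the page markup
    return [BASE + a["href"] for a in soup.select("a.vehicle-card-link[href]")]
```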
4. Issues – Tackling Access Challenges
During implementation, I encountered three significant technical challenges that required creative solutions:
🛡️ Cloudflare Protection
The website employed Cloudflare Turnstile captcha as its first line of defense. This made conventional request-based scraping impossible without browser automation.
🔑 Session Management
The authentication system utilized two key components:
- Session Token: Verified successful captcha completion
- JWT Token: Required refreshing every 5 minutes
🚫 IP Blocking
While the site didn't block IPs aggressively, it would flag and block server IPs after approximately 2 million requests.
💡 The Hybrid Solution
I developed an efficient hybrid approach that balanced security bypass with performance:
1. Browser Session Management
Use Playwright with a patched Firefox driver for brief (1-2 minute) sessions to:
- Launch the browser in stealth mode
- Complete the Turnstile captcha legitimately
- Extract valid session cookies and tokens
- Transfer credentials to a lightweight HTTP client
- Shut down the resource-intensive browser
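Here's a condensed sketch of that bootstrap step using Playwright's sync API. The stealth patching itself isn't shown (a patched Firefox build is assumed), and the readiness check is a stand-in for whatever signals that Turnstile has passed:

```python
from playwright.sync_api import sync_playwright

def bootstrap_session(start_url: str) -> tuple[dict, str]:
    """Run a brief browser session, pass the Turnstile check,
    then hand cookies and the initial HTML to a lightweight HTTP client."""
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)  # stealth-patched build assumed
        context = browser.new_context()
        page = context.new_page()
        page.goto(start_url, wait_until="networkidle")
        # Wait for an element that only renders after the captcha passes
        page.wait_for_selector("#search-results", timeout=120_000)  # assumed selector
        cookies = {c["name"]: c["value"] for c in context.cookies()}
        html = page.content()  # carries the initial jwt_token / refreshToken
        browser.close()  # shut down the heavy browser as soon as possible
    return cookies, html
```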
2. Intelligent JWT Token Management
- Extract the initial jwt_token and refreshToken from the HTML
- Track token age and proactively refresh before expiry via dedicated endpoints
- If a token refresh fails twice, re-bootstrap with a fresh browser session
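In code, the refresh logic might look like this sketch. The refresh endpoint URL and the response field names are hypothetical; only the 5-minute lifetime and the two-failure fallback come from the actual project:

```python
import time
import requests

REFRESH_URL = "https://www.carsforsale.com/api/token/refresh"  # hypothetical endpoint
TOKEN_TTL = 5 * 60  # the JWT expired every 5 minutes

class TokenManager:
    def __init__(self, jwt_token: str, refresh_token: str):
        self.jwt_token = jwt_token
        self.refresh_token = refresh_token
        self.issued_at = time.time()

    def current_token(self, session: requests.Session) -> str:
        # Refresh proactively, ~30s before the token would expire
        if time.time() - self.issued_at > TOKEN_TTL - 30:
            self._refresh(session)
        return self.jwt_token

    def _refresh(self, session: requests.Session) -> None:
        for attempt in range(2):
            resp = session.post(REFRESH_URL, json={"refreshToken": self.refresh_token})
            if resp.ok:
                self.jwt_token = resp.json()["jwtToken"]  # assumed response field
                self.issued_at = time.time()
                return
        # Two failures in a row: fall back to a fresh browser bootstrap
        raise RuntimeError("token refresh failed twice; re-bootstrap browser session")
```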
3. Minimize Footprint
Reduce requests to only:
- One request to fetch listing IDs from search pages
- One request for the pre-JavaScript HTML of individual listings
- One targeted API call for additional vehicle data
Result: this approach struck the perfect balance: legitimate access credentials from real browser sessions, combined with the speed and resource efficiency of lightweight HTTP requests.
[Diagrams: hybrid solution architecture and the JWT token management flow]
5. Speed – Optimizing the Process
Some of these optimizations have already appeared in earlier steps, but I always try to solve each issue in the most efficient way possible. The goal is to minimize interactions with the website without triggering behavioral trackers.
After testing everything, it was time to link all the steps together. Since we needed to scrape a large number of records in the shortest possible time, the overall strategy had to be optimal.
At first I used a single request client firing async requests, sending many calls at the same time. But this wasted time whenever the session expired and I had to open the browser again. So I switched to multiple clients, meaning multiple independent browser-plus-HTTP sessions running in parallel, as sketched below.
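Conceptually, the multi-client setup looked something like this: several independent sessions draining a shared queue, so one expired session never stalls the rest. The client count and queue handling here are illustrative:

```python
import asyncio

NUM_CLIENTS = 8  # illustrative; tuned in practice to proxies and rate limits

async def run_client(client_id: int, pages: asyncio.Queue) -> None:
    """One independent scraping session with its own bootstrap and cookies."""
    # Reuses bootstrap_session() from the earlier sketch, off the event loop
    cookies, html = await asyncio.to_thread(
        bootstrap_session, "https://www.carsforsale.com"
    )
    while not pages.empty():
        page_url = await pages.get()
        # ...fetch page_url with this client's lightweight HTTP session...
        pages.task_done()

async def main(all_page_urls: list[str]) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for url in all_page_urls:
        queue.put_nowait(url)
    # Several clients drain one shared queue, so an expired session only
    # stalls the client that is re-bootstrapping, not the whole run
    await asyncio.gather(*(run_client(i, queue) for i in range(NUM_CLIENTS)))
```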
🏗️ Scalable Distributed Architecture
🧠 Manager Service
- Monitoring performance metrics and success rates
- Publishing page lists to RabbitMQ work queues
- Scheduling tri-weekly data extractions
- Tracking progress across workers
- Detecting changes between runs
🔍 Scraper Service
- Processing 15 concurrent car requests per page
- Managing JWT token refresh cycles
- Handling errors gracefully
- Extracting data from both search and detail pages
💾 Storage Service
- Validating data before database insertion
- Performing efficient batch operations
- Normalizing dealer data formats
- Managing relationships between cars and dealerships
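As an illustration of the Manager-to-Scraper handoff, publishing the page list to a RabbitMQ work queue with `pika` might look like this. The queue name and message shape are assumptions:

```python
import json
import pika

def publish_pages(page_urls: list[str], queue_name: str = "search_pages") -> None:
    """Manager side: push the page list into a RabbitMQ work queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue=queue_name, durable=True)  # survive broker restarts
    for url in page_urls:
        channel.basic_publish(
            exchange="",
            routing_key=queue_name,
            body=json.dumps({"url": url}),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
    connection.close()
```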