Building a Large-Scale Automotive Data Scraper: Part 2

Technical Implementation

This is Part 2 of a multi-part series on building large-scale web scrapers.

Introduction

In Part 1 I talked about the project in general, how I discussed it with the client, and how I planned to structure it. In today’s post I’ll walk through a more detailed approach to building this project.

The 5-Step SNAP Strategy

For every scraping project I follow a strategy I developed myself, and it has worked for every one of them:

  1. Need – define the targets
  2. Place – locate the structure
  3. Access – determine delivery
  4. Issues – note blockers
  5. Speed – optimize the path

Here’s how I applied it to this website:

1. Need – Defining the Targets

In our case the data was straightforward. We needed information about the cars listed for sale on the website, along with information about every car dealer.

Data Models Overview

Dealership Model

  • dealerId (primary key)
  • dealerName
  • address, city, state, zipCode
  • latitude, longitude
  • phoneNumber, website, email
  • refId (unique identifier)
  • urlSlug
  • hours

Car Model

  • carId (primary key)
  • dealerId (foreign key to Dealership)
  • make, model, model_year
  • price, mileage
  • condition, main_image
  • engine, fuel, vin
  • transmission, drive_train
  • exterior_color, interior_color
  • images, features (arrays)
  • fuel_economy, days_listed
  • refId, domain

Database Relationship Model

A Dealership has many Cars (1:N): each Car row carries a dealerId foreign key referencing the Dealership primary key (dealerId).
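
To make the relationship concrete, here’s a minimal sketch of the two models in SQLAlchemy. The ORM is my choice for illustration (the article doesn’t prescribe one), and the column types are assumptions based on the field lists above:

```python
from sqlalchemy import Column, Integer, Float, String, JSON, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Dealership(Base):
    __tablename__ = "dealerships"

    dealerId = Column(Integer, primary_key=True)
    dealerName = Column(String, nullable=False)
    address = Column(String)
    city = Column(String)
    state = Column(String)
    zipCode = Column(String)
    latitude = Column(Float)
    longitude = Column(Float)
    phoneNumber = Column(String)
    website = Column(String)
    email = Column(String)
    refId = Column(String, unique=True)  # the site's own unique identifier
    urlSlug = Column(String)
    hours = Column(JSON)

    cars = relationship("Car", back_populates="dealer")  # 1:N side


class Car(Base):
    __tablename__ = "cars"

    carId = Column(Integer, primary_key=True)
    dealerId = Column(Integer, ForeignKey("dealerships.dealerId"))
    make = Column(String)
    model = Column(String)
    model_year = Column(Integer)
    price = Column(Integer)
    mileage = Column(Integer)
    condition = Column(String)
    main_image = Column(String)
    engine = Column(String)
    fuel = Column(String)
    vin = Column(String)
    transmission = Column(String)
    drive_train = Column(String)
    exterior_color = Column(String)
    interior_color = Column(String)
    images = Column(JSON)      # array of image URLs
    features = Column(JSON)    # array of feature strings
    fuel_economy = Column(String)
    days_listed = Column(Integer)
    refId = Column(String, unique=True)  # assumed unique per listing
    domain = Column(String)

    dealer = relationship("Dealership", back_populates="cars")
```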

2. Place – Understanding the Website Structure

While examining the website, I discovered that most data was embedded within simple HTML tags. However, there was a catch: certain important fields, such as:

  • vin
  • images
  • days_listed
  • price_changes

were only loaded via JavaScript from an internal public API:

https://api.carsforsale.com/api/vehicle/profile/retrieve
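
To give a sense of what this means in practice, here’s roughly how a direct call to that endpoint could look once a valid session exists. The HTTP method, payload shape, and auth header below are my assumptions for illustration, not documented behavior:

```python
import httpx

API_URL = "https://api.carsforsale.com/api/vehicle/profile/retrieve"

def fetch_vehicle_profile(client: httpx.Client, vehicle_id: str, jwt_token: str) -> dict:
    """Fetch the JavaScript-loaded fields (vin, images, days_listed,
    price_changes) directly from the internal API."""
    response = client.post(
        API_URL,
        json={"vehicleId": vehicle_id},                    # assumed payload shape
        headers={"Authorization": f"Bearer {jwt_token}"},  # assumed auth scheme
    )
    response.raise_for_status()
    return response.json()
```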

3. Access – Getting to the Data

For this step I think only about how to reach the data starting from a blank page. In our case it worked like this:

To access all the cars, I could perform a search without filters and navigate through the pages. I verified this approach by manually jumping to the last page (170,000th), which loaded successfully.

Three-Step Approach:

  1. Generate all search page URLs
  2. Extract the 15 vehicle listing URLs from each search page
  3. For each car, request both the individual listing page and one additional public API call (as in the earlier example)
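
Here’s a minimal sketch of steps 1 and 2. The search URL pattern and CSS selector are placeholders I’ve made up; the real values depend on the site’s markup:

```python
import httpx
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.carsforsale.com/search?page={page}"  # placeholder pattern
TOTAL_PAGES = 170_000
CARS_PER_PAGE = 15

def generate_search_urls() -> list[str]:
    # Step 1: every search page URL, no filters applied.
    return [SEARCH_URL.format(page=p) for p in range(1, TOTAL_PAGES + 1)]

def extract_listing_urls(client: httpx.Client, search_url: str) -> list[str]:
    # Step 2: pull the 15 vehicle listing URLs off one search page.
    html = client.get(search_url).text
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("a.vehicle-link")  # hypothetical selector
    return [a["href"] for a in links[:CARS_PER_PAGE]]
```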

4. Issues – Tackling Access Challenges

During implementation, I encountered three significant technical challenges that required creative solutions:

🛡️ Cloudflare Protection

The website employed Cloudflare Turnstile captcha as its first line of defense. This made conventional request-based scraping impossible without browser automation.

🔑 Session Management

The authentication system utilized two key components:

  • Session Token: Verified successful captcha completion
  • JWT Token: Required refreshing every 5 minutes

🚫 IP Blocking

The system didn’t block IPs aggressively, but it would flag and block a server IP after approximately 2 million requests.

💡 The Hybrid Solution

I developed an efficient hybrid approach that balanced security bypass with performance:

1. Browser Session Management

Use Playwright with a patched Firefox driver for brief (1-2 minute) sessions, as sketched below, to:

  1. Launch the browser in stealth mode
  2. Complete the Turnstile captcha legitimately
  3. Extract valid session cookies and tokens
  4. Transfer credentials to a lightweight HTTP client
  5. Shut down the resource-intensive browser
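
A stripped-down sketch of that handoff using the sync Playwright API; the stealth patching and captcha-completion logic are elided, and the cf_clearance cookie name is an assumption:

```python
import httpx
from playwright.sync_api import sync_playwright

START_URL = "https://www.carsforsale.com/"  # entry page behind the Turnstile check

def bootstrap_session() -> httpx.Client:
    """Run a short browser session to pass the captcha, then hand the
    resulting cookies to a lightweight HTTP client and close the browser."""
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)  # patched/stealth setup omitted
        page = browser.new_page()
        page.goto(START_URL)
        # Wait for Cloudflare to clear us; in practice you would poll for a
        # clearance cookie (e.g. cf_clearance) instead of a fixed timeout.
        page.wait_for_timeout(10_000)
        cookies = page.context.cookies()
        browser.close()  # drop the heavy browser as soon as we have credentials

    return httpx.Client(
        cookies={c["name"]: c["value"] for c in cookies},
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Firefox/120.0"},
    )
```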
2. Intelligent JWT Token Management
  • Bootstrap phase: extract the initial jwt_token and refreshToken from the HTML
  • Steady state: track token age and proactively refresh before expiry via the dedicated endpoint
  • Failsafe: if a token refresh fails twice, re-bootstrap with a fresh browser session
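
A minimal sketch of that lifecycle; the refresh endpoint and response shape are assumptions for illustration:

```python
import time
import httpx

REFRESH_URL = "https://www.carsforsale.com/api/token/refresh"  # hypothetical endpoint
TOKEN_TTL = 5 * 60  # the JWT expires every 5 minutes

class TokenManager:
    def __init__(self, client: httpx.Client, jwt_token: str, refresh_token: str):
        self.client = client
        self.jwt_token = jwt_token          # bootstrap phase: both tokens come
        self.refresh_token = refresh_token  # from the initial HTML
        self.issued_at = time.monotonic()

    def get_token(self) -> str:
        # Steady state: refresh ~30 seconds before the 5-minute expiry.
        if time.monotonic() - self.issued_at > TOKEN_TTL - 30:
            self._refresh()
        return self.jwt_token

    def _refresh(self) -> None:
        for _attempt in range(2):
            try:
                resp = self.client.post(
                    REFRESH_URL, json={"refreshToken": self.refresh_token}
                )
                resp.raise_for_status()
                data = resp.json()  # assumed response shape
                self.jwt_token = data["jwt_token"]
                self.refresh_token = data["refreshToken"]
                self.issued_at = time.monotonic()
                return
            except httpx.HTTPError:
                continue
        # Failsafe: two failed refreshes mean the session is dead;
        # the caller should re-bootstrap with a fresh browser session.
        raise RuntimeError("token refresh failed twice; re-bootstrap required")
```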

3. Minimize Footprint

Reduce requests to only:

  • One request to fetch listing IDs from search pages
  • One request for the pre-JavaScript HTML of individual listings
  • One targeted API call for additional vehicle data

Result: this approach delivered the best of both worlds: legitimate access credentials from real browser sessions, combined with the speed and resource efficiency of lightweight HTTP requests.

[Diagrams: hybrid solution architecture and JWT token management flow]

5. Speed – Optimizing the Process

I might have touched on some of these steps earlier, but I always try to solve each issue in the most optimized way. The goal is to minimize interactions with the website without triggering behavior trackers.

After testing everything, it was time to link all the steps together. Since we needed to scrape a large number of records in the shortest time possible, we had to design an optimal strategy.

At first I used a single request client with async requests to send multiple calls at the same time. But this wasted time whenever the session expired and the browser had to be reopened, so I switched to multiple clients, meaning several parallel browser-plus-request sessions.
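
A minimal sketch of the multi-client pattern with asyncio; each worker would bootstrap its own session as shown earlier, which is elided here along with parsing:

```python
import asyncio
import httpx

NUM_CLIENTS = 4  # illustrative; in practice tuned to session and IP limits

async def worker(worker_id: int, queue: asyncio.Queue) -> None:
    # Each worker owns its own credentials, so an expired session only
    # costs that worker a browser re-bootstrap, not the whole pool.
    async with httpx.AsyncClient() as client:  # cookies/tokens from its own bootstrap
        while True:
            url = await queue.get()
            try:
                resp = await client.get(url)
                resp.raise_for_status()
                # ... parse the page and hand the records to storage ...
            except httpx.HTTPError:
                await queue.put(url)  # requeue the page and keep going
            finally:
                queue.task_done()

async def main(search_urls: list[str]) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for url in search_urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(i, queue)) for i in range(NUM_CLIENTS)]
    await queue.join()  # wait until every page has been processed
    for w in workers:
        w.cancel()      # workers loop forever; cancel them once the queue drains
```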

🏗️ Scalable Distributed Architecture

🧠 Manager Service

  • Monitoring performance metrics and success rates
  • Publishing page lists to RabbitMQ work queues (sketched below)
  • Scheduling tri-weekly data extractions
  • Tracking progress across workers
  • Detecting changes between runs
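
As a sketch, the manager’s publishing side could look like this with pika; the queue name and batch format are illustrative choices, not the project’s actual configuration:

```python
import json
import pika

# Hypothetical queue name; connection parameters depend on the deployment.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="search_pages", durable=True)

def publish_page_batch(pages: list[str]) -> None:
    # The manager splits the full page list into batches and publishes each
    # batch as a persistent message for the scraper workers to consume.
    channel.basic_publish(
        exchange="",
        routing_key="search_pages",
        body=json.dumps(pages),
        properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
    )
```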

🔍 Scraper Service

  • Processing 15 concurrent car requests per page
  • Managing JWT token refresh cycles
  • Handling errors gracefully
  • Extracting data from both search and detail pages

💾 Storage Service

  • Validating data before database insertion
  • Performing efficient batch operations (sketched below)
  • Normalizing dealer data formats
  • Managing relationships between cars and dealerships
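
A sketch of that batch path, assuming the SQLAlchemy models from earlier and a PostgreSQL backend; keying the upsert on refId is my assumption:

```python
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://user:pass@localhost/cars")  # placeholder DSN

def store_cars(rows: list[dict]) -> None:
    # Batch upsert: insert new cars and update existing ones keyed on refId,
    # so repeat runs refresh prices and mileage instead of duplicating rows.
    stmt = insert(Car).values(rows)  # Car: the model sketched earlier
    stmt = stmt.on_conflict_do_update(
        index_elements=["refId"],
        set_={"price": stmt.excluded.price, "mileage": stmt.excluded.mileage},
    )
    with Session(engine) as session:
        session.execute(stmt)
        session.commit()
```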