Building a Large-Scale Automotive Data Scraper: Part 2

Technical Implementation

This is Part 2 of a multi-part series on building large-scale web scrapers.

Introduction

In Part 1 I talked about the project in general, how I discussed it with the client, and how I planned to structure it. In today’s post I’ll walk through a more detailed approach to building this project.

The 5-Step SNAP Strategy

For every scraping project I follow a strategy I developed myself, and it has worked for every one of them:

  1. Need – define the targets
  2. Place – locate the structure
  3. Access – determine delivery
  4. Issues – note blockers
  5. Speed – optimize the path

Here’s how I applied it to this website:

1. Need – Defining the Targets

In our case the data was straightforward. We needed information about the cars listed for sale on the website, along with information about every car dealer.

Data Models Overview

Dealership Model

  • dealerId (primary key)
  • dealerName
  • address, city, state, zipCode
  • latitude, longitude
  • phoneNumber, website, email
  • refId (unique identifier)
  • urlSlug
  • hours

Car Model

  • carId (primary key)
  • dealerId (foreign key to Dealership)
  • make, model, model_year
  • price, mileage
  • condition, main_image
  • engine, fuel, vin
  • transmission, drive_train
  • exterior_color, interior_color
  • images, features (arrays)
  • fuel_economy, days_listed
  • refId, domain

Database Relationship Model

A Dealership has many Cars (1:N): each Car row carries a dealerId foreign key referencing the Dealership primary key (dealerId).
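
To make the relationship concrete, here’s a minimal sketch of the two models in SQLAlchemy. The ORM is my choice for illustration (the article doesn’t prescribe one), and the column types are assumptions based on the field lists above:

```python
from sqlalchemy import Column, Integer, Float, String, JSON, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Dealership(Base):
    __tablename__ = "dealerships"

    dealerId = Column(Integer, primary_key=True)
    dealerName = Column(String, nullable=False)
    address = Column(String)
    city = Column(String)
    state = Column(String)
    zipCode = Column(String)
    latitude = Column(Float)
    longitude = Column(Float)
    phoneNumber = Column(String)
    website = Column(String)
    email = Column(String)
    refId = Column(String, unique=True)  # the site's own unique identifier
    urlSlug = Column(String)
    hours = Column(JSON)

    cars = relationship("Car", back_populates="dealer")  # 1:N side


class Car(Base):
    __tablename__ = "cars"

    carId = Column(Integer, primary_key=True)
    dealerId = Column(Integer, ForeignKey("dealerships.dealerId"))
    make = Column(String)
    model = Column(String)
    model_year = Column(Integer)
    price = Column(Integer)
    mileage = Column(Integer)
    condition = Column(String)
    main_image = Column(String)
    engine = Column(String)
    fuel = Column(String)
    vin = Column(String)
    transmission = Column(String)
    drive_train = Column(String)
    exterior_color = Column(String)
    interior_color = Column(String)
    images = Column(JSON)      # array of image URLs
    features = Column(JSON)    # array of feature strings
    fuel_economy = Column(String)
    days_listed = Column(Integer)
    refId = Column(String, unique=True)  # assumed unique per listing
    domain = Column(String)

    dealer = relationship("Dealership", back_populates="cars")
```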

2. Place – Understanding the Website Structure

While examining the website, I discovered that most data was embedded within simple HTML tags. However, there was a catch: certain important fields, such as:

  • vin
  • images
  • days_listed
  • price_changes

were only loaded via JavaScript from an internal public API:

https://api.carsforsale.com/api/vehicle/profile/retrieve
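
To give a sense of what this means in practice, here’s roughly how a direct call to that endpoint could look once a valid session exists. The HTTP method, payload shape, and auth header below are my assumptions for illustration, not documented behavior:

```python
import httpx

API_URL = "https://api.carsforsale.com/api/vehicle/profile/retrieve"

def fetch_vehicle_profile(client: httpx.Client, vehicle_id: str, jwt_token: str) -> dict:
    """Fetch the JavaScript-loaded fields (vin, images, days_listed,
    price_changes) directly from the internal API."""
    response = client.post(
        API_URL,
        json={"vehicleId": vehicle_id},                    # assumed payload shape
        headers={"Authorization": f"Bearer {jwt_token}"},  # assumed auth scheme
    )
    response.raise_for_status()
    return response.json()
```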

3. Access – Getting to the Data

For this step I think only about how to reach the data starting from a blank page. In our case it worked like this:

To access all the cars, I could perform a search without filters and navigate through the pages. I verified this approach by manually jumping to the last page (170,000th), which loaded successfully.

Three-Step Approach:

  1. Generate all search page URLs
  2. Extract the 15 vehicle listing URLs from each search page
  3. For each car, request both the individual listing page and one additional public API call (as in the earlier example)
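
Here’s a minimal sketch of steps 1 and 2. The search URL pattern and CSS selector are placeholders I’ve made up; the real values depend on the site’s markup:

```python
import httpx
from bs4 import BeautifulSoup

SEARCH_URL = "https://www.carsforsale.com/search?page={page}"  # placeholder pattern
TOTAL_PAGES = 170_000
CARS_PER_PAGE = 15

def generate_search_urls() -> list[str]:
    # Step 1: every search page URL, no filters applied.
    return [SEARCH_URL.format(page=p) for p in range(1, TOTAL_PAGES + 1)]

def extract_listing_urls(client: httpx.Client, search_url: str) -> list[str]:
    # Step 2: pull the 15 vehicle listing URLs off one search page.
    html = client.get(search_url).text
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("a.vehicle-link")  # hypothetical selector
    return [a["href"] for a in links[:CARS_PER_PAGE]]
```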

4. Issues – Tackling Access Challenges

During implementation, I encountered three significant technical challenges that required creative solutions:

🛡️ Cloudflare Protection

The website employed Cloudflare Turnstile captcha as its first line of defense. This made conventional request-based scraping impossible without browser automation.

🔑 Session Management

The authentication system utilized two key components:

  • Session Token: Verified successful captcha completion
  • JWT Token: Required refreshing every 5 minutes

🚫 IP Blocking

The system didn’t block IPs aggressively, but it would flag and block a server IP after approximately 2 million requests.

💡 The Hybrid Solution

I developed an efficient hybrid approach that balanced security bypass with performance:

1. Browser Session Management

Use Playwright with a patched Firefox driver for brief (1-2 minute) sessions, as sketched below, to:

  1. Launch the browser in stealth mode
  2. Complete the Turnstile captcha legitimately
  3. Extract valid session cookies and tokens
  4. Transfer credentials to a lightweight HTTP client
  5. Shut down the resource-intensive browser
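
A stripped-down sketch of that handoff using the sync Playwright API; the stealth patching and captcha-completion logic are elided, and the cf_clearance cookie name is an assumption:

```python
import httpx
from playwright.sync_api import sync_playwright

START_URL = "https://www.carsforsale.com/"  # entry page behind the Turnstile check

def bootstrap_session() -> httpx.Client:
    """Run a short browser session to pass the captcha, then hand the
    resulting cookies to a lightweight HTTP client and close the browser."""
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)  # patched/stealth setup omitted
        page = browser.new_page()
        page.goto(START_URL)
        # Wait for Cloudflare to clear us; in practice you would poll for a
        # clearance cookie (e.g. cf_clearance) instead of a fixed timeout.
        page.wait_for_timeout(10_000)
        cookies = page.context.cookies()
        browser.close()  # drop the heavy browser as soon as we have credentials

    return httpx.Client(
        cookies={c["name"]: c["value"] for c in cookies},
        headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Firefox/120.0"},
    )
```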
2. Intelligent JWT Token Management
  • Bootstrap phase: extract the initial jwt_token and refreshToken from the HTML
  • Steady state: track token age and proactively refresh before expiry via the dedicated endpoint
  • Failsafe: if a token refresh fails twice, re-bootstrap with a fresh browser session
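
A minimal sketch of that lifecycle; the refresh endpoint and response shape are assumptions for illustration:

```python
import time
import httpx

REFRESH_URL = "https://www.carsforsale.com/api/token/refresh"  # hypothetical endpoint
TOKEN_TTL = 5 * 60  # the JWT expires every 5 minutes

class TokenManager:
    def __init__(self, client: httpx.Client, jwt_token: str, refresh_token: str):
        self.client = client
        self.jwt_token = jwt_token          # bootstrap phase: both tokens come
        self.refresh_token = refresh_token  # from the initial HTML
        self.issued_at = time.monotonic()

    def get_token(self) -> str:
        # Steady state: refresh ~30 seconds before the 5-minute expiry.
        if time.monotonic() - self.issued_at > TOKEN_TTL - 30:
            self._refresh()
        return self.jwt_token

    def _refresh(self) -> None:
        for _attempt in range(2):
            try:
                resp = self.client.post(
                    REFRESH_URL, json={"refreshToken": self.refresh_token}
                )
                resp.raise_for_status()
                data = resp.json()  # assumed response shape
                self.jwt_token = data["jwt_token"]
                self.refresh_token = data["refreshToken"]
                self.issued_at = time.monotonic()
                return
            except httpx.HTTPError:
                continue
        # Failsafe: two failed refreshes mean the session is dead;
        # the caller should re-bootstrap with a fresh browser session.
        raise RuntimeError("token refresh failed twice; re-bootstrap required")
```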

3. Minimize Footprint

Reduce requests to only:

  • One request to fetch listing IDs from search pages
  • One request for the pre-JavaScript HTML of individual listings
  • One targeted API call for additional vehicle data

Result: this approach delivered the best of both worlds: legitimate access credentials from real browser sessions, combined with the speed and resource efficiency of lightweight HTTP requests.

[Diagrams: hybrid solution architecture and JWT token management flow]

5. Speed – Optimizing the Process

I might have touched on some of these steps earlier, but I always try to solve each issue in the most optimized way. The goal is to minimize interactions with the website without triggering behavior trackers.

After testing everything, it was time to link all the steps together. Since we needed to scrape a large number of records in the shortest time possible, we had to design an optimal strategy.

At first I used a single request client with async requests to send multiple calls at the same time. But this wasted time whenever the session expired and the browser had to be reopened, so I switched to multiple clients, meaning several parallel browser-plus-request sessions.
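
A minimal sketch of the multi-client pattern with asyncio; each worker would bootstrap its own session as shown earlier, which is elided here along with parsing:

```python
import asyncio
import httpx

NUM_CLIENTS = 4  # illustrative; in practice tuned to session and IP limits

async def worker(worker_id: int, queue: asyncio.Queue) -> None:
    # Each worker owns its own credentials, so an expired session only
    # costs that worker a browser re-bootstrap, not the whole pool.
    async with httpx.AsyncClient() as client:  # cookies/tokens from its own bootstrap
        while True:
            url = await queue.get()
            try:
                resp = await client.get(url)
                resp.raise_for_status()
                # ... parse the page and hand the records to storage ...
            except httpx.HTTPError:
                await queue.put(url)  # requeue the page and keep going
            finally:
                queue.task_done()

async def main(search_urls: list[str]) -> None:
    queue: asyncio.Queue = asyncio.Queue()
    for url in search_urls:
        queue.put_nowait(url)
    workers = [asyncio.create_task(worker(i, queue)) for i in range(NUM_CLIENTS)]
    await queue.join()  # wait until every page has been processed
    for w in workers:
        w.cancel()      # workers loop forever; cancel them once the queue drains
```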

🏗️ Scalable Distributed Architecture

🧠 Manager Service

  • Monitoring performance metrics and success rates
  • Publishing page lists to RabbitMQ work queues (sketched below)
  • Scheduling tri-weekly data extractions
  • Tracking progress across workers
  • Detecting changes between runs
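
As a sketch, the manager’s publishing side could look like this with pika; the queue name and batch format are illustrative choices, not the project’s actual configuration:

```python
import json
import pika

# Hypothetical queue name; connection parameters depend on the deployment.
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="search_pages", durable=True)

def publish_page_batch(pages: list[str]) -> None:
    # The manager splits the full page list into batches and publishes each
    # batch as a persistent message for the scraper workers to consume.
    channel.basic_publish(
        exchange="",
        routing_key="search_pages",
        body=json.dumps(pages),
        properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
    )
```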

🔍 Scraper Service

  • Processing 15 concurrent car requests per page
  • Managing JWT token refresh cycles
  • Handling errors gracefully
  • Extracting data from both search and detail pages

💾 Storage Service

  • Validating data before database insertion
  • Performing efficient batch operations (sketched below)
  • Normalizing dealer data formats
  • Managing relationships between cars and dealerships
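
A sketch of that batch path, assuming the SQLAlchemy models from earlier and a PostgreSQL backend; keying the upsert on refId is my assumption:

```python
from sqlalchemy import create_engine
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

engine = create_engine("postgresql+psycopg2://user:pass@localhost/cars")  # placeholder DSN

def store_cars(rows: list[dict]) -> None:
    # Batch upsert: insert new cars and update existing ones keyed on refId,
    # so repeat runs refresh prices and mileage instead of duplicating rows.
    stmt = insert(Car).values(rows)  # Car: the model sketched earlier
    stmt = stmt.on_conflict_do_update(
        index_elements=["refId"],
        set_={"price": stmt.excluded.price, "mileage": stmt.excluded.mileage},
    )
    with Session(engine) as session:
        session.execute(stmt)
        session.commit()
```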