AutoTrader Car Listings Scraper: Development Experience


1. Introduction

Brief Overview of Project Goals

The aim of the AutoTrader Car Listings Scraper project was to create a robust, efficient, and scalable system to extract car listing data from AutoTrader UK. The key goals of the project were:

  1. To develop a scraper capable of navigating AutoTrader’s complex search interface and extracting listings across all makes, models, and regions.
  2. To create a system that could handle the large volume of data on AutoTrader, potentially millions of listings, without missing any.
  3. To design a scraper that could extract detailed information from individual listing pages, including specifications, pricing, and seller information.
  4. To implement a distributed architecture that could scale horizontally to improve scraping speed and efficiency.
  5. To build a reliable system that could handle network issues, anti-scraping measures, and other potential obstacles.

Initial Expectations and Planning

When I started working on this project, I had several expectations and plans:

  1. Timeline: I initially estimated the project would take about 1 week to complete, from initial design to a fully functional system.

  2. Technical Stack: My plan was to use Python as the primary language, with DrissionPage for browser automation and PostgreSQL for data storage. I expected to use Docker for containerization to ease deployment and scaling.

  3. Challenges: I anticipated that the main challenges would be handling AutoTrader’s anti-scraping measures and managing the large volume of data. I underestimated the complexity of optimizing the search process to ensure comprehensive coverage.

As the project progressed, many of these initial expectations were challenged.

2. Technical Challenges

2.1 Web Scraping Challenges

  1. Result limiting: To access all the listings on the site, I first tried a search with no filters. Each results page shows 10 listings (plus other ads), but the website only lets you view 100 pages before asking you to narrow your filters, which means a single filter combination can surface at most about 1,000 listings. Full coverage therefore requires splitting the search into narrower filters (see the sketch after this list).

  2. Content Access: The website is protected by Cloudflare, which means you can't fetch content with a plain HTTP client such as Python's requests library; the only way in is through a real browser.

  3. Scaling: Using browser automation to extract every detail would be very slow and consume a lot of resources (proxy bandwidth, CPU, memory, time). How this was solved is described in section 2.2, Anti-Scraping Measures and Browser Automation.
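
The practical consequence of the result limit is that broad searches have to be broken down into narrower ones. Below is a minimal sketch of that idea in Python, assuming hypothetical count_results() and collect_listings() helpers that wrap the real search calls; it recursively splits a price range until each sub-filter stays under the roughly 1,000 listings the site will actually show.

MAX_ACCESSIBLE = 1000  # ~100 pages x 10 listings per page

def scrape_range(min_price, max_price, count_results, collect_listings):
    """Collect every listing in [min_price, max_price] despite the page cap."""
    total = count_results(min_price, max_price)
    if total == 0:
        return []
    if total <= MAX_ACCESSIBLE or min_price >= max_price:
        # Small enough to paginate through completely.
        return collect_listings(min_price, max_price)
    # Too many results for one filter: split the range in half and recurse.
    mid = (min_price + max_price) // 2
    return (scrape_range(min_price, mid, count_results, collect_listings)
            + scrape_range(mid + 1, max_price, count_results, collect_listings))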

2.2 Anti-Scraping Measures and Browser Automation

The first plan was to use DrissionPage. It is one of the most advanced libraries for bypassing bot detection: unlike Selenium, it doesn't drive the browser through a standard automation API and leaves no trace of automation. It worked really well for loading pages and downloading their content. The only problem was that it takes a lot of resources and would cost more to scale.
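
For reference, loading a page with DrissionPage looks roughly like this (a minimal sketch; the URL is only illustrative):

from DrissionPage import ChromiumPage

page = ChromiumPage()                                  # start (or attach to) a Chromium instance
page.get('https://www.autotrader.co.uk/car-search')    # illustrative URL
html = page.html                                       # rendered page source, after JavaScript has run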

Then I had the idea of combining plain requests with browser automation. The website uses GraphQL to pull data from its database, and I found I could execute fetch requests from inside the browser and get that same data directly, so I built a small trick to automate it.

You only need to open a browser, navigate to the AutoTrader website, and run this code to execute requests:

# `url` is the site's GraphQL endpoint and `data` the query payload; the fetch
# runs inside the already-open AutoTrader tab, so Cloudflare cookies and
# headers are reused automatically.
js_code = f"""
return fetch("{url}", {{
    method: 'POST',
    headers: {{
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0",
        "Accept": "*/*",
        "Accept-Language": "en-US,en;q=0.5",
        "content-type": "application/json",
        "x-sauron-app-name": "sauron-search-results-app",
        "x-sauron-app-version": "4e22f98c66",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "Priority": "u=4"
    }},
    body: JSON.stringify({json.dumps(data)})
}})
.then(response => response.text())
.catch(error => error.toString());
"""
res = self.browser.execute_js_code(js_code)  # project wrapper around the driver's JS executor

This simple code snippet allowed me to execute requests and pull any data from the website's database without loading pages or getting blocked by Cloudflare.

This is where things started to look good: I rewrote all the code responsible for extracting listings from search pages and for extracting individual listing data.

After finishing everything, I wanted to deploy the app, but I ran into a lot of problems running DrissionPage inside a Docker container; I think the library still needs a lot of improvement there. I had to switch to SeleniumBase and use its undetected Chrome feature. It worked well, though still not as well as DrissionPage at bypassing Cloudflare. However, after combining it with the BrowserPool it performed well, because the pool keeps regenerating the browser to avoid getting blocked.
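
A rough sketch of that pattern, using SeleniumBase's UC mode (the pool class and its recycling threshold are illustrative, not the project's exact implementation):

from seleniumbase import Driver

class BrowserPool:
    """Hands out an undetected-Chrome driver and recycles it periodically."""

    def __init__(self, max_uses=50):
        self.max_uses = max_uses   # regenerate the browser after this many uses
        self.uses = 0
        self.driver = None

    def get(self):
        # Start a fresh browser the first time, or once the current one is worn out.
        if self.driver is None or self.uses >= self.max_uses:
            if self.driver is not None:
                self.driver.quit()
            self.driver = Driver(uc=True)  # SeleniumBase undetected-Chrome mode
            self.uses = 0
        self.uses += 1
        return self.driver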

2.3 Performance Optimization

The new approach of combining request automation with browser automation saved time and resources, because we no longer load full pages or wait for JavaScript to run.

The GraphQL requests allowed me to pull data for multiple listings (up to 8) in a single request, which further improved speed; a rough sketch of the batching follows.
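
A minimal sketch of that batching, assuming a hypothetical fetch_listing_batch() helper that wraps the GraphQL call shown earlier:

BATCH_SIZE = 8  # observed maximum number of listings returned per request

def fetch_all_details(listing_ids, fetch_listing_batch):
    """Pull details for every listing id, BATCH_SIZE ids per GraphQL request."""
    results = []
    for i in range(0, len(listing_ids), BATCH_SIZE):
        batch = listing_ids[i:i + BATCH_SIZE]
        results.extend(fetch_listing_batch(batch))  # one request covers up to 8 listings
    return results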

2.4 Data Processing and Storage

I used the Peewee ORM and designed the database around two tables, Dealer and VehicleListing.

# BaseModel is the project's shared Peewee base class that binds models to the
# database; JSONField is assumed to come from playhouse.postgres_ext, given the
# PostgreSQL backend.
from peewee import (CharField, TextField, FloatField, IntegerField,
                    DecimalField, DateField, BooleanField, ForeignKeyField)
from playhouse.postgres_ext import JSONField

class Dealer(BaseModel):
    dealer_id = CharField(unique=True, primary_key=True)
    name = CharField(null=True)
    address_one = CharField(null=True)
    address_two = CharField(null=True)
    town = CharField(null=True)
    county = CharField(null=True)
    postcode = CharField(null=True)
    phone = CharField(null=True)
    website = TextField(null=True)
    review_rating = FloatField(null=True)
    review_count = IntegerField(null=True)
    stock_count = IntegerField(null=True)

class VehicleListing(BaseModel):
    vehicle_id = CharField(unique=True, primary_key=True)

    # Basic Info
    title = CharField()
    price = DecimalField(max_digits=10, decimal_places=2)
    condition = CharField(null=True)
    year = IntegerField(null=True)
    registration_date = DateField(null=True)
    registration = CharField(null=True)
    mileage = IntegerField(null=True)
    mileage_unit = CharField(null=True)
    colour = CharField(null=True)
    owners = IntegerField(null=True)
    service_history = CharField(null=True)
    mot_expiry = DateField(null=True)
    attention_grabber = CharField(null=True)
    description = TextField(null=True)

    # Vehicle Specifications
    make = CharField(null=True)
    model = CharField(null=True)
    trim = CharField(null=True)
    body_type = CharField(null=True)
    fuel_type = CharField(null=True)
    transmission = CharField(null=True)
    drivetrain = CharField(null=True)
    doors = IntegerField(null=True)
    seats = IntegerField(null=True)
    engine_size_litres = FloatField(null=True)
    engine_size_cc = IntegerField(null=True)
    engine_power_ps = IntegerField(null=True)
    emission_class = CharField(null=True)
    co2_emissions = IntegerField(null=True)

    # Performance and Economy
    top_speed = CharField(null=True)
    zero_to_sixty_two_mph = CharField(null=True)
    fuel_consumption_combined = CharField(null=True)
    fuel_consumption_extra_urban = CharField(null=True)
    fuel_consumption_urban = CharField(null=True)
    annual_tax_standard_rate = CharField(null=True)

    # Dimensions
    length = CharField(null=True)
    width = CharField(null=True)
    height = CharField(null=True)
    wheelbase = CharField(null=True)

    # Additional Services
    part_exchange_available = BooleanField(null=True)
    finance_available = BooleanField(null=True)

    # Vehicle Check
    vehicle_check_status = CharField(null=True)
    checks_passed = CharField(null=True)

    # Dealer or Seller Info
    dealer = ForeignKeyField(Dealer, backref='listings', null=True)
    seller_name = CharField(null=True)
    seller_type = CharField(null=True)
    seller_location = CharField(null=True)
    seller_phone = CharField(null=True)

    # JSON Fields
    images = JSONField(null=True)
    battery_info = JSONField(null=True)
    key_features = JSONField(null=True)
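
Because the same listings keep reappearing across scrapes, writes are easiest to express as upserts. Here is a minimal sketch using Peewee's insert / on_conflict, assuming a plain dict of already-parsed fields named as in the model above:

def save_listing(listing: dict):
    # Insert the row, or refresh the fields most likely to change if the
    # vehicle_id already exists.
    (VehicleListing
     .insert(**listing)
     .on_conflict(
         conflict_target=[VehicleListing.vehicle_id],
         update={VehicleListing.price: listing['price'],
                 VehicleListing.mileage: listing.get('mileage')})
     .execute())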

3. Architecture and Scalability

System Overview

The scraper is designed as a distributed system with two main components: the Listings Scraper and the Listings Info Scraper. These components work together to provide a comprehensive car listing database.

Data Flow

  1. The Listings Scraper initiates the process by searching for car listings using optimized filters.
  2. Basic listing information is extracted and processed.
  3. Processed listing data is sent as messages to RabbitMQ.
  4. The Listings Info Scraper consumes these messages from RabbitMQ.
  5. Detailed information is scraped for each listing.
  6. Scraped data is parsed, processed, and stored in the database.

Integration with RabbitMQ

RabbitMQ serves as the message broker between the Listings Scraper and Listings Info Scraper:

  • The Listings Scraper acts as a producer, sending messages containing basic listing information and URLs for detailed scraping.
  • The Listings Info Scraper acts as a consumer, receiving these messages and performing detailed scraping tasks.
  • This architecture allows for scalable and distributed processing, as multiple instances of the Listings Info Scraper can consume messages concurrently.

This design enables efficient, scalable scraping of car listings, with clear separation of concerns between initial listing discovery and detailed information gathering.
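
As an illustration of that wiring, here is a minimal producer/consumer sketch using pika; the queue name, connection details, and the handle_listing() callback are assumptions rather than the project's exact code:

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='listings', durable=True)  # queue survives broker restarts

# Producer side (Listings Scraper): publish one message per discovered listing.
def publish_listing(listing):
    channel.basic_publish(
        exchange='',
        routing_key='listings',
        body=json.dumps(listing),
        properties=pika.BasicProperties(delivery_mode=2))  # persistent message

# Consumer side (Listings Info Scraper): scrape details, store them, then ack.
def on_message(ch, method, properties, body):
    listing = json.loads(body)
    handle_listing(listing)  # hypothetical: fetch, parse and store the full record
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)            # one unacknowledged message per worker
channel.basic_consume(queue='listings', on_message_callback=on_message)
channel.start_consuming()                      # blocks, processing messages as they arrive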

4. Conclusion

Developing the AutoTrader Car Listings Scraper has been a challenging yet rewarding experience. The project pushed the boundaries of my technical skills and problem-solving abilities, particularly in areas of web scraping, distributed systems, and performance optimization.

Key successes include:

  1. Overcoming AutoTrader’s complex search interface and result limiting mechanisms.
  2. Developing an innovative approach combining browser automation and direct requests for efficient data extraction.
  3. Implementing a scalable, distributed architecture using RabbitMQ for seamless coordination between components.

Challenges that led to significant learning experiences:

  1. Adapting to AutoTrader’s anti-scraping measures and Cloudflare protection.
  2. Optimizing resource usage while maintaining scraping efficiency.
  3. Balancing the trade-offs between different browser automation tools.

The project ultimately exceeded initial expectations in terms of functionality and efficiency, though it required more time and problem-solving than initially anticipated.

© 2024 Mohsine Maiet