
Forecasting Used Phone Prices with ML — IEEE ICAISS 2023

November 2023 · 8 min read · Srikanth Badavath


This post covers the research behind my IEEE paper "Forecasting the Prices using Machine Learning Techniques: Special Reference to used Mobile Phones", published at the Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS 2023). Read the full paper on IEEE Xplore →

The Problem

The secondhand smartphone market is enormous but opaque. Sellers price phones based on gut feeling, and buyers overpay or miss good deals because there's no transparent pricing model. Sites like eBay and OLX show wildly varying prices for the same model in the same condition. Can machine learning build a reliable price predictor?

The answer is yes — but the real challenge was in the data, not the model.

End-to-End ML Pipeline

Data Collection (eBay · OLX · Flipkart, n=3,200 listings)
→ Data Cleaning (duplicates, nulls, outlier removal)
→ EDA (distributions, correlation matrix)
→ Feature Engineering (depreciation ratio, price-per-GB, age buckets)
→ Encoding (ordinal: condition; one-hot: brand, OS)
→ Train/Test Split (80/20)
→ Model Evaluation (7 algorithms)
→ Best Model: XGBoost (R² = 0.91)
→ SHAP Analysis (feature importance)

Dataset Construction

We collected data on used smartphones across multiple platforms, capturing 13 fields (12 predictive features plus the target):

| Feature | Type | Notes |
|---|---|---|
| brand | Categorical | Apple, Samsung, OnePlus, Xiaomi, etc. |
| model | Categorical | High cardinality; grouped by generation |
| ram_gb | Numeric | 4, 6, 8, 12, 16 GB |
| storage_gb | Numeric | 64, 128, 256, 512 GB |
| battery_mah | Numeric | 2000–6000 mAh range |
| camera_mp | Numeric | Primary rear camera megapixels |
| screen_size_inch | Numeric | 5.0–7.0 inches |
| age_months | Numeric | Months since original release |
| condition | Ordinal | New > Like New > Good > Fair |
| has_5g | Binary | 0/1 |
| os | Categorical | iOS / Android |
| launch_price_usd | Numeric | Original MSRP at launch |
| used_price_usd | Numeric | Target variable |

import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('used_phones.csv')

# Remove per-model price outliers (> 3 std from that model's mean).
# Guard small groups: z-scores are undefined for a single listing and
# the original filter would silently drop those rows.
df = df.groupby('model', group_keys=False).apply(
    lambda g: g if len(g) < 3 else g[np.abs(stats.zscore(g['used_price_usd'])) < 3]
).reset_index(drop=True)

# Condition: ordinal encoding (order matters)
condition_map = {'New': 4, 'Like New': 3, 'Good': 2, 'Fair': 1}
df['condition_score'] = df['condition'].map(condition_map)

print(f"Final dataset: {len(df)} listings, {df['brand'].nunique()} brands")
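The encoding choice matters: condition has a natural order, so ranked integers preserve it, while brand has no order and gets one-hot columns later in the pipeline. A minimal sketch of both on a toy frame (not the paper's data):

```python
import pandas as pd

toy = pd.DataFrame({'condition': ['New', 'Good', 'Like New'],
                    'brand': ['Apple', 'Xiaomi', 'Samsung']})

# Ordinal: order is meaningful, so map to ranked integers
condition_map = {'New': 4, 'Like New': 3, 'Good': 2, 'Fair': 1}
toy['condition_score'] = toy['condition'].map(condition_map)

# One-hot: brand has no order, so each level becomes its own 0/1 column
toy = pd.get_dummies(toy, columns=['brand'])

print(toy.columns.tolist())
# ['condition', 'condition_score', 'brand_Apple', 'brand_Samsung', 'brand_Xiaomi']
```

Using ordinal codes for brand would invent a ranking (Apple > Samsung?) that the model would then try to exploit, which is exactly why the two columns are treated differently.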

Feature Engineering

Raw features aren't always the most predictive. Four engineered features improved model performance significantly:

def engineer_features(df):
    # 1. Depreciation ratio: how fast does this brand/model lose value?
    # Captures brand premium and perceived durability.
    # Caution: this divides the target by launch price, so at inference
    # time it must come from historical brand/model averages rather than
    # the listing's own (unknown) used price.
    df['depreciation_ratio'] = df['used_price_usd'] / df['launch_price_usd']
    # Apple ~0.65 (holds value), budget Android ~0.25

    # 2. Price-per-GB storage: normalizes across storage tiers
    df['price_per_gb'] = df['launch_price_usd'] / df['storage_gb']

    # 3. Age buckets: phone depreciation is non-linear
    # Drops steeply year 1, flattens after year 2
    df['age_bucket'] = pd.cut(
        df['age_months'],
        bins=[0, 6, 12, 24, float('inf')],
        labels=['0-6mo', '6-12mo', '1-2yr', '2yr+']
    )

    # 4. Flagship flag: premium models depreciate differently
    flagship_models = ['iPhone 14 Pro', 'iPhone 15', 'Galaxy S23 Ultra',
                       'Galaxy S24', 'Pixel 8 Pro']
    df['is_flagship'] = df['model'].isin(flagship_models).astype(int)

    return df

df = engineer_features(df)
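A quick sanity check of the ratio and bucket logic on a two-row toy frame (illustrative numbers, not real listings):

```python
import pandas as pd

toy = pd.DataFrame({
    'launch_price_usd': [999.0, 299.0],
    'used_price_usd':   [649.0, 75.0],
    'age_months':       [14, 30],
    'storage_gb':       [256, 64],
})

# Same formulas as engineer_features, applied inline
toy['depreciation_ratio'] = toy['used_price_usd'] / toy['launch_price_usd']
toy['price_per_gb'] = toy['launch_price_usd'] / toy['storage_gb']
toy['age_bucket'] = pd.cut(toy['age_months'],
                           bins=[0, 6, 12, 24, float('inf')],
                           labels=['0-6mo', '6-12mo', '1-2yr', '2yr+'])

print(toy[['depreciation_ratio', 'age_bucket']])
# ratios ~0.65 and ~0.25; buckets '1-2yr' and '2yr+'
```

Note that `pd.cut` uses right-closed intervals by default, so a 24-month-old phone lands in the '1-2yr' bucket, not '2yr+'.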

Models Evaluated

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# One-hot encode brand so the brand_* dummy columns below exist
df = pd.get_dummies(df, columns=['brand'])

feature_cols = ['ram_gb', 'storage_gb', 'battery_mah', 'camera_mp',
                'age_months', 'condition_score', 'has_5g',
                'launch_price_usd', 'depreciation_ratio',
                'price_per_gb', 'is_flagship',
                'brand_Apple', 'brand_Samsung']  # OHE brand columns

X = df[feature_cols]
y = df['used_price_usd']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge':             Ridge(alpha=1.0),
    'Lasso':             Lasso(alpha=0.1),
    'Decision Tree':     DecisionTreeRegressor(max_depth=8, random_state=42),
    'Random Forest':     RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=300,
                             learning_rate=0.05, random_state=42),
    'XGBoost':           xgb.XGBRegressor(
                             n_estimators=500,
                             learning_rate=0.05,
                             max_depth=6,
                             subsample=0.8,
                             colsample_bytree=0.8,
                             random_state=42
                         ),
    'SVR':               SVR(kernel='rbf', C=100, gamma=0.1),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {
        'R2':   r2_score(y_test, preds),
        'MAE':  mean_absolute_error(y_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_test, preds))
    }

results_df = pd.DataFrame(results).T.sort_values('R2', ascending=False)
print(results_df)
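The single 80/20 split above yields one, somewhat noisy, estimate per model; the unused cross_val_score import hints at the more stable alternative. A dependency-free sketch of how k-fold index rotation works (the splitting logic only, not the paper's evaluation):

```python
import random

def kfold_indices(n, k=5, seed=42):
    """Yield (train_idx, test_idx) pairs; each sample lands in exactly one test fold."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # round-robin assignment after shuffling
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, f in enumerate(folds) if f_idx != i for j in f]
        yield train, test

# With n=10 and k=5, every fold tests on 2 samples and trains on the other 8
splits = list(kfold_indices(10, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Averaging the metric over the k test folds reduces the variance that comes from any one lucky or unlucky split.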

Results

| Model | R² Score | MAE (USD) | RMSE (USD) |
|---|---|---|---|
| XGBoost | 0.91 | $28.4 | $41.2 |
| Random Forest | 0.88 | $33.1 | $48.7 |
| Gradient Boosting | 0.87 | $35.6 | $51.3 |
| Decision Tree | 0.79 | $44.2 | $64.8 |
| SVR | 0.76 | $48.7 | $69.4 |
| Ridge | 0.68 | $57.3 | $80.1 |
| Linear Regression | 0.65 | $61.0 | $84.7 |

Linear regression underperformed because phone depreciation is non-linear — price drops steeply in year one and flattens after year two. Tree-based models capture this naturally through recursive splits on age_months.
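To see why, here is a dependency-free toy comparison on made-up retention numbers: a global least-squares line versus a single split on age that predicts each side's mean, i.e. a depth-1 regression tree. Deeper trees simply recurse on this step.

```python
# Illustrative data: fraction of launch price retained vs. age in months
# (steep early drop, then a plateau; not the paper's measurements)
ages = [1, 3, 6, 9, 12, 18, 24, 30, 36, 48]
vals = [0.95, 0.88, 0.78, 0.70, 0.62, 0.55, 0.50, 0.48, 0.47, 0.45]

def sse(ys, pred):
    """Sum of squared errors of a constant prediction."""
    return sum((y - pred) ** 2 for y in ys)

# Ordinary least-squares line fit, closed form
n = len(ages)
mx, my = sum(ages) / n, sum(vals) / n
slope = (sum((x - mx) * (y - my) for x, y in zip(ages, vals))
         / sum((x - mx) ** 2 for x in ages))
intercept = my - slope * mx
linear_sse = sum((y - (slope * x + intercept)) ** 2
                 for x, y in zip(ages, vals))

def best_split_sse(xs, ys):
    """Best single threshold: predict each side's mean (a one-node tree)."""
    best = float('inf')
    for t in xs[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        s = sse(left, sum(left) / len(left)) + sse(right, sum(right) / len(right))
        best = min(best, s)
    return best

stump_sse = best_split_sse(ages, vals)
print(f"linear SSE={linear_sse:.4f}, one-split SSE={stump_sse:.4f}")
```

Even this single split beats the global line on curved data, because the line is forced to over-predict in the middle and under-predict at both ends.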

SHAP Feature Importance

SHAP (SHapley Additive exPlanations) revealed which features actually drove predictions — not just model coefficients:

import shap

explainer = shap.TreeExplainer(models['XGBoost'])
shap_values = explainer.shap_values(X_test)

# Top features by mean |SHAP value|
shap.summary_plot(shap_values, X_test, plot_type="bar")
# Results (approximate ranking):
# 1. launch_price_usd    — 0.38  (most important)
# 2. ram_gb              — 0.22
# 3. age_months          — 0.19
# 4. depreciation_ratio  — 0.14
# 5. brand_Apple         — 0.11
# 6. condition_score     — 0.09
# ...

# SHAP dependence plot: how the age_months contribution varies with age
shap.dependence_plot('age_months', shap_values, X_test)
# Reveals: steep SHAP drop from 0-12 months, plateau after 24 months
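The bar chart's ranking is nothing more than the mean absolute SHAP value per feature. A tiny numpy sketch of that aggregation, using made-up values for three samples and two features (not the paper's SHAP matrix):

```python
import numpy as np

# Rows = samples, columns = features (illustrative values only)
sv = np.array([[ 0.40, -0.05],
               [-0.35,  0.10],
               [ 0.30, -0.02]])
feature_names = ['launch_price_usd', 'ram_gb']

# Mean |SHAP| per feature: exactly what summary_plot(plot_type="bar") displays
importance = np.abs(sv).mean(axis=0)
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(dict(zip(feature_names, importance.round(3))), ranking)
```

Taking the absolute value first matters: a feature that pushes predictions strongly in both directions would otherwise cancel to near zero.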

Key finding: Brand alone explained a large share of the price variance. Apple devices depreciate significantly more slowly: a two-year-old iPhone holds its value better than a similarly specced Android. The depreciation curve is the most interesting shape in the data: non-linear, brand-specific, and poorly captured by linear models.

The Depreciation Curve Visualization

[Chart: depreciation curves by brand. Y-axis: % of launch price (25–100%); x-axis: age in months (0–48). Apple vs. Android, with a steep drop over months 0–12 that flattens after 24 months.]

Depreciation curves by brand. Apple retains ~65% value at 24 months; comparable Android ~45%. Both show non-linear decay that linear models can't capture.
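One convenient functional form for curves like these is exponential decay toward a residual floor, v(t) = f + (1 − f)·e^(−t/τ). A stdlib-only sketch that calibrates τ to the ~65% / ~45% retention figures above; the floor values f are assumptions for illustration, not fitted from the paper's data:

```python
import math

def calibrate_tau(floor, t, retained):
    """Solve retained = floor + (1 - floor) * exp(-t / tau) for tau."""
    return t / math.log((1 - floor) / (retained - floor))

def value_retained(t, floor, tau):
    """Fraction of launch price retained after t months."""
    return floor + (1 - floor) * math.exp(-t / tau)

# Assumed residual floors: Apple bottoms out near 50%, Android near 30%
apple_tau = calibrate_tau(floor=0.50, t=24, retained=0.65)
android_tau = calibrate_tau(floor=0.30, t=24, retained=0.45)

for t in (0, 12, 24, 36):
    print(t, round(value_retained(t, 0.50, apple_tau), 2),
             round(value_retained(t, 0.30, android_tau), 2))
```

By construction both curves pass through 100% at month 0 and hit the observed 24-month retention, and the year-one drop comes out steeper than the year-two drop, matching the non-linearity the tree models exploited.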

What I'd Do Differently Today

This was my first published research. The main lesson: the work split into three months of data collection and cleaning against one month of modeling, and that 3:1 ratio wasn't a mistake. It was the right investment. Clean, well-featured data with a simple XGBoost beat dirty data with a sophisticated neural network every time we tested it.