This post covers the research behind my IEEE paper "Forecasting the Prices using Machine Learning Techniques: Special Reference to used Mobile Phones", published at the Second International Conference on Augmented Intelligence and Sustainable Systems (ICAISS 2023). Read the full paper on IEEE Xplore →
The Problem
The secondhand smartphone market is enormous but opaque. Sellers price phones based on gut feeling, and buyers overpay or miss good deals because there's no transparent pricing model. Sites like eBay and OLX show wildly varying prices for the same model in the same condition. Can machine learning build a reliable price predictor?
The answer is yes — but the real challenge was in the data, not the model.
End-to-End ML Pipeline
Dataset Construction
We collected data on used smartphones across multiple platforms, capturing 13 attributes per listing (12 features plus the target price):
| Feature | Type | Notes |
|---|---|---|
| brand | Categorical | Apple, Samsung, OnePlus, Xiaomi, etc. |
| model | Categorical | High cardinality — grouped by generation |
| ram_gb | Numeric | 4, 6, 8, 12, 16 GB |
| storage_gb | Numeric | 64, 128, 256, 512 GB |
| battery_mah | Numeric | 2000–6000 mAh range |
| camera_mp | Numeric | Primary rear camera MP |
| screen_size_inch | Numeric | 5.0–7.0 inches |
| age_months | Numeric | Months since original release |
| condition | Ordinal | New > Like New > Good > Fair |
| has_5g | Binary | 0/1 |
| os | Categorical | iOS / Android |
| launch_price_usd | Numeric | Original MSRP at launch |
| used_price_usd | Numeric | Target variable |
```python
import pandas as pd
import numpy as np
from scipy import stats

df = pd.read_csv('used_phones.csv')

# Remove statistical outliers (prices > 3 std from mean per model)
df = df.groupby('model').apply(
    lambda x: x[np.abs(stats.zscore(x['used_price_usd'])) < 3]
).reset_index(drop=True)

# Condition: ordinal encoding (order matters)
condition_map = {'New': 4, 'Like New': 3, 'Good': 2, 'Fair': 1}
df['condition_score'] = df['condition'].map(condition_map)

print(f"Final dataset: {len(df)} listings, {df['brand'].nunique()} brands")
```
Feature Engineering
Raw features aren't always the most predictive. Four engineered features improved model performance significantly:
```python
def engineer_features(df):
    # 1. Depreciation ratio: how fast does this brand lose value?
    # Captures brand premium and perceived durability. Averaged per
    # brand so individual target values don't leak directly into the
    # feature (in production, compute this on the training split only).
    ratio = df['used_price_usd'] / df['launch_price_usd']
    df['depreciation_ratio'] = ratio.groupby(df['brand']).transform('mean')
    # Apple ~0.65 (holds value), budget Android ~0.25

    # 2. Price-per-GB storage: normalizes across storage tiers
    df['price_per_gb'] = df['launch_price_usd'] / df['storage_gb']

    # 3. Age buckets: phone depreciation is non-linear --
    # drops steeply in year 1, flattens after year 2
    df['age_bucket'] = pd.cut(
        df['age_months'],
        bins=[0, 6, 12, 24, float('inf')],
        labels=['0-6mo', '6-12mo', '1-2yr', '2yr+']
    )

    # 4. Flagship flag: premium models depreciate differently
    flagship_models = ['iPhone 14 Pro', 'iPhone 15', 'Galaxy S23 Ultra',
                       'Galaxy S24', 'Pixel 8 Pro']
    df['is_flagship'] = df['model'].isin(flagship_models).astype(int)
    return df

df = engineer_features(df)
```
Models Evaluated
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

# One-hot encode brand (only the Apple/Samsung columns are used below)
df = pd.concat([df, pd.get_dummies(df['brand'], prefix='brand')], axis=1)

feature_cols = ['ram_gb', 'storage_gb', 'battery_mah', 'camera_mp',
                'age_months', 'condition_score', 'has_5g',
                'launch_price_usd', 'depreciation_ratio',
                'price_per_gb', 'is_flagship',
                'brand_Apple', 'brand_Samsung']  # OHE brands

X = df[feature_cols]
y = df['used_price_usd']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1),
    'Decision Tree': DecisionTreeRegressor(max_depth=8, random_state=42),
    'Random Forest': RandomForestRegressor(n_estimators=200, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(random_state=42),
    'XGBoost': xgb.XGBRegressor(
        n_estimators=500,
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        random_state=42
    ),
    # SVR is scale-sensitive: standardize features first
    'SVR': make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100, gamma=0.1)),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    results[name] = {
        'R2': r2_score(y_test, preds),
        'MAE': mean_absolute_error(y_test, preds),
        'RMSE': np.sqrt(mean_squared_error(y_test, preds))
    }

results_df = pd.DataFrame(results).T.sort_values('R2', ascending=False)
print(results_df)
```
Results
| Model | R² Score | MAE (USD) | RMSE (USD) |
|---|---|---|---|
| XGBoost | 0.91 | $28.4 | $41.2 |
| Random Forest | 0.88 | $33.1 | $48.7 |
| Gradient Boosting | 0.87 | $35.6 | $51.3 |
| Decision Tree | 0.79 | $44.2 | $64.8 |
| SVR | 0.76 | $48.7 | $69.4 |
| Ridge | 0.68 | $57.3 | $80.1 |
| Linear Regression | 0.65 | $61.0 | $84.7 |
Linear regression underperformed because phone depreciation is non-linear — price drops steeply in year one and flattens after year two. Tree-based models capture this naturally through recursive splits on age_months.
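To see the effect in isolation, here is a small self-contained sketch (synthetic data, not the paper's dataset) comparing a straight-line fit to a depth-limited tree on an exponential depreciation curve:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

# Synthetic depreciation: price decays exponentially with age plus noise
rng = np.random.default_rng(0)
age = rng.uniform(0, 48, 500).reshape(-1, 1)               # months
price = 800 * np.exp(-0.06 * age.ravel()) + rng.normal(0, 20, 500)

lin = LinearRegression().fit(age, price)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(age, price)

print(f"linear R2: {r2_score(price, lin.predict(age)):.3f}")
print(f"tree   R2: {r2_score(price, tree.predict(age)):.3f}")
# The tree's step-wise splits track the curve; a single line cannot.
```

On data like this the tree's piecewise-constant fit follows the steep early drop and the later plateau, while the linear model systematically over- and under-shoots at the extremes.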
SHAP Feature Importance
SHAP (SHapley Additive exPlanations) revealed which features actually drove predictions — not just model coefficients:
```python
import shap

explainer = shap.TreeExplainer(models['XGBoost'])
shap_values = explainer.shap_values(X_test)

# Top features by mean |SHAP value|
shap.summary_plot(shap_values, X_test, plot_type="bar")
# Results (approximate ranking):
# 1. launch_price_usd   -- 0.38 (most important)
# 2. ram_gb             -- 0.22
# 3. age_months         -- 0.19
# 4. depreciation_ratio -- 0.14
# 5. brand_Apple        -- 0.11
# 6. condition_score    -- 0.09
# ...

# Partial dependence: age vs price
shap.dependence_plot('age_months', shap_values, X_test)
# Reveals: steep SHAP drop from 0-12 months, plateau after 24 months
```
Key finding: Brand alone explained a large portion of price variance. Apple devices depreciate significantly more slowly: a two-year-old iPhone holds its value better than a similarly specced Android phone. The depreciation curve is the most interesting shape in the data: non-linear, brand-specific, and poorly captured by linear models.
The Depreciation Curve Visualization
*Depreciation curves by brand. Apple retains ~65% of launch price at 24 months; comparable Android devices ~45%. Both show non-linear decay that linear models can't capture.*
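The chart itself isn't reproduced here, but the curve shapes can be sketched with a simple exponential-decay model. The decay rates below are back-solved from the ~65% / ~45% retention figures, not fitted to the paper's data:

```python
import numpy as np

# Back-solve decay rate k from retention r at 24 months: r = exp(-k * 24)
def decay_rate(retention_at_24mo: float) -> float:
    return -np.log(retention_at_24mo) / 24

k_apple = decay_rate(0.65)    # ~0.018 per month
k_android = decay_rate(0.45)  # ~0.033 per month

months = np.arange(0, 37)
apple_curve = np.exp(-k_apple * months)      # fraction of launch price
android_curve = np.exp(-k_android * months)

print(f"Apple retention at 24mo:   {apple_curve[24]:.0%}")
print(f"Android retention at 24mo: {android_curve[24]:.0%}")
# Plotting these curves against months reproduces the non-linear shape:
# steep early drop, flattening after year two.
```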
What I'd Do Differently Today
- Live pricing data — scrape continuously and build an online learning model rather than a static snapshot. Used phone prices shift with new model releases.
- NLP features from listing text — "minor scratches" vs "perfect condition" carries strong price signal. A fine-tuned sentence-BERT encoder on listing descriptions would add meaningful features.
- Bayesian hyperparameter optimization — we used grid search; Optuna or Ray Tune would have found better XGBoost params faster.
- Stacked ensemble — XGBoost + Random Forest + Ridge stacked would likely push R² above 0.93.
- Seller behavior features — days listed, number of views, seller rating. These predict willingness-to-negotiate, not just fair market value.
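The stacking idea above can be sketched with scikit-learn's StackingRegressor. This is a toy illustration on synthetic data, with GradientBoostingRegressor standing in for XGBoost so the example needs only scikit-learn:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (RandomForestRegressor,
                              GradientBoostingRegressor,
                              StackingRegressor)
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=1000, n_features=10, noise=10,
                       random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# Base learners' out-of-fold predictions feed a Ridge meta-model
stack = StackingRegressor(
    estimators=[
        ('rf', RandomForestRegressor(n_estimators=100, random_state=42)),
        ('gb', GradientBoostingRegressor(random_state=42)),
    ],
    final_estimator=Ridge(alpha=1.0),
)
stack.fit(X_tr, y_tr)
print(f"stacked R2: {r2_score(y_te, stack.predict(X_te)):.3f}")
```

The meta-model learns how much to trust each base learner, which is why a stacked ensemble usually matches or beats its best component.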
This was my first published research. The main lesson: three months of data collection and cleaning, one month of modeling. The 3:1 ratio wasn't a mistake — it was the right investment. Clean, well-featured data with a simple XGBoost beat dirty data with a sophisticated neural network every time we tested it.