📈 Linear Regression & Correlation Calculator
Enter paired x,y data to get the best-fit line, Pearson r, and R² — then predict new y values.
How a Real Estate Analyst Used Linear Regression to Predict Apartment Prices in Pune
Neha Sharma had a problem that many junior analysts face: her manager handed her a spreadsheet with 40 rows of apartment data — carpet area in square feet versus listed price in lakhs — and asked her to build a "predictive model" by end of day. Neha had no machine learning background, no Python, and no R. What she did have was a linear regression formula she half-remembered from her MBA statistics course and, eventually, a tool that crunched the numbers in under a second.
What followed was a masterclass in why simple statistical tools, applied correctly, can outperform complicated models when your data is genuinely linear and your audience needs to understand the math.
What Linear Regression Actually Does
At its core, linear regression finds the straight line that best fits a scatter of data points. That line is described by the equation y = mx + b, where m is the slope — how much y changes per unit increase in x — and b is the y-intercept, the predicted value of y when x is zero.
The "best fit" is defined mathematically using the Ordinary Least Squares (OLS) method: you minimize the sum of the squared vertical distances between each actual data point and the corresponding point on the predicted line. These vertical gaps are called residuals, and squaring them keeps positive and negative gaps from cancelling out while penalizing large errors more heavily than small ones.
The formulas that drive the calculation are clean and deterministic:
- Slope (m): m = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / Σ[(xᵢ − x̄)²]
- Intercept (b): b = ȳ − m·x̄
- Pearson r: r = Sxy / √(Sxx · Syy), where Sxy = Σ[(xᵢ − x̄)(yᵢ − ȳ)], Sxx = Σ[(xᵢ − x̄)²], and Syy = Σ[(yᵢ − ȳ)²]
- R²: for a single-predictor line like this one, simply r², expressed as a proportion from 0 to 1
No guesswork. No iteration. Feed in your data and the answer is exact.
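As a rough illustration, the four formulas above translate almost line for line into Python. This is a minimal sketch, not the calculator's actual code: the function name fit_line and the five sample flats are made up for the example and are not Neha's data.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (slope, intercept, Pearson r, R^2) for paired x, y data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of squares and cross-products around the means
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary for r to be defined")
    m = sxy / sxx                 # slope
    b = y_bar - m * x_bar         # intercept
    r = sxy / sqrt(sxx * syy)     # Pearson correlation
    return m, b, r, r * r

# Hypothetical sample: carpet area (sq ft) and listed price (lakhs)
areas = [600, 750, 900, 1050, 1200]
prices = [55, 62, 74, 80, 92]
m, b, r, r2 = fit_line(areas, prices)
print(f"y = {m:.4f}x + {b:.2f}, r = {r:.3f}, R^2 = {r2:.3f}")
```

Because everything is a closed-form sum, running it twice on the same data always gives the same line.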
The Pearson Correlation Coefficient: The Signal Before the Line
Before Neha even plotted her data, she looked at the Pearson correlation coefficient — a number between −1 and +1 that tells you how tightly paired x and y values move together in a straight-line fashion.
An r of +1 means perfect positive correlation: every point lies exactly on an upward-sloping straight line. An r of −1 means perfect negative correlation, with every point on a downward-sloping line. An r near 0 suggests little to no linear relationship, though it does not rule out curved or other non-linear patterns.
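A small sketch makes the last point concrete. The helper pearson_r and both toy datasets below are invented for illustration: one set of points sits exactly on a line, the other exactly on a parabola that is symmetric about x = 0.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
line = [2 * x + 1 for x in xs]     # points exactly on an upward line
parabola = [x * x for x in xs]     # points exactly on a symmetric curve

r_line = pearson_r(xs, line)       # perfect linear relationship
r_curve = pearson_r(xs, parabola)  # perfect relationship, but not linear
print(r_line, r_curve)
```

The parabola is completely predictable from x, yet its r comes out at zero, which is why a scatter plot is worth a look before trusting r alone.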
Neha's dataset returned r = 0.87. That is considered a strong positive linear relationship. In plain English: larger apartments reliably cost more, and the relationship is close enough to linear that a straight-line model is a valid starting point.
R-Squared: How Much of the Story Does X Tell?
Neha's R² came out to 0.757, meaning about 75.7% of the variance in apartment prices could be explained purely by carpet area. The remaining 24.3% was influenced by other factors her dataset did not include — floor number, building age, proximity to a metro station, amenities, and so on.
This is an important nuance that trips up many first-time users. R² does not measure accuracy in absolute rupees. It measures how much of the variation in y is captured by the model. A high R² is encouraging, but it does not mean every individual prediction is close, especially near the edges of your data range and beyond them, where the fitted line is least certain.
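To see "share of variation explained" as arithmetic, R² can also be computed as 1 minus the ratio of leftover (residual) variation to total variation. The sketch below assumes that standard definition; the function name r_squared and the sample flats are hypothetical.

```python
def r_squared(xs, ys):
    """R^2 as 1 - SS_res / SS_tot for a least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    # Variation left over after the line has done its work
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its own mean
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical sample: carpet area (sq ft) and listed price (lakhs)
areas = [600, 750, 900, 1050, 1200]
prices = [55, 62, 74, 80, 92]
r2 = r_squared(areas, prices)

# Neha's figures from the article: r = 0.87 squares to about 0.757
neha_r2 = round(0.87 ** 2, 3)
print(round(r2, 4), neha_r2)
```

For a one-variable line this gives the same number as squaring r, which is how 0.87 becomes roughly 0.757.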
A common rule of thumb in social and behavioral sciences is that R² above 0.50 is "good," while in physical sciences you might need 0.95 or higher. For real estate within a single city and property type, 0.75 is genuinely useful.
Standard Error of the Estimate: The Honest Confidence Band
The standard error of the estimate (SEE) is often overlooked by beginners, but it is arguably the most practically useful output. It measures the typical size of the residuals, that is, how far actual y values tend to fall from the regression line, in the same units as y.
For Neha's model, the SEE was approximately 8.2 lakhs. That meant her model's predictions were typically within roughly ±8 lakhs of the actual listed price — a meaningful margin in a market where a 3BHK might list at 85 lakhs.
Armed with this, Neha could tell her manager: "We predict this 900 sq ft flat will list at 72.4 lakhs, plus or minus about 8 lakhs." That is a far more honest and useful statement than a bare point prediction.
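For readers who want to see where such a number comes from, here is a minimal sketch of the usual textbook SEE: the square root of the summed squared residuals divided by n − 2. That divisor is an assumption about how the calculator works, and the function name and sample flats are invented for the example.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys):
    """Typical residual size, in the units of y (textbook n - 2 version)."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three points for an n - 2 divisor")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    # Two degrees of freedom are used up estimating the slope and intercept
    return sqrt(ss_res / (n - 2))

# Hypothetical sample: carpet area (sq ft) and listed price (lakhs)
areas = [600, 750, 900, 1050, 1200]
prices = [55, 62, 74, 80, 92]
see = standard_error_of_estimate(areas, prices)
print(f"SEE = {see:.2f} lakhs")
```

The result is in lakhs because the prices are in lakhs, which is what makes it easy to quote next to a prediction.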
Building Intuition Through the Slope
The slope in Neha's model was m ≈ 0.058. Carpet area was measured in square feet and price in lakhs, so this meant: every additional square foot of carpet area adds approximately ₹5,800 to the predicted price. This is an immediately understandable number that a buyer, a broker, or a manager can internalize and sanity-check against their experience of the local market.
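The unit conversion is worth spelling out. The sketch below uses only the rounded figures quoted in this article. The intercept is not stated anywhere in the text; it is back-calculated here from the 900 sq ft prediction of 72.4 lakhs, so treat it as an inference from rounded numbers rather than the model's true value.

```python
LAKH = 100_000  # rupees in one lakh

slope_lakh_per_sqft = 0.058  # slope quoted in the article

# 0.058 lakh per sq ft expressed in rupees per sq ft
rupees_per_sqft = slope_lakh_per_sqft * LAKH

# Assumption: intercept implied by the quoted prediction (900 sq ft -> 72.4 lakhs)
implied_intercept = 72.4 - slope_lakh_per_sqft * 900

def predict_price_lakh(area_sqft):
    """Predicted listing price in lakhs under the inferred equation."""
    return slope_lakh_per_sqft * area_sqft + implied_intercept

print(f"Rs {rupees_per_sqft:,.0f} per extra sq ft")
print(f"implied intercept: {implied_intercept:.1f} lakhs")
print(f"1,000 sq ft flat: {predict_price_lakh(1000):.1f} lakhs")
```

Going from 900 to 1,000 sq ft adds 100 × 0.058 = 5.8 lakhs to the prediction, which is the kind of check a broker can do in their head.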
This is one of linear regression's great advantages over black-box models: the output is a human-readable equation. You can print it, explain it in a meeting, and hand it to someone who has never heard of statistics.
Where the Model Breaks Down — and Why That Is Fine
Linear regression assumes a genuinely linear relationship. If your data curves (say, income vs. expenditure on luxury goods, which often has a non-linear bend), then a straight line will systematically underpredict in some regions and overpredict in others. Plotting your residuals (actual minus predicted) is a quick sanity check: if residuals show a curved pattern rather than random scatter, you may need a log transformation or a polynomial term.
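Here is what that residual pattern looks like on deliberately curved toy data. The points follow y = x² exactly, an assumption chosen so the effect is obvious; the helper name residuals_of_line_fit is made up for this sketch.

```python
def residuals_of_line_fit(xs, ys):
    """Fit a least-squares line and return actual minus predicted for each point."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = y_bar - m * x_bar
    return [y - (m * x + b) for x, y in zip(xs, ys)]

xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x * x for x in xs]          # curved on purpose

res = residuals_of_line_fit(xs, ys)
for x, e in zip(xs, res):
    # '+' means the line predicted too low, '-' too high
    sign = "+" if e > 0 else "-" if e < 0 else "0"
    print(x, sign, round(e, 2))
```

The residuals are positive at both ends and negative in the middle. That U shape, rather than a random mix of signs, is the signal that a straight line is the wrong form.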
It also assumes that errors are roughly constant across all x values (homoscedasticity). In real estate, variance in prices tends to increase at higher price points — a ₹3 crore flat is more idiosyncratic than a ₹50 lakh flat. Violating this assumption does not break the model outright, but it means predictions at the high end carry more uncertainty than the SEE suggests.
Finally, regression does not imply causation. Area causes price in some intuitive sense, but you could run the same formula backwards (regress area on price) and get an equally valid fit whose line is, unless r is exactly ±1, not simply the first line rearranged. The directional causal story must come from domain knowledge, not from the math.
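The asymmetry between the two directions is easy to check numerically. In this sketch the five data points and the function name slopes_both_ways are invented for illustration; the identity it relies on (the two slopes multiply to r²) follows directly from the slope and r formulas given earlier.

```python
def slopes_both_ways(xs, ys):
    """Return (slope of y on x, slope of x on y, r squared)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m_y_on_x = sxy / sxx              # predict y from x
    m_x_on_y = sxy / syy              # predict x from y
    r2 = (sxy * sxy) / (sxx * syy)
    return m_y_on_x, m_x_on_y, r2

# Toy data with some scatter
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
m_yx, m_xy, r2 = slopes_both_ways(xs, ys)

# If the two fits described the same line, this would equal m_yx
inverted = 1 / m_xy
print(m_yx, inverted, r2)
```

With this data the forward slope is 0.8, while inverting the backward fit gives 1.25. The two only agree when r² is exactly 1, so the math by itself cannot tell you which variable drives which.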
Practical Uses Across Fields
The same technique that Neha used for apartments underpins a remarkable range of real-world analyses:
- Agriculture: Predicting crop yield from rainfall data across seasons.
- Healthcare: Estimating a patient's resting heart rate from body weight or age.
- Finance: Fitting a stock's beta coefficient (sensitivity to market movements) using monthly return pairs.
- Manufacturing: Predicting machine output rate from operating temperature or maintenance interval.
- Marketing: Modeling how ad spend in rupees maps to conversions, within a budget range where diminishing returns have not yet kicked in.
In each case, the workflow is the same: collect paired observations, verify that the scatter plot looks roughly linear, run the regression, check r and R², inspect the SEE, and interpret the slope in domain-specific language.
Neha's Result — and the Lesson
By end of day, Neha had a one-page summary showing the best-fit equation, correlation strength, explanatory power, and a lookup table for key apartment sizes. Her manager ran a few predictions on apartments they were about to appraise and found the model was off by no more than 9 lakhs on most units — well within the client's decision threshold.
The deeper lesson was not about statistics. It was about knowing which tool is right for which job. Linear regression is not the most powerful model ever built. But it is transparent, fast, auditable, and in cases where the linear assumption holds, it is often accurate enough to be the last model you need.
That combination — speed, interpretability, and reliable accuracy on well-behaved data — is why linear regression has remained a foundational tool for over two centuries, from Gauss and Legendre fitting planetary orbits to Neha fitting apartment prices in Pune.