Andhra Pradesh Legislative Assembly Elections and Estimation Theory

The 2019 Andhra Pradesh Legislative Assembly elections coincided with the Indian general election. The incumbent TDP lost to the YSR Congress Party (YSRCP), which secured 86% of the Assembly: YSRCP won 151 of 175 seats (gaining 84), while TDP lost 79. Five parties contended: YSRCP, TDP, Janasena (JSP), INC, and BJP. Though a YSRCP win was predictable, the election operated under a different dynamic from what many economists, analysts, and opinion polls expected.

Many political pundits predicted that JSP would shift the polling momentum. Both TDP and YSRCP were confident before the results were announced. YSRCP won decisively, leaving many awestruck. Inferring election and polling parameters from heterogeneous data is tough. This post is motivated by the 2019 Andhra Pradesh Legislative Assembly elections and by the political commentary of Prof. K Nageshwar[1], a renowned economist and political analyst.

In nearly all pre-election polls, YSRCP had a slight edge over TDP. JSP was seen as the fulcrum: a source of randomness that could shift the momentum toward either YSRCP or TDP and turn the contest into a tighter battle between the two. BJP and INC were brushed off by all the polls. The near consensus was that YSRCP would win, but only narrowly.

The contrast between the poll predictions and the actual 2019 results was so dramatic that it surprised many pollsters. The poll average showed YSRCP gained a staggering 22.07 percentage points of the popular vote. The opinion polls, analyses, and forecasts ran throughout the year, and the difference between the actual and polled results cannot be explained by the margin of error. Why were they inaccurate? How would one improve their accuracy?

We answer both questions by establishing a framework based on estimation theory. Let \( D \) be the number of demographic groups available for sampling, and let \( \theta_d \) be the fraction of people from the \( d^{th} \) demographic group who eventually vote for YSRCP. The population of voters (from past data) is large enough that we can treat each voter from the \( d^{th} \) group as voting for YSRCP with probability \( \theta_d \). Assume voting decisions are independent most of the time, with a cross tide modeled by a point process (such as a general Poisson process).

For the opinion poll, suppose the pollster randomly selects \( n \) people and asks for their opinions. Let \( R_{i}^{d} = 1 \) if voter \( i \) (where \( 1 \leq i \leq n \)) from the \( d^{th} \) group votes for YSRCP, and 0 otherwise; \( R \) denotes a random variable. A corresponding test response variable \( T_{i}^{d} \) shadows \( R_{i}^{d} \) and records the opinion that the \( i^{th} \) respondent from the \( d^{th} \) group reports when polled. Let \( C_{i}^{d} \) denote the cost associated with this testing variable. For example, the low-cost path would be to take an SMS poll or ask for an opinion over the phone. A high-cost path would be to cross-check the \( i^{th} \) person's opinion over the phone against their affiliations and posts on social media.

If the \( i^{th} \) person from the \( d^{th} \) group actually voted for YSRCP, that person may have given a different opinion when polled, with probability \( \beta_{C_i}^{d} \). Conversely, if a polled person actually ended up voting for TDP, that person may have given a different opinion earlier, with probability \( \gamma_{C_i}^{d} \). Both probabilities are functions of the cost \( C_{i}^{d} \).

Since \( T_{i}^{d} \) depends on \( R_{i}^{d} \), we have \( T_{i}^{d} = 1 \) with probability \( (1-\beta_{C_i}^{d})\theta_d + (1-\theta_d)\gamma_{C_i}^{d} = P_1 \). Assuming the \( d^{th} \) demographic group constitutes a fraction \( p_d \) of the Andhra Pradesh voter population, we wish to estimate \( \theta = \sum_{d=1}^{D} p_{d}\theta_{d} \). Applying the Cramér–Rao bound to the estimation of \( \theta_d \), we get the Fisher information \( (I_{\theta_{d}}) \) provided by the \( i^{th} \) person from the \( d^{th} \) group as:
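To make the response model concrete, here is a minimal sketch of how the reported opinion \( T \) relates to the true vote \( R \). The function names, the seed, and the example values of \( \theta_d \), \( \beta \), and \( \gamma \) are illustrative assumptions, not figures from any actual poll.

```python
import random


def response_probability(theta, beta, gamma):
    """P_1 = P(T = 1): probability that a polled person reports YSRCP."""
    return (1.0 - beta) * theta + (1.0 - theta) * gamma


def simulate_poll(theta, beta, gamma, n, seed=0):
    """Draw n true votes R and noisy reports T; return the share with T = 1."""
    rng = random.Random(seed)
    reported = 0
    for _ in range(n):
        r = 1 if rng.random() < theta else 0           # true vote R
        flip_prob = beta if r == 1 else gamma           # misreport probability
        t = 1 - r if rng.random() < flip_prob else r    # reported opinion T
        reported += t
    return reported / n


# Hypothetical values: half the group votes YSRCP, but 20% of them hide it,
# while 5% of the others wrongly report YSRCP.
p1 = response_probability(theta=0.5, beta=0.2, gamma=0.05)
share = simulate_poll(theta=0.5, beta=0.2, gamma=0.05, n=100_000)
print(f"P_1 = {p1:.3f}, simulated share of T = 1: {share:.3f}")
```

With these made-up rates the raw poll share sits near \( P_1 \), well below the true \( \theta_d \), which is the kind of distortion the rest of the analysis tries to account for.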

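$$ I_{\theta_{d}} = \frac{\left(\frac{\partial (1-P_1)}{\partial\theta_d}\right)^2}{1-P_1} + \frac{\left(\frac{\partial P_1}{\partial\theta_d}\right)^2}{P_1} $$

Writing \( P_1 = (1-\beta_{C_i}^{d}-\gamma_{C_i}^{d})\theta_d + \gamma_{C_i}^{d} \), so that \( \frac{\partial P_1}{\partial\theta_d} = 1-\beta_{C_i}^{d}-\gamma_{C_i}^{d} \), this becomes:

$$ I_{\theta_{d}} = \frac{(1-\beta_{C_i}^{d}-\gamma_{C_i}^{d})^2}{\left((1-\beta_{C_i}^{d}-\gamma_{C_i}^{d})\theta_d+\gamma_{C_i}^{d}\right)\left(1-(1-\beta_{C_i}^{d}-\gamma_{C_i}^{d})\theta_d-\gamma_{C_i}^{d}\right)} $$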

The Fisher information provided by the \( d^{th} \) demographic group is the sum over the sample, \( F_{\theta_d} = \sum_{i=1}^{n}I_{\theta_{d}} \). It can be seen from this result that the maximum information per respondent, \( \frac{1}{\theta_d - \theta_d^2} \), is achieved when \( \gamma_{C_i}^{d} = \beta_{C_i}^{d} = 0 \).
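As a quick numerical check of the per-respondent Fisher information, the sketch below evaluates \( (\partial P_1/\partial\theta_d)^2 / (P_1(1-P_1)) \) with and without misreporting. The function name and the numeric inputs are assumptions chosen only for illustration.

```python
def fisher_information(theta, beta, gamma):
    """Fisher information about theta carried by one noisy binary response."""
    slope = 1.0 - beta - gamma       # dP_1 / dtheta
    p1 = slope * theta + gamma       # P(T = 1)
    return slope ** 2 / (p1 * (1.0 - p1))


theta = 0.5
clean = fisher_information(theta, beta=0.0, gamma=0.0)    # honest responses
noisy = fisher_information(theta, beta=0.2, gamma=0.05)   # hypothetical misreporting
print(f"clean: {clean:.3f}, noisy: {noisy:.3f}")
```

With no misreporting the value matches \( 1/(\theta_d - \theta_d^2) \); any misreporting in this example lowers it, so more respondents are needed for the same bound.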

To improve the accuracy of polling, we need to find optimal values for the sample size and the cost to be spent. Minimizing the mean-squared error given by the Cramér–Rao bound on \( \theta \), with a budget \( B \) and \( D \) demographic groups, we get a basic optimization problem:

$$ \min_{n,\, C_i^d} \sum_{d=1}^{D} \frac{p_d^2}{F_{\theta_d}} $$

An interesting result of this analysis is that, contrary to popular opinion, the polling budget for a demographic group should be directly proportional to the cost of obtaining Fisher information from that group, not to the size of the group. That is, the higher the cost of obtaining one respondent's worth of Fisher information from a group, the higher the budget allocation for that group. We can now reason why data from various opinion polls, though numerous, failed to sense the outcome. As the number of data channels grows, a linear combination of them fails to give a decent error bound. The analysis deteriorates when quality data is mixed with heterogeneous, erroneous poll figures.
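The allocation idea can be illustrated with a deliberately simplified version of the problem. Assume each group has a fixed per-sample cost `c_d` and a fixed per-respondent information `I_d`, so that a budget `B_d` buys `B_d / c_d` respondents and the bound becomes \( \sum_d p_d^2 c_d / (B_d I_d) \). A Lagrange multiplier then gives `B_d` proportional to `p_d * sqrt(c_d / I_d)`. This fixed-cost setup, the function names, and the numbers below are assumptions for illustration; it is a sketch, not the full optimization over costs described in the post.

```python
from math import sqrt


def allocate_budget(weights, costs, infos, budget):
    """Split budget so that B_d is proportional to p_d * sqrt(c_d / I_d)."""
    scores = [p * sqrt(c / i) for p, c, i in zip(weights, costs, infos)]
    total = sum(scores)
    return [budget * s / total for s in scores]


def cramer_rao_bound(weights, costs, infos, allocation):
    """sum_d p_d^2 / F_d, with F_d = (B_d / c_d) * I_d."""
    return sum(
        p * p * c / (b * i)
        for p, c, i, b in zip(weights, costs, infos, allocation)
    )


# Hypothetical two-group example: the smaller group is costlier to poll
# and yields less information per respondent.
weights = [0.6, 0.4]
costs = [1.0, 4.0]
infos = [4.0, 1.0]
budget = 100.0

optimal = allocate_budget(weights, costs, infos, budget)
by_size = [budget * p for p in weights]

print("optimal split:", [round(b, 2) for b in optimal])
print("bound (optimal):", cramer_rao_bound(weights, costs, infos, optimal))
print("bound (by size):", cramer_rao_bound(weights, costs, infos, by_size))
```

In this toy setting the smaller but more expensive group receives the larger share of the budget, and the resulting bound is lower than under a split by group size alone.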

Now that we have answered the first question, why the existing polls were inaccurate, let us move on to the corrective approach. Here is an optimal strategy to reduce the inaccuracy caused by misrepresented polling responses and other heterogeneous, noisy data. With \( B_d \) denoting the budget allocated to the \( d^{th} \) group and \( c_d \) the per-sample cost in that group, the mean-squared error of the estimate is

$$ \left(\sum_{d=1}^{D}p_d\left(\gamma_{C_i}^{d}-\theta_d(\gamma_{C_i}^{d}+\beta_{C_i}^{d})\right)\right)^2 + \sum_{d=1}^{D}\frac{p_d^2\,(c_d\theta_d - \theta_d^2c_d)}{B_d} $$

Given the budgets \( B_d \), we can solve for the optimal value of \( C_{i}^{d} \) by minimizing this error over all allocations that satisfy \( \sum_{d=1}^{D}B_d \leq B \). The optimal number of samples follows directly from this minimization problem. We plug the values of \( C_{i}^{d} \) that we get, in terms of \( \theta_d \), into the solution of the optimization problem to get the best strategy for accurate polling with a specified budget \( B \).
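The trade-off between bias and variance in the mean-squared error can be explored numerically. The sketch below evaluates the squared-bias and variance terms and grid-searches the per-sample cost for a single group. The error model `misreport_rates`, in which both misreport probabilities fall as `base_error / cost`, is purely hypothetical, as are the function names and all numeric values.

```python
def misreport_rates(cost, base_error=0.3):
    """Hypothetical error model: beta = gamma = base_error / cost, for cost >= 1."""
    rate = base_error / cost
    return rate, rate


def poll_mse(weights, thetas, costs, budgets):
    """Squared bias plus variance of the weighted poll estimate."""
    bias = 0.0
    variance = 0.0
    for p, theta, c, b in zip(weights, thetas, costs, budgets):
        beta, gamma = misreport_rates(c)
        bias += p * (gamma - theta * (gamma + beta))
        variance += p * p * c * theta * (1.0 - theta) / b
    return bias ** 2 + variance


def best_cost(theta, budget, candidates):
    """Grid search over per-sample costs for a single group with weight 1."""
    return min(candidates, key=lambda c: poll_mse([1.0], [theta], [c], [budget]))


candidates = [1.0 + 0.5 * k for k in range(39)]   # 1.0, 1.5, ..., 20.0
small = best_cost(theta=0.6, budget=1_000.0, candidates=candidates)
large = best_cost(theta=0.6, budget=100_000.0, candidates=candidates)
print(f"best per-sample cost at B=1e3: {small}, at B=1e5: {large}")
```

Under this assumed error model the selected per-sample cost rises with the budget but far more slowly than the budget itself (roughly like its cube root in this toy setup), so most of the extra money goes into additional samples.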

The beauty of this result is that it shows the relation between the total budget and the cost spent on sample acquisition to be sublinear (of order less than 1/2). As the total budget increases, our quality sampling method spends an amount on the order of the square root of the budget, while more than doubling the sample size. In contrast with existing methods, this finding shines even more when there is no distorted data, that is, when \( \gamma_{C_i}^{d} \) and \( \beta_{C_i}^{d} \) become 0: we do not need to keep spending on the samples as the budget grows without bound. We can also learn the polling saturation point, beyond which the information does not change appreciably with additional data channels.

Looking back at 2019, when various psephologists predicted numbers that were far from reality, and considering the not-so-modest budgets behind those predictions, I can only hope the turn toward optimality is made soon.