Saturday 19 June 2021

FEH VG predictor continued: wave pattern and early estimation

Building a model for VG is something that I have wanted to do for a long time. In the previous article I wrote about the basics of a VG model, and that article concluded with the chart below:

The perfect wave in the first 12 hours caught my eye: is it a coincidence or a general phenomenon? The aim of this article is to look for further patterns that help us build the model. Before we start, recall the terms that I used in the previous article; please refer to it for further details.

- Three examples, all extracted from VG June 2021. You can extract the raw data from the Japanese predictor made by @rammtiger_n.

Example 1: Final (Popularity ratio >4)
Example 2: Quarterfinal 4 (Popularity ratio 1~2)
Example 3: Semifinal 2 (Popularity ratio close to 1)

- Parameter $k$: the parameter such that the accumulated score is of order $O(t^{2+k})$, or equivalently that the team's activity is of order $O(t^{1+k})$. To be more precise, for team $i$ ($i = 1,2$) define $c_i(t)$ to be the constant factor, which scales with team size and switches between three values according to the state of the hour, and $f_i(t)$ to be the corresponding hour multiplier (either $1.05+0.05t$ or $3.2+0.2t$). Ignoring intraday variation, we assume that the team activity $A_i(t)$ is approximated by $c_i(t) f_i(t)^{1+k}$.

Note that this parameter is not necessarily the same for the two teams, but the two values are close enough most of the time. For now, let us assume that $k$ is uniform across the two teams.
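As a minimal sketch of the activity model above (this is not code from the original spreadsheet): the function names and the `bonus` flag are hypothetical, and the sketch assumes that $3.2+0.2t$ is the multiplier in a bonus hour and $1.05+0.05t$ the multiplier otherwise.

```python
def hour_multiplier(t, bonus):
    """Hour multiplier f_i(t) at hour t.

    Assumption: 3.2 + 0.2t applies in a bonus hour, 1.05 + 0.05t otherwise.
    """
    return 3.2 + 0.2 * t if bonus else 1.05 + 0.05 * t


def modeled_activity(t, c, k, bonus=False):
    """Modeled team activity A_i(t) = c_i(t) * f_i(t) ** (1 + k)."""
    return c * hour_multiplier(t, bonus) ** (1 + k)


# With k = 1 the modeled activity is quadratic in the hour multiplier.
for t in (1, 10, 40):
    print(t, modeled_activity(t, c=1.0, k=1.0))
```

Here `c` stands for whichever of the three values $c_i(t)$ takes in that hour; choosing it is left to the caller.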


The chart shown at the beginning is what happened in example 3. The curves are easy to spot because it is a perfect ping-pong in which the activity of the two teams is almost equal. Whenever one team is in the excited state, the other must be in the post-excitement state, as it is exhausted from the bonus multiplier in the previous hour. As a result we find two perfect curves with alternating dots, one for the activity in the excited state and the other for the activity in the post-excitement state.

We do not have a perfect ping-pong most of the time, so is there a way to extract such a trend if it exists? One approach is to assign a factor to each of the three states: we may assume that the normalized activity in the excited state is 10 times the normal activity and 100 times the post-excitement activity. Although we can explain this by the fact that flags come in a multiplier of 100, such a ratio is still affected by the parameter $k$, which we do not want to fix.

There is a smarter way around this: observe that the states of the two teams are almost always excited + post-excitement or normal + normal. On rare occasions they could be normal + excited or normal + post-excitement, but these always cancel out. Therefore we can simply take the (geometric) average of the (normalized) activity to retrieve the trend!

Mathematically, we first guess the parameter to be $k_0$. We then normalize the activity by considering $A_i(t)/f_i(t)^{1+k_0} \approx c_i(t) f_i(t)^{k-k_0}$. Taking the geometric mean over the two teams, we have that
$GM = (c_1(t)c_2(t))^{1/2}(f_1(t)f_2(t))^{(k-k_0)/2}$.
If we are in either the excited + post-excitement or the normal + normal states, then $\sqrt{c_1(t)c_2(t)} = c_S$ is a constant. Since $f_1(t)f_2(t)$ is always $\Theta (t^2)$, the geometric mean is constant (or regresses to a constant) if and only if $k=k_0$, i.e., if the estimated parameter $k_0$ matches the true parameter. We take $\log GM$ instead of $GM$ to even out the impact of normal + excited states against normal + post-excitement states.
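The normalization can be sketched in a few lines of Python. The function name and all the numbers below are made up for illustration (they are not VG data), and the sketch again assumes that $3.2+0.2t$ is the bonus-hour multiplier.

```python
import math


def log_geometric_mean(a1, a2, f1, f2, k0):
    """log GM of the two teams' activities, each normalized by f_i ** (1 + k0)."""
    n1 = a1 / f1 ** (1 + k0)
    n2 = a2 / f2 ** (1 + k0)
    return 0.5 * (math.log(n1) + math.log(n2))


# Synthetic ping-pong (made-up numbers): true k = 0.9, sqrt(c1 * c2) = 6.
k_true, c1, c2 = 0.9, 4.0, 9.0
for t in (5, 20, 40):
    f1, f2 = 3.2 + 0.2 * t, 1.05 + 0.05 * t  # team 1 in a bonus hour (assumed)
    a1 = c1 * f1 ** (1 + k_true)
    a2 = c2 * f2 ** (1 + k_true)
    # Flat at log(6) when k0 equals the true k; drifts downward when k0 is too large.
    print(t,
          log_geometric_mean(a1, a2, f1, f2, 0.9),
          log_geometric_mean(a1, a2, f1, f2, 1.0))
```

On real data the first column would only be roughly flat, which is why a regression slope is used to judge the guess rather than exact constancy.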

As a demonstration, we calculate the log-geometric mean team activity for example 1 and get the following chart (with the guess $k_0=1$):

We can see a downward trend starting from hour number 8, indicating that the guess $k_0=1$ is an overestimate here.

Again we retrieve the same early wavy pattern as in the first chart. It has a simple explanation: in FEH there are quests to clear, and you need to clear these simple quests to get the (maximum number of) flags. The quests are mostly "clear VG with red/blue/green/colorless unit", but they require you to actually enter VG. On the other hand, you start the event with zero votes, so you cannot do these quests right away. Most people do these quests with votes almost fully restored, which is exactly 4-8 hours into the event.

Now we can estimate $k$ by removing the first 4 hours as outliers and searching for the $k_0$ at which the linear regression returns a zero slope. Since the regressed slope is strictly decreasing in $k_0$, we can always find such a $k_0$.
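One way to sketch this search is a plain bisection on $k_0$, which works because the slope is strictly decreasing in $k_0$. Everything below is an assumption for illustration rather than the spreadsheet's actual method: the function names, the default bracket `[-1, 3]`, and the convention that the inputs are equal-length lists ordered by hour.

```python
import math


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def log_gm_slope(hours, a1, a2, f1, f2, k0):
    """Regression slope of log GM against the hour for a guess k0."""
    log_gm = [
        0.5 * (math.log(x / p ** (1 + k0)) + math.log(y / q ** (1 + k0)))
        for x, y, p, q in zip(a1, a2, f1, f2)
    ]
    return ols_slope(hours, log_gm)


def estimate_k(hours, a1, a2, f1, f2, skip=4, lo=-1.0, hi=3.0, tol=1e-9):
    """Bisect on k0 until the slope is zero, ignoring the first `skip` hours."""
    data = [seq[skip:] for seq in (hours, a1, a2, f1, f2)]
    if log_gm_slope(*data, lo) <= 0 or log_gm_slope(*data, hi) >= 0:
        raise ValueError("zero slope is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if log_gm_slope(*data, mid) > 0:  # slope still positive: k0 too small
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Here `a1`, `a2` are the observed hourly activities and `f1`, `f2` the hour multipliers that applied to each team in that hour.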

If we apply this to example 1, we estimate $k$ to be 0.85:

And if we apply it to example 2, we estimate $k$ to be 1.17:

The wavy pattern seems to be very consistent across all situations: we always observe two peaks, one at hour number 4 (which corresponds to 8 hours into the event, since we removed the first four) and another at hour number 12 (16 hours into the event). We may interpret these as activity peaks from players in different parts of the world. Computationally, the peaks and troughs help us greatly: we can run the same linear regression using only the first two peaks and troughs, i.e., the data of the first 20 hours, and the result is highly correlated with the estimate using all 44 hours of data.

Example 1: $k$ estimated to be 0.8 with the data of hour number 5~20 vs 0.85 on global data

Example 2: $k$ estimated to be 0.92 with the data of hour number 5~20 vs 1.17 on global data

It seems that such an early estimate is always an underestimate, due to (out-of-correlation) increased activity at the far end, but we can always add a little to our estimate.
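The comparison between the early window and the full data can be sketched as follows. This uses a shortcut that is not in the original write-up: the least-squares slope is linear in the response, so the zero-slope $k_0$ has the closed form $1 + k_0 = \text{slope}(\log\sqrt{A_1A_2}) / \text{slope}(\log\sqrt{f_1f_2})$ and no search is needed. The function names are hypothetical, hours are assumed to be counted from the start of the event, and the data in the usage example is synthetic with an artificial late surge.

```python
import math


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def zero_slope_k(hours, a1, a2, f1, f2, first, last):
    """Closed-form k0 that flattens log GM over the hours first..last (inclusive)."""
    rows = [r for r in zip(hours, a1, a2, f1, f2) if first <= r[0] <= last]
    hs = [r[0] for r in rows]
    log_a = [0.5 * (math.log(r[1]) + math.log(r[2])) for r in rows]
    log_f = [0.5 * (math.log(r[3]) + math.log(r[4])) for r in rows]
    return ols_slope(hs, log_a) / ols_slope(hs, log_f) - 1.0


# Synthetic ping-pong with true k = 0.85 and a made-up 30% surge from hour 37 on.
hours = list(range(1, 45))
f1 = [3.2 + 0.2 * t if t % 2 == 0 else 1.05 + 0.05 * t for t in hours]
f2 = [1.05 + 0.05 * t if t % 2 == 0 else 3.2 + 0.2 * t for t in hours]
surge = [1.3 if t >= 37 else 1.0 for t in hours]
a1 = [4.0 * s * f ** 1.85 for s, f in zip(surge, f1)]
a2 = [9.0 * s * f ** 1.85 for s, f in zip(surge, f2)]

early = zero_slope_k(hours, a1, a2, f1, f2, first=5, last=20)
full = zero_slope_k(hours, a1, a2, f1, f2, first=5, last=44)
print(early, full)  # the late surge pushes the full-data estimate above the early one
```

In this toy setup the early window recovers the true $k$, and the surge at the far end is what lifts the full-data estimate, which mirrors the underestimate described above.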


So, what can we do with the predictor now? This is a proposed way of creating a prediction:

- Use the early data to estimate the constant factor for the teams' activity with $k_0=1$
- Predict by combining team activity and state guessing
- Analyze team composition by wave decomposition at hour number 20 and modify $c_i(t)$ accordingly
- After hour number 12 or 20, update $k_0$ by linear regression each time before iterating through the prediction
- ???
- Feathers!

As much as the above is a big and serious discussion, I still prefer participating in the event in a simple way, by guessing the frequency of the bonus hours linearly. A 99% accurate predictor? Sure, but no thanks if I am the one who has to write the code. Not to mention that it is actually quite hard to measure the error in a dynamic system, and we just can't tell in a mathematically rigorous way how accurate our predictor could be...

The charts were not properly imported into Google Drive, but you can plot them easily. Columns L-N are the time-normalized difference and bonus boundaries with $k=1$. Column T is the log-geometric mean of team activities. You can change $k$ as you like at W4 and W5, but by default the $k$ values for the two teams are equal. The three labels are SF2, F and QF4, which correspond to examples 3, 1 and 2 respectively.
