Posted 2026-01-22 | Updated 2026-01-22 | Data Science / Quantitative Trading

10,000x Faster: The Ultimate High-Performance Normal Integration Solution, with Code

In derivatives pricing and large-scale risk backtesting, where every millisecond matters, traditional normal-distribution integration often leads to both unwieldy code and a performance bottleneck. This article introduces a fully vectorized numerical algorithm that exploits the parallelism of modern CPUs, GPUs, and even TPUs to break through that bottleneck.

Posted 2025-11-12 | Updated 2025-11-13 | Data Science

From Chatterjee's Correlation to Tau-Star: Completeness and Power in Independence Testing

In the earlier article Three Measures for Sequence Correlation, I introduced Chatterjee’s correlation coefficient, which can detect nonlinear and non-monotonic associations between variables. It has a simple form, is fast to compute, and is especially well suited to screening input features for nonlinear machine learning models. New research, however, shows that its power is inadequate against some non-functional dependence structures. This article first reviews and expands on the mechanics and key properties of Chatterjee’s \(\xi\). It then introduces an alternative: \(\tau^*\), an equally consistent, distribution-free statistic for testing independence.

Posted 2025-03-26 | Updated 2025-03-26 | Data Science

A Bayesian-Inference Method for Measuring Box Office Outperformance

The film industry is a market with exceptionally high uncertainty, and box office performance directly affects both the financial condition and market valuation of production companies. The degree of box office outperformance is a key indicator of how well a movie is received by the market and an important variable in evaluating film-company performance. Traditional box office forecasting methods, however, often rely on static point-estimate models. They struggle to capture the dynamic evolution of box office revenue over time, and they are even less capable of quantifying how far box office results exceed market expectations and with what level of uncertainty. As a result, they are difficult to convert into actionable investment strategies.

Posted 2024-11-18 | Updated 2024-11-18 | Data Science

Learning to Rank in Brief: From RankNet to LambdaMART

This article gives a concise introduction to the background, taxonomy, and evaluation metrics of learning to rank, then reviews several classic algorithms from its development history. The goal is not to cover every technical detail, but to build an intuitive understanding of the core concepts behind ranking, as preparation for practical LTR applications in different scenarios.

Posted 2024-11-05 | Updated 2025-11-13 | Data Science

Shaputa: A Feature Selection Method That Combines SHAP and Boruta

Shaputa is a hybrid feature selection technique that combines SHAP (SHapley Additive exPlanations) with Boruta’s shadow-feature mechanism. By constructing a random baseline for every feature and comparing it with model-derived SHAP importance, Shaputa produces more robust and structure-aware feature selection results in high-dimensional, complex datasets. It is especially useful in nonlinear settings where traditional methods often struggle.

Posted 2024-08-26 | Updated 2024-08-26 | Data Science / Quantitative Trading

Trends in Large Language Models and Their Applications in Quantitative Investing

Large language models have become the hottest and fastest-moving frontier in computing since ChatGPT burst onto the scene at the end of 2022. LLMs have gradually shed the role of mere tools and become independent carriers of intelligence that can handle tasks on their own. In 2024, hundred-billion-parameter models such as Qwen2 and Llama3.1 were released one after another, and the performance of open-source models has moved steadily closer to that of closed-source systems. For quantitative researchers, a necessary question is how to combine LLM technology with financial applications to improve research and investment. This article first draws on Li Mu’s talk to discuss LLM trends and best practices, and then reviews specific application scenarios for LLMs in quantitative investing.

Posted 2024-01-28 | Updated 2024-01-28 | Data Science

SHAP: The Theoretically Optimal Machine Learning Explanation Algorithm

SHAP (SHapley Additive exPlanations) is a game-theoretic, model-agnostic approach to machine learning interpretability. It can quantify each feature’s contribution to a single prediction while also aggregating local explanations into a global view of the model. SHAP has strong theoretical guarantees and, thanks to substantial engineering optimization, is also practical in real-world workflows. This article introduces the core theory behind SHAP and shows several example visualizations.
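
To make the idea of a per-feature contribution concrete, here is a minimal sketch of exact Shapley values computed by brute force over all feature orderings. It is not the SHAP library's API or its optimized algorithms; the toy model, weights, input, and baseline are hypothetical, and "removing" a feature is assumed to mean replacing it with a fixed baseline value.

```python
from itertools import permutations

def shapley_values(value_fn, n_features):
    """Exact Shapley values: average each feature's marginal
    contribution over every ordering of the features."""
    totals = [0.0] * n_features
    count = 0
    for order in permutations(range(n_features)):
        present = set()
        prev = value_fn(frozenset(present))
        for i in order:
            present.add(i)
            cur = value_fn(frozenset(present))
            totals[i] += cur - prev  # marginal contribution of feature i
            prev = cur
        count += 1
    return [t / count for t in totals]

# Hypothetical toy model: linear terms plus one interaction.
weights = [2.0, -1.0, 0.5]

def model(v):
    linear = sum(w * a for w, a in zip(weights, v))
    return linear + 0.5 * v[0] * v[1]

x = [1.0, 3.0, 4.0]          # the instance being explained (assumed)
baseline = [0.0, 1.0, 2.0]   # reference values for "absent" features (assumed)

def value_fn(subset):
    """Model output when only the features in `subset` take their real values."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(mixed)

phi = shapley_values(value_fn, len(x))
print(phi)
print(sum(phi), model(x) - model(baseline))
```

The two numbers on the last line agree: the contributions sum to the gap between the prediction and the baseline prediction, which is the additivity property that gives SHAP its name. The brute-force loop costs n! model evaluations, so it only works for a handful of features.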

Posted 2023-12-02 | Updated 2023-12-02 | Data Science

Three Measures of Correlation: From Linear to Nonlinear

This article introduces three measures of association between variables: the Pearson, Spearman, and Chatterjee correlation coefficients. The last of these can also capture nonlinear relationships.
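
As a rough illustration of how the three coefficients differ, the sketch below computes each one in plain Python on a noiseless, non-monotonic relationship. The helper names and the toy data are assumptions made for this example, and the Chatterjee formula used is the simple version that assumes no tied values.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Linear correlation of the raw values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Pearson correlation of the ranks (monotonic association)."""
    return pearson(ranks(x), ranks(y))

def chatterjee_xi(x, y):
    """Chatterjee's xi, no-ties formula: sort the pairs by x, then
    penalize large jumps between the ranks of consecutive y values."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    r = ranks([y[i] for i in order])
    jumps = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1 - 3 * jumps / (n * n - 1)

# Toy data: a shifted parabola, chosen so that y has no tied values.
x = list(range(-50, 51))
y = [(3 * v - 1) ** 2 for v in x]

p, s, xi = pearson(x, y), spearman(x, y), chatterjee_xi(x, y)
print(f"Pearson {p:.3f}  Spearman {s:.3f}  Chatterjee {xi:.3f}")
```

On this data, Pearson and Spearman both come out near zero even though y is fully determined by x, while Chatterjee's coefficient is close to 1. Note that xi is asymmetric: it measures how well y can be described as a function of x, not the reverse.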

Posted 2022-10-03 | Updated 2022-10-15 | Data Science

Houyi Project Technical Report - How to Win an Intelligent War in Apex Legends

HOUYI is positioned as a general-purpose aiming assistant for shooter video games. It currently supports Apex Legends (referred to below as apex). Its runtime framework is a two-stage pipeline of detection and tracking: first, object detection identifies enemies on the screen; then the crosshair position is computed and corrected automatically. The mission of HOUYI is to use computer technology to resist the alienation imposed by shooter video games.

Posted 2022-03-30 | Updated 2024-03-04 | Data Science

A Hitchhiker's Guide to the Age of Big Data

Don’t Panic

Introduction

There is, or will soon be, a problem in this society: most people produce data most of the time, are governed by data at the same time, and yet know almost nothing about it. Humanity has produced a large number of articles in response, but most of them begin with combinations of mathematical symbols, which is strange, because it is not the symbols that are ignorant.

The purpose of this article is to give you everything you need to enter the age of big data, even if you have never heard of calculus. It may look long, and it certainly contains some false or at least imprecise statements, but it surpasses more advanced and old-fashioned academic works in two extremely important ways.

First, it is completely free. Second, at the beginning, in large friendly letters, it says “Don’t Panic”. Its content also lets readers skip difficult formulas and directly obtain the high-level picture. I would call this approach math-free.