Cherry Blossom Peak Bloom Predictions

Author

Student

Published

February 28, 2025

Abstract

This experiment attempts to forecast peak bloom dates by using a temperature-based time-series pattern matching technique. By using historical weather data and bloom observations, the method identifies analogous climate sequences from past years to project future bloom timings.

The framework used is Fast Dynamic Time Warping (FastDTW), which compares recent temperature patterns with historical patterns, determining the most probable position in the bloom cycle. The dataset consists of daily temperature records for many years, formatted to represent the cycles of dormancy and bloom. To improve predictive accuracy, we introduce derived features including cumulative degree days, rolling temperature sums, and sum of squares of temperature values, and more.

When testing the model, partial temperature sequences from historical bloom years are selected and compared in alignment to actual bloom dates. By iteratively sliding time windows over historical records, we can evaluate how closely predicted bloom dates match observed events through minimizng the Euclidean norm.

Despite the many limitations in this design, it may prove beneficial to track bloom dates in terms of cyles and patterns. This tool of sliding windows to find patterns in the data can allow us to forecast any time-series datasets.

Introduction

This project aims to predict Cherry Blossom peak bloom dates (PBD) for 2025 at 5 locations around the world. Historical data on Cherry Blossom PBD was used to model bloom cycles, where the beginning of the cycle represents one day after previous PBD, and the final day represents the current PBD. These cycles are compared to one another, mapping similar trends on functions of temperature. In particular, the days leading up to the bloom were closely analyzed, allowing us to theoretically input a slice of current temperature data in order to predict how far along the cycle we are, thus generating an estimate for the future PBD.

Setup

This project uses python and related packages. Please ensure python is installed, as well as all other dependencies listed below:

import pandas as pd 
import numpy as np 
import datetime as dt
import matplotlib.pyplot as plt
from fastdtw import fastdtw
import warnings
warnings.filterwarnings('ignore')

# Import Bloom Datasets
washingtondc = pd.read_csv('data/washingtondc.csv')
liestal = pd.read_csv('data/liestal.csv')
kyoto = pd.read_csv('data/kyoto.csv')
vancouver = pd.read_csv('data/vancouver.csv')
newyorkcity = pd.read_csv('data/nyc.csv')

# Create and engineer relevant features for Washington DC 
# Years not included (missing data): 2000 
wdc_weather_df = pd.read_csv('data/wdc_weather1990_2024.csv')
wdc_weather_df.reset_index(drop=True, inplace=True)
wdc_weather_df['Date'] = pd.to_datetime(wdc_weather_df['Date'])
wdc_weather_df.sort_values(by='Date',inplace=True)
wdc_weather_df.drop(columns=['index'], inplace=True)
wdc_weather_df

	Temp_max	Temp_avg	Temp_min	Dew_max	Dew_avg	Dew_min	Hum_max	Hum_avg	Hum_min	Wind_max	Wind_avg	Wind_min	Pres_max	Pres_avg	Pres_min	Precipitation	Date
0	46	39.4	0	45	25.3	0	96	55.0	0	24	17.5	8	30.3	29.9	29.6	0.00	1990-01-01
1	46	36.3	29	28	23.0	20	86	60.5	39	12	8.0	6	30.5	30.4	30.3	0.00	1990-01-02
2	50	37.0	30	30	27.2	22	89	70.2	39	12	5.8	2	30.4	30.3	30.3	0.00	1990-01-03
3	55	32.8	0	51	27.3	0	97	74.4	0	17	10.4	3	30.3	30.0	29.9	0.00	1990-01-04
4	51	45.4	42	46	36.5	31	100	72.8	50	15	8.0	0	30.2	30.1	29.9	0.00	1990-01-05
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
12412	49	42.0	36	38	30.5	27	82	64.4	50	7	3.0	0	30.5	30.4	30.3	0.00	2024-12-27
12413	53	47.0	42	50	44.2	39	94	89.8	86	12	4.6	0	30.3	30.2	30.2	0.29	2024-12-28
12414	67	56.5	50	56	51.6	46	93	84.2	59	21	8.0	0	30.2	30.0	29.6	0.35	2024-12-29
12415	63	55.8	44	53	37.5	27	86	54.4	27	21	9.0	3	29.9	29.7	29.6	0.03	2024-12-30
12416	63	50.3	40	47	40.5	35	90	71.5	41	32	8.2	0	29.9	29.7	29.5	0.00	2024-12-31

12417 rows × 17 columns

wdc_bloomdates_df = pd.DataFrame(washingtondc)
wdc_bloomdates_df = wdc_bloomdates_df[wdc_bloomdates_df['year'] >= 1990]
wdc_bloomdates_df = wdc_bloomdates_df.drop(wdc_bloomdates_df[wdc_bloomdates_df['year'] == 2000].index)
wdc_bloomdates_df['bloom_date'] = pd.to_datetime(wdc_bloomdates_df['bloom_date'])
wdc_bloomdates_df.reset_index(drop=True, inplace=True)
wdc_bloomdates_df

	location	lat	long	year	bloom_date	bloom_doy
0	washingtondc	38.88535	-77.038628	1990	1990-03-15	74
1	washingtondc	38.88535	-77.038628	1991	1991-03-29	88
2	washingtondc	38.88535	-77.038628	1992	1992-04-07	98
3	washingtondc	38.88535	-77.038628	1993	1993-04-11	101
4	washingtondc	38.88535	-77.038628	1994	1994-04-05	95
5	washingtondc	38.88535	-77.038628	1995	1995-04-02	92
6	washingtondc	38.88535	-77.038628	1996	1996-04-04	95
7	washingtondc	38.88535	-77.038628	1997	1997-03-26	85
8	washingtondc	38.88535	-77.038628	1998	1998-03-27	86
9	washingtondc	38.88535	-77.038628	1999	1999-04-05	95
10	washingtondc	38.88535	-77.038628	2001	2001-04-06	96
11	washingtondc	38.88535	-77.038628	2002	2002-04-02	92
12	washingtondc	38.88535	-77.038628	2003	2003-04-02	92
13	washingtondc	38.88535	-77.038628	2004	2004-03-31	91
14	washingtondc	38.88535	-77.038628	2005	2005-04-09	99
15	washingtondc	38.88535	-77.038628	2006	2006-03-30	89
16	washingtondc	38.88535	-77.038628	2007	2007-04-01	91
17	washingtondc	38.88535	-77.038628	2008	2008-03-26	86
18	washingtondc	38.88535	-77.038628	2009	2009-04-01	91
19	washingtondc	38.88535	-77.038628	2010	2010-03-31	90
20	washingtondc	38.88535	-77.038628	2011	2011-03-29	88
21	washingtondc	38.88535	-77.038628	2012	2012-03-20	80
22	washingtondc	38.88535	-77.038628	2013	2013-04-09	99
23	washingtondc	38.88535	-77.038628	2014	2014-04-10	100
24	washingtondc	38.88535	-77.038628	2015	2015-04-10	100
25	washingtondc	38.88535	-77.038628	2016	2016-03-25	85
26	washingtondc	38.88535	-77.038628	2017	2017-03-25	84
27	washingtondc	38.88535	-77.038628	2018	2018-04-05	95
28	washingtondc	38.88535	-77.038628	2019	2019-04-01	91
29	washingtondc	38.88535	-77.038628	2020	2020-03-20	80
30	washingtondc	38.88535	-77.038628	2021	2021-03-28	87
31	washingtondc	38.88535	-77.038628	2022	2022-03-21	80
32	washingtondc	38.88535	-77.038628	2023	2023-03-23	82
33	washingtondc	38.88535	-77.038628	2024	2024-03-17	77

Now we have two dataframes:

wdc_weather_df, daily weather data wdc_bloomdates_df, with bloom dates for each year

Since we want to compare a bloom cycle, i.e from dormancy to bloom, we will count each year’s bloom as starting from the day after the previous bloom and ending with the current bloom.

temp_of_day_from_bloom_df:

rows: day of bloom (1 denotes first day after previous bloom)
column: cycle (year)

For example, if the column name is 1992_b, this means that this column the days leading up to the bloom in 1992. Thus the first row will take place in 1991, after the previous bloom 1991_b. The last value, prior to any NaN, is the bloom date.

Each column represents one cycle, from post-bloom dormancy to the next bloom.

We will also take the inverse of the dataframe, and call it temp_of_day_till_bloom_df:

rows: day till bloom (1 denotes one day prior to the current bloom, and 0 be the bloom itself.)
column: cycle (year)

bloom_dates = wdc_bloomdates_df['bloom_date']
temp_of_day_from_bloom_df = pd.DataFrame() 

for i in range(len(bloom_dates) - 1):
    data = wdc_weather_df[(wdc_weather_df['Date'] > bloom_dates[i]) & (wdc_weather_df['Date'] <= bloom_dates[i+1])]
    data.reset_index(drop=True, inplace=True)
    data["day_from_bloom"] = data.index + 1
    pivot_df = data.pivot_table(index='day_from_bloom', values='Temp_avg')
    pivot_df.columns=[str(bloom_dates[i+1].year)+'_b']
    temp_of_day_from_bloom_df = pd.concat([temp_of_day_from_bloom_df, pivot_df], axis=1)

temp_of_day_till_bloom_df = pd.DataFrame()
inv_temp_of_day_from_bloom_df = temp_of_day_from_bloom_df.iloc[::-1]

for col in inv_temp_of_day_from_bloom_df.columns:
    nan_count = inv_temp_of_day_from_bloom_df[col].isna().sum()

    # Remove NaN
    temp_of_day_till_bloom_df[col] = inv_temp_of_day_from_bloom_df[col].iloc[nan_count:].reset_index(drop=True)

temp_of_day_till_bloom_df = temp_of_day_till_bloom_df.head(330)
temp_of_day_till_bloom_df['days_till_bloom'] = temp_of_day_till_bloom_df.index
temp_of_day_till_bloom_df

	1991_b	1992_b	1993_b	1994_b	1995_b	1996_b	1997_b	1998_b	1999_b	2001_b	...	2016_b	2017_b	2018_b	2019_b	2020_b	2021_b	2022_b	2023_b	2024_b	days_till_bloom
0	36.7	57.1	53.3	55.8	42.7	56.6	52.5	65.7	51.1	58.1	...	64.3	64.9	45.2	41.8	70.9	59.8	58.1	65.3	59.3	0
1	71.9	50.8	27.7	51.3	45.3	54.7	47.4	58.2	64.1	52.8	...	64.6	47.2	55.5	55.3	56.3	61.5	55.6	56.0	56.5	1
2	59.5	45.8	43.0	59.1	47.7	45.5	42.3	43.1	58.2	51.7	...	61.4	38.5	48.7	63.8	51.4	70.7	65.8	48.0	65.0	2
3	50.1	41.0	52.8	55.8	51.8	24.4	44.0	41.2	60.1	49.4	...	49.3	47.2	45.8	59.3	51.1	58.6	54.2	41.1	61.5	3
4	51.8	38.4	47.8	51.8	48.3	52.3	54.8	42.0	58.3	48.4	...	44.2	53.6	55.7	49.6	46.2	56.4	54.8	36.9	58.3	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
325	67.6	61.1	64.3	69.0	64.0	52.3	50.9	63.0	69.5	61.0	...	72.7	56.2	66.0	73.9	66.9	52.0	64.2	65.1	60.5	325
326	63.0	50.4	60.4	62.7	61.3	35.4	55.9	60.4	58.0	62.0	...	71.5	64.1	65.3	73.0	54.9	53.6	75.6	58.9	56.9	326
327	59.6	75.8	60.8	65.2	62.3	67.4	65.3	58.1	57.8	59.7	...	69.3	62.1	53.5	68.9	60.0	66.3	73.9	57.1	53.4	327
328	44.3	77.3	46.3	67.7	60.5	65.4	63.4	66.3	53.5	65.6	...	63.5	54.6	54.4	67.5	63.3	66.1	66.5	54.8	53.3	328
329	50.2	68.7	48.5	75.0	43.2	39.1	56.5	60.3	54.9	71.1	...	57.6	53.7	55.9	66.8	65.4	61.6	57.0	49.3	57.4	329

330 rows × 34 columns

Now, let’s add some more complicated features onto the dataframe. We will look at functions of temperature, as it is theorized that bloom cycles occur after certain thresholds of temperature are met. The following functions were used as features:

All data is created and stored in the new dataframe, df.

Each feature name begins with the year of bloom, followed by the following suffix. For example, the 1991 bloom cycle will have the features:

1991_b : daily average Temperature
1991_b_sum_sq : sum of squared daily temperature, rolling over last 21 days
1991_b_sum_log : sum of log daily temperature, rolling over last 21 days

… and so on. Below are definitions of all features

Feature	10
_sum_sq	sum of squared daily temperature, rolling over last 21 days
_sum_log	sum of log daily temperature, rolling over last 21 days
_sum_dd32	sum of number of days where temperature exceeds 32 F, rolling over last 21 days
_sum_dd50	sum of number of days where temperature exceeds 50 F, rolling over last 21 days
_sum_last14	sum of daily temperature, rolling over last 14 days
_norm_dd_32	sum of days where days above 32F are (+1) and days below are (-1), rolling over last 21 days
_norm_dd_50	sum of days where days above 50F are (+1) and days below are (-1), rolling over last 21 days
_rolling_freeze	sum of number of days where temperature is below 32 F, rolling over last 21 days
_rolling_chill	sum of number of days where temperature is between 32 F and 50 F, rolling over last 21 days
_rolling_temp_change	temperature change between last 7 days

df = temp_of_day_till_bloom_df.copy()

# Functions of temperature for each bloom cycle
for col in df.columns:
    if col.endswith('_b'):
        df[f'{col}_sum_sq'] = df[col][::-1].pow(2).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_log'] = df[col][::-1].apply(lambda x: np.log(x + 1)).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_dd32'] = (df[col][::-1] > 32).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_dd50'] = (df[col][::-1] > 50).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_sum_last14'] = df[col][::-1].rolling(window=14, min_periods=1).sum()[::-1]
        df[f'{col}_norm_dd_32'] = ((df[col] > 32).astype(int) * 1 + (df[col] <= 32).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_norm_dd_50'] = ((df[col] > 50).astype(int) * 1 + (df[col] <= 50).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_rolling_freeze'] = (df[col][::-1] < 32).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_rolling_chill'] = ((df[col][::-1] >= 32) & (df[col][::-1] < 50)).rolling(window=21, min_periods=1).sum()[::-1]
        df[f'{col}_rolling_temp_change'] = df[col][::-1].diff(periods=7)[::-1]

cols = ['days_till_bloom'] + [c for c in df.columns if c != 'days_till_bloom']
df = df[cols]
df = df.head(80)
df

	days_till_bloom	1991_b	1992_b	1993_b	1994_b	1995_b	1996_b	1997_b	1998_b	1999_b	...	2024_b_sum_sq	2024_b_sum_log	2024_b_sum_dd32	2024_b_sum_dd50	2024_b_sum_last14	2024_b_norm_dd_32	2024_b_norm_dd_50	2024_b_rolling_freeze	2024_b_rolling_chill	2024_b_rolling_temp_change
0	0	36.7	57.1	53.3	55.8	42.7	56.6	52.5	65.7	51.1	...	60490.21	83.770057	21.0	15.0	770.8	1.0	1.0	0.0	6.0	13.3
1	1	71.9	50.8	27.7	51.3	45.3	54.7	47.4	58.2	64.1	...	58387.48	83.323977	21.0	14.0	765.1	2.0	2.0	0.0	7.0	7.8
2	2	59.5	45.8	43.0	59.1	47.7	45.5	42.3	43.1	58.2	...	57202.27	83.096476	21.0	13.0	756.0	3.0	3.0	0.0	8.0	12.9
3	3	50.1	41.0	52.8	55.8	51.8	24.4	44.0	41.2	60.1	...	55139.52	82.767551	21.0	12.0	734.9	4.0	4.0	0.0	9.0	3.9
4	4	51.8	38.4	47.8	51.8	48.3	52.3	54.8	42.0	58.3	...	53054.71	82.374805	21.0	11.0	714.6	5.0	5.0	0.0	10.0	5.4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
75	75	36.6	28.1	30.6	7.9	42.0	23.9	38.0	43.8	44.3	...	42792.07	80.265438	21.0	4.0	629.6	5.0	-19.0	0.0	17.0	-6.1
76	76	30.5	29.0	34.2	2.1	46.6	35.3	32.5	44.5	45.2	...	42665.03	80.226121	21.0	4.0	627.4	7.0	-19.0	0.0	17.0	-6.4
77	77	13.3	36.8	45.2	18.3	42.4	24.9	33.2	56.7	48.7	...	42521.03	80.182211	21.0	4.0	638.0	9.0	-19.0	0.0	17.0	-0.8
78	78	41.5	29.8	43.1	12.0	44.5	31.9	39.0	62.0	45.3	...	43692.63	80.408414	21.0	5.0	643.8	11.0	-19.0	0.0	16.0	1.1
79	79	31.8	20.1	23.9	11.5	53.1	24.1	52.4	60.5	39.0	...	43921.88	80.461788	21.0	5.0	645.5	13.0	-19.0	0.0	16.0	7.5

80 rows × 364 columns

Let’s see what our data looks like.

# Show raw data

plt.figure(figsize=(10, 10))

for year in temp_of_day_from_bloom_df.columns:
    bloom_cycle = df[year]
    plt.plot(bloom_cycle.index, bloom_cycle.values, label=str(year))

plt.xlabel('Number of Days Till Bloom')
plt.ylabel('Temperature in F')
plt.title('Temperature Over Bloom Cycles')
plt.legend(title="Year", bbox_to_anchor=(1, 1), loc='upper left')
plt.show()

window = 7

plt.figure(figsize=(10, 10))  

for year in temp_of_day_from_bloom_df.columns:
    bloom_cycle = df[year]
    smoothed = bloom_cycle.rolling(window=window).mean()  
    plt.plot(smoothed.index, smoothed.values, label=str(year))

plt.xlabel('Number of Days Till Bloom')
plt.ylabel('Temperature (Temp Avg), in F')
plt.title('Smoothed Temperature Change Over Bloom Cycles')
plt.legend(title="Year", bbox_to_anchor=(1.05, 1), loc='upper left')

plt.tight_layout()
plt.show()

Looking at continuous functions of temperature:

feature_suffixes_cont = ['_sum_sq', '_sum_log', '_sum_last14', '_rolling_temp_change']
for suffix in feature_suffixes_cont:
    cv = df[[col for col in df.columns if col.endswith(suffix)]]
    plt.figure(figsize=(10, 3)) 
    plt.xlabel('Number of Days Till Bloom')
    plt.xticks(np.arange(0, 115, 5))
    plt.title('Patterns of ' + suffix)
    plt.ylabel(suffix)
    for year in cv.columns:
        bloom_cycle = df[year]
        plt.plot(bloom_cycle.index,bloom_cycle.values)
    plt.show()

Now discrete functions of temperature:

feature_suffixes_disc = ['_sum_dd32', '_sum_dd50','_norm_dd_32', '_norm_dd_50','_rolling_freeze', '_rolling_chill',]
for suffix in feature_suffixes_disc:
    cv = df[[col for col in df.columns if col.endswith(suffix)]]
    plt.figure(figsize=(10, 3))
    plt.xlabel('Number of Days Till Bloom')
    plt.xticks(np.arange(0, 115, 5))
    plt.title('Patterns of ' + suffix)
    plt.ylabel(suffix)
    for year in cv.columns:
        bloom_cycle = df[year]
        plt.plot(bloom_cycle.index,bloom_cycle.values,
                 marker='o', linestyle='-', markersize=4, label=year
)
    plt.show()

Prediction

Now let’s define some methods that will help us experiment and predict on data.

# DEFINE METHODS FOR PREDICTING BLOOM FOR 2025

def extract_historical_feature_sets(df, feature_suffixes):
    """
    Returns a dict where keys are feature suffixes, and values are DF of those suffixes.
    """
    feature_sets = {}
    for suffix in feature_suffixes:
        historical_features = df[[col for col in df.columns if col.endswith(suffix)]]
        feature_sets[suffix] = historical_features
    return feature_sets

def predict_from_arrays(historical_feature_sets, test_sequences):
    """
    - historical_feature_sets: dict where keys are feature suffixes, and values are DF of those suffixes
    - test_sequences: dict where keys are feature suffixes, values are test numpy arrays.
        
    Returns:
        - results: dict with best-matching year, best days_till_bloom index, and distance for each feature type.
    """
    results = {}
    
    for suffix, test_seq in test_sequences.items():
        best_match = None
        best_distance = float('inf')
        best_days_till_bloom = None

        historical_data = historical_feature_sets[suffix]

        for year in historical_data.columns:
            hist_seq = historical_data[year].values.flatten()
                        
            # FastDTW over a sliding window
            for days_till_bloom in range(len(hist_seq) - len(test_seq) + 1):
                window = hist_seq[days_till_bloom: days_till_bloom + len(test_seq)]
                distance, _ = fastdtw(window, test_seq, dist=2)
                
                if distance < best_distance:
                    best_distance = distance
                    best_match = year
                    best_days_till_bloom = days_till_bloom
        
        results[suffix] = (best_match, best_days_till_bloom, best_distance)
        #print(f"Feature: {suffix}, Best Match: {best_match}, Days_till_bloom: {best_days_till_bloom}, Distance: {best_distance}")

    return results

# feature suffixes
feature_suffixes = ['_sum_sq', '_sum_log', '_sum_dd32', '_sum_dd50', '_sum_last14', 
    '_norm_dd_32', '_norm_dd_50','_rolling_freeze', '_rolling_chill', '_rolling_temp_change'
]

# historical feature sets
historical_feature_sets = extract_historical_feature_sets(df, feature_suffixes)

Here, the prediction function uses FastDTW to compare vectors of time series. It finds the lowest Euclidean norm (L2) between a given input and a historical data. This slice of temperature input is thought to be “sliding” across the time, checking for patterns and attempting to match it to a historical bloom cycle. After completing the computations, it returns the lowest L2 norm and what the implied days_till_bloom is. This can be used to predict where we currently are in the cycle, which intuitively allows us to estimate when it will bloom. This process was used due to the fact that future temperature data would not need to be modeled or predicted.

# BUILDING 2025 TEMPERATURE SET DATAFRAME FOR 5 LOCATIONS

wdc_weather2025 = pd.read_csv('data/wdc_weather2025.csv')
liestal_weather2025 = pd.read_csv('data/liestal_weather2025.csv')
kyoto_weather2025 = pd.read_csv('data/kyoto_weather2025.csv')
vancouver_weather2025 = pd.read_csv('data/vancouver_weather2025.csv')
nyc_weather2025 = pd.read_csv('data/nyc_weather2025.csv')

wdc_weather2025['Date'] = pd.to_datetime(wdc_weather2025['Date'])
liestal_weather2025['Date'] = pd.to_datetime(liestal_weather2025['Date'])
kyoto_weather2025['Date'] = pd.to_datetime(kyoto_weather2025['Date'])
vancouver_weather2025['Date'] = pd.to_datetime(vancouver_weather2025['Date'])
nyc_weather2025['Date'] = pd.to_datetime(nyc_weather2025['Date'])

wdc_weather2025.name = "Washington DC"
liestal_weather2025.name = "Liestal"
kyoto_weather2025.name = "Kyoto"
vancouver_weather2025.name = "Vancouver"
nyc_weather2025.name = "NYC"

all_locations = [wdc_weather2025,liestal_weather2025, kyoto_weather2025, vancouver_weather2025, nyc_weather2025]

This function computes the temperature prediction for all 5 locations:

data = []
for loc in all_locations:
    temp_2025_dc = loc.sort_values(by='Date', ascending=False)[['Temp_avg']]
    temp_2025_dc.columns=['2025_b']
    for col in temp_2025_dc.columns:
        if col.endswith('_b'):
            temp_2025_dc[f'{col}_sum_sq'] = temp_2025_dc[col][::-1].pow(2).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_log'] = temp_2025_dc[col][::-1].apply(lambda x: np.log(x + 1)).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_dd32'] = (temp_2025_dc[col][::-1] > 32).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_dd50'] = (temp_2025_dc[col][::-1] > 50).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_sum_last14'] = temp_2025_dc[col][::-1].rolling(window=14, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_norm_dd_32'] = ((temp_2025_dc[col] > 32).astype(int) * 1 + (temp_2025_dc[col] <= 32).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_norm_dd_50'] = ((temp_2025_dc[col] > 50).astype(int) * 1 + (temp_2025_dc[col] <= 50).astype(int) * -1).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_rolling_freeze'] = (temp_2025_dc[col][::-1] < 32).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_rolling_chill'] = ((temp_2025_dc[col][::-1] >= 32) & (temp_2025_dc[col][::-1] < 50)).rolling(window=21, min_periods=1).sum()[::-1]
            temp_2025_dc[f'{col}_rolling_temp_change'] = temp_2025_dc[col][::-1].diff(periods=7)[::-1]
    temp_2025_dc = temp_2025_dc.head(50)

    all_results = []
    intervals = [10,20,30,40]

    for interval in intervals:
        temp_2025_dc_df = temp_2025_dc.head(interval)

        test_sequences = {
            '_sum_sq': np.array(temp_2025_dc_df['2025_b_sum_sq']),
            '_sum_log': np.array(temp_2025_dc_df['2025_b_sum_log']),
            '_sum_dd32': np.array(temp_2025_dc_df['2025_b_sum_dd32']),
            '_sum_dd50': np.array(temp_2025_dc_df['2025_b_sum_dd50']),
            '_sum_last14': np.array(temp_2025_dc_df['2025_b_sum_last14']),
            '_norm_dd_32': np.array(temp_2025_dc_df['2025_b_norm_dd_32']),
            '_norm_dd_50': np.array(temp_2025_dc_df['2025_b_norm_dd_50']),
            '_rolling_freeze': np.array(temp_2025_dc_df['2025_b_rolling_freeze']),
            '_rolling_chill': np.array(temp_2025_dc_df['2025_b_rolling_chill']),
            '_rolling_temp_change': np.array(temp_2025_dc_df['2025_b_rolling_temp_change']),

        }

        all_results.append((interval, predict_from_arrays(historical_feature_sets, test_sequences)))

    #all_results
    print("Results for " + loc.name)

    for result in all_results:
        print(result)

    arrays = []
    for j in range(len(feature_suffixes)):
        arrays.append([])

    for i in range(4):
        for j in range(len(feature_suffixes)):
            arrays[j].append(all_results[i][1][feature_suffixes[j]][1])
    
    data.append((loc.name, arrays))
    
    print("----- ----- -----")

Results for Washington DC
(10, {'_sum_sq': ('1991_b_sum_sq', 24, 3852.680000000011), '_sum_log': ('2013_b_sum_log', 41, 1.0494271412236884), '_sum_dd32': ('1991_b_sum_dd32', 0, 0.0), '_sum_dd50': ('1992_b_sum_dd50', 31, 0.0), '_sum_last14': ('2005_b_sum_last14', 23, 54.799999999999955), '_norm_dd_32': ('1994_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('2018_b_norm_dd_50', 4, 0.0), '_rolling_freeze': ('1991_b_rolling_freeze', 0, 0.0), '_rolling_chill': ('1999_b_rolling_chill', 34, 0.0), '_rolling_temp_change': ('1997_b_rolling_temp_change', 59, 24.800000000000008)})
(20, {'_sum_sq': ('2013_b_sum_sq', 37, 13526.310000000023), '_sum_log': ('2005_b_sum_log', 33, 5.482217935625627), '_sum_dd32': ('1991_b_sum_dd32', 6, 0.0), '_sum_dd50': ('1998_b_sum_dd50', 17, 0.0), '_sum_last14': ('2024_b_sum_last14', 46, 127.79999999999995), '_norm_dd_32': ('1994_b_norm_dd_32', 0, 2.0), '_norm_dd_50': ('2018_b_norm_dd_50', 4, 2.0), '_rolling_freeze': ('1991_b_rolling_freeze', 6, 0.0), '_rolling_chill': ('2001_b_rolling_chill', 55, 4.0), '_rolling_temp_change': ('2011_b_rolling_temp_change', 38, 71.59999999999997)})
(30, {'_sum_sq': ('1991_b_sum_sq', 34, 18253.010000000024), '_sum_log': ('2009_b_sum_log', 22, 8.464229549028587), '_sum_dd32': ('1991_b_sum_dd32', 1, 7.0), '_sum_dd50': ('1991_b_sum_dd50', 36, 0.0), '_sum_last14': ('2009_b_sum_last14', 23, 191.9999999999996), '_norm_dd_32': ('1991_b_norm_dd_32', 1, 7.0), '_norm_dd_50': ('1993_b_norm_dd_50', 2, 5.0), '_rolling_freeze': ('1991_b_rolling_freeze', 1, 7.0), '_rolling_chill': ('1997_b_rolling_chill', 27, 4.0), '_rolling_temp_change': ('2005_b_rolling_temp_change', 33, 124.59999999999995)})
(40, {'_sum_sq': ('1991_b_sum_sq', 35, 32586.250000000022), '_sum_log': ('2005_b_sum_log', 22, 11.476110915306847), '_sum_dd32': ('1999_b_sum_dd32', 40, 14.0), '_sum_dd50': ('1991_b_sum_dd50', 36, 0.0), '_sum_last14': ('2005_b_sum_last14', 24, 321.99999999999966), '_norm_dd_32': ('2009_b_norm_dd_32', 0, 8.0), '_norm_dd_50': ('2001_b_norm_dd_50', 0, 6.0), '_rolling_freeze': ('1999_b_rolling_freeze', 40, 15.0), '_rolling_chill': ('1997_b_rolling_chill', 27, 9.0), '_rolling_temp_change': ('2019_b_rolling_temp_change', 20, 182.9)})
----- ----- -----
Results for Liestal
(10, {'_sum_sq': ('2002_b_sum_sq', 66, 2793.6500000000015), '_sum_log': ('2004_b_sum_log', 29, 0.779998276216844), '_sum_dd32': ('1997_b_sum_dd32', 27, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2002_b_sum_last14', 69, 30.700000000000045), '_norm_dd_32': ('1992_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('1997_b_rolling_freeze', 27, 0.0), '_rolling_chill': ('2013_b_rolling_chill', 31, 0.0), '_rolling_temp_change': ('2001_b_rolling_temp_change', 18, 26.899999999999988)})
(20, {'_sum_sq': ('1999_b_sum_sq', 18, 10711.30000000001), '_sum_log': ('2008_b_sum_log', 45, 2.978655652284772), '_sum_dd32': ('1997_b_sum_dd32', 27, 0.0), '_sum_dd50': ('1993_b_sum_dd50', 58, 0.0), '_sum_last14': ('1998_b_sum_last14', 43, 80.00000000000034), '_norm_dd_32': ('2002_b_norm_dd_32', 0, 2.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1997_b_rolling_freeze', 27, 0.0), '_rolling_chill': ('2013_b_rolling_chill', 29, 3.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 23, 55.79999999999997)})
(30, {'_sum_sq': ('2013_b_sum_sq', 28, 19976.200000000015), '_sum_log': ('2008_b_sum_log', 22, 7.369746634481103), '_sum_dd32': ('1997_b_sum_dd32', 24, 0.0), '_sum_dd50': ('1993_b_sum_dd50', 48, 0.0), '_sum_last14': ('1998_b_sum_last14', 36, 213.60000000000014), '_norm_dd_32': ('2002_b_norm_dd_32', 0, 2.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1997_b_rolling_freeze', 23, 0.0), '_rolling_chill': ('2013_b_rolling_chill', 26, 7.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 23, 99.09999999999997)})
(40, {'_sum_sq': ('2005_b_sum_sq', 18, 17272.089999999997), '_sum_log': ('2005_b_sum_log', 18, 10.720291321811246), '_sum_dd32': ('1997_b_sum_dd32', 17, 4.0), '_sum_dd50': ('1994_b_sum_dd50', 22, 0.0), '_sum_last14': ('2005_b_sum_last14', 22, 395.2999999999997), '_norm_dd_32': ('2002_b_norm_dd_32', 0, 8.0), '_norm_dd_50': ('1994_b_norm_dd_50', 14, 7.0), '_rolling_freeze': ('1997_b_rolling_freeze', 27, 3.0), '_rolling_chill': ('2013_b_rolling_chill', 16, 16.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 21, 147.8)})
----- ----- -----
Results for Kyoto
(10, {'_sum_sq': ('1997_b_sum_sq', 35, 3806.989999999998), '_sum_log': ('1997_b_sum_log', 35, 0.9963614374403846), '_sum_dd32': ('1997_b_sum_dd32', 0, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2013_b_sum_last14', 31, 29.999999999999773), '_norm_dd_32': ('1992_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('1997_b_rolling_freeze', 0, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 38, 0.0), '_rolling_temp_change': ('2024_b_rolling_temp_change', 68, 21.599999999999994)})
(20, {'_sum_sq': ('1991_b_sum_sq', 32, 11540.369999999988), '_sum_log': ('2012_b_sum_log', 48, 2.9421979249935077), '_sum_dd32': ('1997_b_sum_dd32', 0, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2013_b_sum_last14', 31, 54.19999999999993), '_norm_dd_32': ('1995_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1997_b_rolling_freeze', 0, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 37, 1.0), '_rolling_temp_change': ('2013_b_rolling_temp_change', 28, 59.70000000000001)})
(30, {'_sum_sq': ('1999_b_sum_sq', 19, 20571.660000000007), '_sum_log': ('2012_b_sum_log', 47, 4.449839565681472), '_sum_dd32': ('1998_b_sum_dd32', 17, 0.0), '_sum_dd50': ('1992_b_sum_dd50', 45, 0.0), '_sum_last14': ('1998_b_sum_last14', 37, 136.39999999999986), '_norm_dd_32': ('1997_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1998_b_rolling_freeze', 17, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 28, 11.0), '_rolling_temp_change': ('2001_b_rolling_temp_change', 1, 87.29999999999998)})
(40, {'_sum_sq': ('1999_b_sum_sq', 30, 51632.58000000003), '_sum_log': ('1999_b_sum_log', 30, 11.255820258773852), '_sum_dd32': ('1998_b_sum_dd32', 26, 0.0), '_sum_dd50': ('1993_b_sum_dd50', 12, 0.0), '_sum_last14': ('1999_b_sum_last14', 14, 332.0999999999991), '_norm_dd_32': ('1997_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1998_b_rolling_freeze', 17, 0.0), '_rolling_chill': ('1998_b_rolling_chill', 19, 29.0), '_rolling_temp_change': ('1998_b_rolling_temp_change', 18, 118.39999999999992)})
----- ----- -----
Results for Vancouver
(10, {'_sum_sq': ('1999_b_sum_sq', 69, 3539.059999999994), '_sum_log': ('2004_b_sum_log', 33, 1.8825529727395605), '_sum_dd32': ('1991_b_sum_dd32', 52, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2016_b_sum_last14', 24, 41.80000000000041), '_norm_dd_32': ('1992_b_norm_dd_32', 0, 0.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('1991_b_rolling_freeze', 52, 0.0), '_rolling_chill': ('1991_b_rolling_chill', 53, 0.0), '_rolling_temp_change': ('2010_b_rolling_temp_change', 13, 21.50000000000001)})
(20, {'_sum_sq': ('2003_b_sum_sq', 16, 10951.32), '_sum_log': ('1997_b_sum_log', 42, 3.912478765509377), '_sum_dd32': ('1991_b_sum_dd32', 47, 2.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('1993_b_sum_last14', 0, 205.39999999999975), '_norm_dd_32': ('1994_b_norm_dd_32', 0, 10.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('1991_b_rolling_freeze', 47, 2.0), '_rolling_chill': ('2013_b_rolling_chill', 52, 0.0), '_rolling_temp_change': ('2010_b_rolling_temp_change', 13, 46.600000000000016)})
(30, {'_sum_sq': ('2010_b_sum_sq', 22, 15917.249999999978), '_sum_log': ('2010_b_sum_log', 24, 6.122629691371159), '_sum_dd32': ('2016_b_sum_dd32', 44, 11.0), '_sum_dd50': ('1992_b_sum_dd50', 45, 0.0), '_sum_last14': ('2022_b_sum_last14', 26, 221.99999999999977), '_norm_dd_32': ('1993_b_norm_dd_32', 0, 14.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2016_b_rolling_freeze', 44, 11.0), '_rolling_chill': ('2013_b_rolling_chill', 49, 5.0), '_rolling_temp_change': ('2021_b_rolling_temp_change', 45, 80.49999999999994)})
(40, {'_sum_sq': ('2010_b_sum_sq', 22, 20165.759999999973), '_sum_log': ('2010_b_sum_log', 24, 10.418869760718167), '_sum_dd32': ('2022_b_sum_dd32', 38, 9.0), '_sum_dd50': ('1993_b_sum_dd50', 12, 0.0), '_sum_last14': ('2010_b_sum_last14', 20, 231.6000000000007), '_norm_dd_32': ('1994_b_norm_dd_32', 9, 37.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2022_b_rolling_freeze', 38, 8.0), '_rolling_chill': ('2023_b_rolling_chill', 10, 5.0), '_rolling_temp_change': ('2010_b_rolling_temp_change', 12, 112.39999999999998)})
----- ----- -----
Results for NYC
(10, {'_sum_sq': ('2011_b_sum_sq', 42, 4121.240000000002), '_sum_log': ('2016_b_sum_log', 49, 0.9733589713594881), '_sum_dd32': ('2019_b_sum_dd32', 53, 0.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('2022_b_sum_last14', 39, 58.699999999999875), '_norm_dd_32': ('1994_b_norm_dd_32', 41, 5.0), '_norm_dd_50': ('1996_b_norm_dd_50', 10, 0.0), '_rolling_freeze': ('2019_b_rolling_freeze', 53, 0.0), '_rolling_chill': ('2012_b_rolling_chill', 41, 0.0), '_rolling_temp_change': ('2012_b_rolling_temp_change', 52, 30.899999999999995)})
(20, {'_sum_sq': ('1993_b_sum_sq', 8, 14104.629999999997), '_sum_log': ('1996_b_sum_log', 20, 7.187871515130922), '_sum_dd32': ('2014_b_sum_dd32', 36, 6.0), '_sum_dd50': ('1991_b_sum_dd50', 53, 0.0), '_sum_last14': ('1993_b_sum_last14', 11, 122.59999999999991), '_norm_dd_32': ('1993_b_norm_dd_32', 0, 20.0), '_norm_dd_50': ('1992_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2014_b_rolling_freeze', 36, 6.0), '_rolling_chill': ('2004_b_rolling_chill', 23, 5.0), '_rolling_temp_change': ('2023_b_rolling_temp_change', 40, 62.999999999999986)})
(30, {'_sum_sq': ('1993_b_sum_sq', 7, 20540.860000000004), '_sum_log': ('2010_b_sum_log', 50, 14.727854823709848), '_sum_dd32': ('2005_b_sum_dd32', 28, 7.0), '_sum_dd50': ('1992_b_sum_dd50', 45, 0.0), '_sum_last14': ('1993_b_sum_last14', 9, 280.19999999999965), '_norm_dd_32': ('1993_b_norm_dd_32', 0, 44.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2014_b_rolling_freeze', 36, 7.0), '_rolling_chill': ('2004_b_rolling_chill', 16, 7.0), '_rolling_temp_change': ('2005_b_rolling_temp_change', 33, 105.79999999999998)})
(40, {'_sum_sq': ('1993_b_sum_sq', 8, 23326.34), '_sum_log': ('1993_b_sum_log', 5, 22.32314793928245), '_sum_dd32': ('2005_b_sum_dd32', 31, 9.0), '_sum_dd50': ('1993_b_sum_dd50', 12, 0.0), '_sum_last14': ('1993_b_sum_last14', 11, 408.5999999999997), '_norm_dd_32': ('1996_b_norm_dd_32', 0, 36.0), '_norm_dd_50': ('1993_b_norm_dd_50', 4, 4.0), '_rolling_freeze': ('2009_b_rolling_freeze', 26, 15.0), '_rolling_chill': ('2004_b_rolling_chill', 9, 14.0), '_rolling_temp_change': ('2005_b_rolling_temp_change', 33, 175.00000000000003)})
----- ----- -----

After computing the estimated days_till_bloom for differing interval lengths, we can compare the estimated days. There are many outliers in the prediction, and the predictions can be wildly different at times. The chosen dates were picked in a non-rigorous, intuitive approach. While this may not be an ideal method for repeatability, for this specific prediction it might be okay. That’s because many of the high variance points are very unlikely; for example, points like 0 or 50 mean that the model thinks the trees are blooming today or in 50 days. We can reason intuitively that neither of these cases are/will happen, based on historical blooms. Thus, the median values we are left with are used to create the prediction and prediction intervals.

Table for Washington DC

columns represent the size of the testing interval.
Values represent the predicted days_till_bloom

Feature	10	20	30	40
_sum_sq	24	37	34	32
_sum_log	41	33	22	21
_sum_dd32	0	6	1	40
_sum_dd50	31	17	36	18

…and more values.

In the end, the final predictions for bloom dates (day of year) are:

location	prediction	lower	upper
washingtondc	88	84	91
liestal	85	83	90
kyoto	91	88	97
vancouver	88	84	94
newyorkcity	95	93	97

Translating from day of year to actual dates (in the year 2025):

location	prediction	lower	upper
washingtondc	MAR 29	MAR 25	APR 1
liestal	MAR 26	MAR 24	MAR 31
kyoto	APR 1	MAR 29	APR 7
vancouver	MAR 29	MAR 25	APR 4
newyorkcity	APR 5	APR 3	APR 7

Limitations

While the initial results seem like they have potential, it is important to mention the many limitations of this modeling.

Data set only trained on Washington DC temperatures; may not be comparable to other countries
Training on more years past 1990
Differing cycle lengths may impact prediction
Low historical training data for places like NYC
Different breeds of cherry blossoms might have different cycles and cycle patterns
Not accounting for special impactful events (storms, dry spells, etc)
Likely better results using other variables alongside temperature (emissions, sunlight, rainfall)
Predictions and predictions intervals not chosen using empirical and repeatable method