omarOG1010/ALMM_Replication

Adaptive Linear Mapping Model (ALMM) Replication

S. Chou et al. Addressing Cold Start for Next-song Recommendation. In Proc. ACM RecSys 2016.

New Data Preparation

The data_cleaning.ipynb notebook prepares and cleans the MIND news recommendation dataset before it is used for modeling. It loads the behaviors.tsv and news.tsv files, which contain user click histories and article information, respectively. Preprocessing parses the tab-separated columns, handles missing values (such as users with no reading history), and sets aside interactions where critical fields are missing. The notebook then computes the time differences between user interactions so that only transitions within a short window of time are kept. The result is a cleaned dataset of interaction triplets, (user, last_news, next_news), restricted to article transitions that happen within a set timeframe (such as under 60 minutes). The cleaned triplets are saved to triplets_under_60.csv, so that only temporally close interactions are used to train downstream recommendation models such as ALMM.
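The time-window filter described above can be sketched with pandas. This is a minimal illustration, not the notebook's code: the input table `clicks`, its column names (`user`, `news_id`, `time`), and the function `build_triplets` are hypothetical, and it assumes that triplets are formed from a user's consecutive clicks and that per-click timestamps are already available.

```python
import pandas as pd


def build_triplets(clicks: pd.DataFrame, max_gap_minutes: float = 60.0) -> pd.DataFrame:
    """Return (user, last_news, next_news) rows for consecutive clicks by the
    same user that are less than `max_gap_minutes` apart.

    `clicks` is a hypothetical table with columns: user, news_id, time.
    """
    # Put each user's clicks in chronological order.
    ordered = clicks.sort_values(["user", "time"], kind="stable").reset_index(drop=True)

    # Pair every click with the same user's following click.
    next_news = ordered.groupby("user")["news_id"].shift(-1)
    next_time = ordered.groupby("user")["time"].shift(-1)
    gap_minutes = (next_time - ordered["time"]).dt.total_seconds() / 60.0

    # Drop each user's final click (no successor) and pairs that are too far apart.
    keep = next_news.notna() & (gap_minutes < max_gap_minutes)
    return pd.DataFrame({
        "user": ordered.loc[keep, "user"],
        "last_news": ordered.loc[keep, "news_id"],
        "next_news": next_news[keep],
    }).reset_index(drop=True)


# Toy data with made-up IDs and timestamps.
clicks = pd.DataFrame({
    "user": ["U1", "U1", "U1", "U2", "U2"],
    "news_id": ["N1", "N2", "N3", "N4", "N5"],
    "time": pd.to_datetime([
        "2019-11-11 09:00", "2019-11-11 09:20", "2019-11-11 11:00",
        "2019-11-11 10:00", "2019-11-11 10:59",
    ]),
})
triplets = build_triplets(clicks)
```

In this toy data the N2 to N3 transition is dropped because the two clicks are 100 minutes apart. A table like `triplets` could then be written out with `triplets.to_csv("triplets_under_60.csv", index=False)`.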

The build_feat_BERT.ipynb notebook fine-tunes a lightweight version of BERT, specifically prajjwal1/bert-mini, on the news dataset using a masked language modeling (MLM) objective. It first loads the small BERT model and its tokenizer, then prepares the dataset by combining the title, abstract, category, and subcategory fields of each news article into a single text input. These texts are tokenized and passed into the BERT model, which is trained over several epochs so that it adapts to the language and style of the news articles. After fine-tuning, the updated model is saved locally to the ./bert_news_finetuned_mini/ directory. This fine-tuning notebook does not generate feature vectors itself; that task is handled separately in the final feature extraction notebook.

In the feature extraction notebook, the saved fine-tuned BERT-mini model is loaded back and all news articles are reprocessed: each article’s fields are merged into a full text, batched into groups (e.g., 32 articles at a time), tokenized, and fed through the model. For each article, the resulting hidden states are mean-pooled into a single 256-dimensional vector. These vectors are collected into a dictionary keyed by news ID and saved as a .pkl file (news_id_to_feature_full_finetuned_bert.pkl) that contains all article embeddings. This .pkl file is what will be used for training the ALMM recommendation system. In short, build_feat_BERT.ipynb adapts BERT to the news text, and the feature extraction notebook turns the fine-tuned model’s outputs into feature vectors for downstream recommendation.
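The mean-pooling step can be illustrated without loading a model. The sketch below uses NumPy arrays in place of real BERT outputs; the function name `mean_pool`, the example IDs, and the decision to exclude padded positions via the attention mask are assumptions, since the text above does not say whether the notebook masks padding.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per article, ignoring padded positions.

    hidden_states:  (batch, seq_len, hidden_dim) last-layer outputs
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero if a row contains only padding.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Stand-in for one batch of model outputs: 2 articles, 4 tokens, 256 dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 256))
mask = np.array([[1, 1, 0, 0],   # first article has two real tokens
                 [1, 1, 1, 1]])  # second article has no padding
pooled = mean_pool(hidden, mask)

# Collect the vectors into a dictionary keyed by (made-up) news IDs.
news_ids = ["N1", "N2"]
news_id_to_feature = dict(zip(news_ids, pooled))
```

With a real model, `hidden` would correspond to the last hidden states for a tokenized batch and `mask` to the tokenizer's attention mask; the resulting dictionary is the kind of object that would be pickled.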

The build_feat.ipynb notebook constructs a feature vector for each news article using traditional text-based methods rather than neural models like BERT. It reads the news.tsv dataset and extracts the title, abstract, category, and subcategory fields for each article. These fields are combined into a single text input that describes the article’s content. Instead of passing the texts through a neural network, the notebook uses TfidfVectorizer from scikit-learn, which transforms each article into a sparse, high-dimensional vector in which each term is weighted by its frequency in the article and its rarity across the collection. After the TF-IDF vectors are computed, they are converted into a dictionary mapping each news_id to its corresponding feature vector. Finally, this dictionary is saved as a .pkl file (news_id_to_feature_advanced.pkl) that can later be loaded into the ALMM model. Thus, while build_feat_BERT.ipynb uses learned semantic representations from a language model, build_feat.ipynb relies on simpler, frequency-based text features to describe articles.
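A minimal sketch of this TF-IDF pipeline follows. The sample articles, the helper `combine_fields`, and the default `TfidfVectorizer()` settings are assumptions for illustration; the notebook's actual vectorizer parameters are not stated here.

```python
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up articles standing in for rows of news.tsv.
news = [
    {"news_id": "N1", "title": "Local team wins final",
     "abstract": "A late goal decided the match.",
     "category": "sports", "subcategory": "soccer"},
    {"news_id": "N2", "title": "Markets close higher",
     "abstract": "", "category": "finance", "subcategory": "markets"},
]


def combine_fields(row: dict) -> str:
    """Join the non-empty text fields of one article into a single string."""
    fields = ("title", "abstract", "category", "subcategory")
    return " ".join(row[f] for f in fields if row.get(f))


texts = [combine_fields(row) for row in news]

# Default settings are an assumption; they L2-normalize each row.
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)  # sparse, shape (n_articles, vocab_size)

# Map each news_id to its dense TF-IDF vector.
news_id_to_feature = {
    row["news_id"]: matrix[i].toarray().ravel() for i, row in enumerate(news)
}

# Serialize the dictionary; writing these bytes to a file gives a .pkl file.
payload = pickle.dumps(news_id_to_feature)
```

Densifying every row is only practical for small vocabularies; for a full dataset, keeping the rows sparse or limiting the vocabulary (for example with the `max_features` parameter) may be preferable.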

confirm_plks shows how to access the data inside the .pkl files and confirms that the news article IDs exist in both the .pkl file and the CSV file.
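A check of this kind might look like the sketch below. The function `missing_news_ids` and the toy data are hypothetical, and the column names `last_news` and `next_news` are assumed to match the triplet layout described earlier; which .pkl and CSV files the notebook actually compares is not stated here.

```python
import pandas as pd


def missing_news_ids(features: dict, triplets: pd.DataFrame) -> set:
    """Return news IDs referenced in the triplets that have no feature vector."""
    referenced = set(triplets["last_news"]) | set(triplets["next_news"])
    return referenced - set(features)


# Toy stand-ins for a loaded .pkl dictionary and a loaded triplet CSV.
features = {"N1": [0.1, 0.2], "N2": [0.3, 0.4]}
triplets = pd.DataFrame({
    "user": ["U1", "U1"],
    "last_news": ["N1", "N2"],
    "next_news": ["N2", "N3"],
})
missing = missing_news_ids(features, triplets)
```

Here `missing` contains only "N3", the one ID without a feature vector. With real files, `features` would come from `pickle.load` on an opened .pkl file and `triplets` from `pd.read_csv`; an empty result means every referenced article has features.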
