Adaptive linear mapping model (ALMM) Replication

S. Chou et al. Addressing Cold Start for Next-song Recommendation. In Proc. ACM RecSys 2016.

New Data Preparation

The data_cleaning.ipynb notebook prepares and cleans the MIND news recommendation dataset before it is used for modeling. It loads behaviors.tsv (user click histories) and news.tsv (article information), parses the tab-separated columns, handles missing values such as users with no reading history, and sets aside interactions where critical fields are missing. It then computes the time differences between user interactions and keeps only article transitions that happen within a reasonable timeframe (such as under 60 minutes). The result is a cleaned dataset of interaction triplets (user, last_news, next_news), saved to triplets_under_60.csv, so that only temporally close interactions are used to train recommendation models like ALMM.
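The time-window filter described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name build_triplets, the flat click table with columns user, news_id, and time, and the toy rows are all assumptions made for the example.

```python
import pandas as pd


def build_triplets(clicks: pd.DataFrame, max_gap_minutes: float = 60.0) -> pd.DataFrame:
    """Pair each click with the same user's previous click if the gap is short enough.

    Assumes (hypothetically) one row per click with columns: user, news_id, time.
    """
    clicks = clicks.sort_values(["user", "time"]).copy()
    grouped = clicks.groupby("user", sort=False)
    # Previous article and previous timestamp within each user's history.
    prev_news = grouped["news_id"].shift(1)
    prev_time = grouped["time"].shift(1)
    clicks["last_news"] = prev_news
    clicks["gap_min"] = (clicks["time"] - prev_time).dt.total_seconds() / 60.0
    # The first click of each user has no predecessor, so its gap is NaN and it is dropped.
    keep = clicks["last_news"].notna() & (clicks["gap_min"] < max_gap_minutes)
    out = clicks.loc[keep, ["user", "last_news", "news_id", "gap_min"]]
    return out.rename(columns={"news_id": "next_news"}).reset_index(drop=True)


# Toy click log (made-up IDs and times, for illustration only).
toy_clicks = pd.DataFrame(
    {
        "user": ["U1", "U1", "U1", "U2", "U2"],
        "news_id": ["N1", "N2", "N3", "N4", "N5"],
        "time": pd.to_datetime(
            [
                "2019-11-13 08:00:00",
                "2019-11-13 08:10:00",
                "2019-11-13 10:00:00",
                "2019-11-13 09:00:00",
                "2019-11-13 09:30:00",
            ]
        ),
    }
)

triplets = build_triplets(toy_clicks, max_gap_minutes=60.0)
print(triplets)
```

In this toy log the N2 to N3 transition is 110 minutes apart, so it is filtered out, while the two shorter transitions are kept. Writing the result with triplets.to_csv("triplets_under_60.csv", index=False) would produce a file of the kind the notebook saves, assuming the same column names.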

The build_feat_BERT.ipynb notebook fine-tunes a lightweight version of BERT, prajjwal1/bert-mini, on the news dataset using a masked language modeling (MLM) objective. It loads the model and its tokenizer, builds one text input per article by combining the title, abstract, category, and subcategory fields, tokenizes these texts, and trains for several epochs so that the model adapts to the language and style of the news articles. The fine-tuned model is then saved locally to the ./bert_news_finetuned_mini/ directory. This notebook does not generate feature vectors itself.

Feature extraction is handled separately in the final feature extraction notebook. It reloads the saved fine-tuned BERT-mini model and reprocesses all news articles: each article's fields are merged into a full text, batched into groups (e.g., 32 articles at a time), tokenized, and fed through the model. For each article, the resulting hidden states are mean-pooled into a single 256-dimensional vector. These vectors are collected into a dictionary keyed by news ID and saved as a .pkl file (news_id_to_feature_full_finetuned_bert.pkl), which is the file used for training the ALMM recommendation system. In short, build_feat_BERT.ipynb adapts BERT to the news text, and the feature extraction notebook turns the fine-tuned model's output into feature vectors for downstream recommendation.
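To make the pooling step concrete, here is a minimal NumPy sketch of mean pooling over token hidden states. It uses random arrays in place of real BERT outputs, and it assumes padding tokens are excluded through an attention mask; whether the notebook masks padding this way is not stated, so treat that detail, the function name mean_pool, and the toy IDs as assumptions.

```python
import numpy as np


def mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per article, ignoring padded positions.

    hidden_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts


# Stand-in for model output: 2 articles, 4 tokens each, 256 hidden dimensions.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 256)).astype(np.float32)
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])

pooled = mean_pool(hidden, mask)

# Collect the vectors in a dictionary keyed by news ID (IDs are made up here).
news_ids = ["N1", "N2"]
news_id_to_feature = {nid: vec for nid, vec in zip(news_ids, pooled)}
print(pooled.shape)
```

Each article ends up with one vector of length 256 regardless of how many tokens it had, which is what allows the vectors to be stored in a single dictionary and pickled.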

The build_feat.ipynb notebook constructs a feature vector for each news article using traditional text-based methods rather than neural models like BERT. It reads the news.tsv dataset, extracts the title, abstract, category, and subcategory of each article, and combines these fields into a single text input that describes the article's content. Instead of passing the texts through a neural network, the notebook uses a TfidfVectorizer from scikit-learn, which transforms each article into a sparse, high-dimensional vector whose entries weight each term by how often it appears in the article relative to how common it is across all articles. The TF-IDF vectors are then stored in a dictionary mapping each news_id to its feature vector, and the dictionary is saved as a .pkl file (news_id_to_feature_advanced.pkl) that can later be loaded into the ALMM model. Thus, while build_feat_BERT.ipynb uses learned semantic representations from a language model, build_feat.ipynb relies on simpler, frequency-based text features to describe articles.
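The TF-IDF pipeline above can be sketched as follows. The three toy articles are made up, the vectorizer is left at its default settings, and the vectors are stored as dense arrays; the notebook's actual parameters and storage format are not given here, so these choices are assumptions.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for rows of news.tsv (hypothetical content).
articles = [
    {"news_id": "N1", "title": "Team wins final", "abstract": "A late goal decides the match.",
     "category": "sports", "subcategory": "soccer"},
    {"news_id": "N2", "title": "Markets rally", "abstract": "Stocks close higher after earnings.",
     "category": "finance", "subcategory": "markets"},
    {"news_id": "N3", "title": "Coach resigns", "abstract": "The team announces a new coach.",
     "category": "sports", "subcategory": "soccer"},
]

# Combine the four fields into one text per article.
texts = [
    " ".join([a["title"], a["abstract"], a["category"], a["subcategory"]])
    for a in articles
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(texts)  # sparse matrix, one row per article

# Map each news_id to its (densified) TF-IDF row.
news_id_to_feature = {
    a["news_id"]: matrix[i].toarray().ravel() for i, a in enumerate(articles)
}

# Round-trip through pickle in memory; a real run would write to a .pkl file instead.
restored = pickle.loads(pickle.dumps(news_id_to_feature))
print(len(restored), restored["N1"].shape)
```

With default settings each row is L2-normalized, so articles of different lengths are comparable. Densifying is convenient for a small vocabulary but can be memory-heavy on the full dataset, where capping the vocabulary with max_features is one common option.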

confirm_plks shows how to access the data inside the .pkl files and confirms that the news article IDs exist in both the .pkl files and the CSV file.
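A check of this kind can be sketched with the standard library alone. The helper name find_missing_ids, the CSV column names last_news and next_news, and the in-memory stand-ins for the two files are assumptions for illustration; the actual notebook may do this differently.

```python
import csv
import io
import pickle


def find_missing_ids(features: dict, triplet_rows) -> list:
    """Return news IDs referenced by the triplets that have no feature vector."""
    referenced = set()
    for row in triplet_rows:
        referenced.add(row["last_news"])
        referenced.add(row["next_news"])
    return sorted(referenced - set(features))


# In-memory stand-in for a feature .pkl file (made-up IDs and vectors).
pkl_bytes = pickle.dumps({"N1": [0.1, 0.2], "N2": [0.3, 0.4]})
features = pickle.loads(pkl_bytes)

# In-memory stand-in for the triplets CSV file.
csv_text = "user,last_news,next_news\nU1,N1,N2\nU2,N2,N9\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

missing = find_missing_ids(features, rows)
print(missing)
```

An empty list means every article referenced in the CSV has a feature vector; any IDs it returns would need to be dropped from the triplets or added to the feature file before training.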


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Adaptive linear mapping model (ALMM) Replication

New Data Preparation

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Adaptive linear mapping model (ALMM) Replication

New Data Preparation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages