# Introduction

This notebook fine-tunes `bert-base-uncased` on the `ccdv/arxiv-classification` dataset, which assigns arXiv papers to one of 11 subject categories.

## Setup

In [1]:
!pip install -U transformers datasets evaluate accelerate
!pip install scikit-learn
!pip install tensorboard


## Imports

In [1]:
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    DataCollatorWithPadding,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    pipeline,
)

import evaluate
import glob
import numpy as np

## Hyperparameters

In [3]:
BATCH_SIZE = 32
NUM_PROCS = 32
LR = 0.00005
EPOCHS = 5
MODEL = 'bert-base-uncased'
OUT_DIR = 'arxiv_bert'

## Download the Dataset

In [4]:
train_dataset = load_dataset("ccdv/arxiv-classification", split='train')
valid_dataset = load_dataset("ccdv/arxiv-classification", split='validation')
test_dataset = load_dataset("ccdv/arxiv-classification", split='test')

In [5]:
print(train_dataset)
print(valid_dataset)
print(test_dataset)

Dataset({
    features: ['text', 'label'],
    num_rows: 28388
})
Dataset({
    features: ['text', 'label'],
    num_rows: 2500
})
Dataset({
    features: ['text', 'label'],
    num_rows: 2500
})


In [6]:
# Visualize a sample.
train_dataset[0]

{'text': 'Constrained Submodular Maximization via a\nNon-symmetric Technique\n\narXiv:1611.03253v1 [cs.DS] 10 Nov 2016\n\nNiv Buchbinder∗\n\nMoran Feldman†\n\nNovember 11, 2016\n\nAbstract\nThe study of combinatorial optimization problems with a submodular objective has attracted\nmuch attention in recent years. Such problems are important in both theory and practice because\ntheir objective functions are very general. Obtaining further improvements for many submodular\nmaximization problems boils down to finding better algorithms for optimizing a relaxation of\nthem known as the multilinear extension.\nIn this work we present an algorithm for optimizing the multilinear relaxation whose guarantee improves over the guarantee of the best previous algorithm (which was given by Ene\nand Nguyen (2016)). Moreover, our algorithm is based on a new technique which is, arguably,\nsimpler and more natural for the problem at hand. In a nutshell, previous algorithms for this\nproblem rely on symmet

## Dataset Information

In [7]:
id2label = {
    0: "math.AC",
    1: "cs.CV",
    2: "cs.AI",
    3: "cs.SY",
    4: "math.GR",
    5: "cs.CE",
    6: "cs.PL",
    7: "cs.IT",
    8: "cs.DS",
    9: "cs.NE",
    10: "math.ST"
}
label2id = {
    "math.AC": 0,
    "cs.CV": 1,
    "cs.AI": 2,
    "cs.SY": 3,
    "math.GR": 4,
    "cs.CE": 5,
    "cs.PL": 6,
    "cs.IT": 7,
    "cs.DS": 8,
    "cs.NE": 9,
    "math.ST": 10
}
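
The two dictionaries above are exact inverses of each other, so one can be derived from the other to keep them from drifting apart. A minimal sketch, assuming the same 11 labels in the same order (`LABELS` is a name introduced here for illustration):

```python
# Label names in ID order, copied from the mapping above.
LABELS = ["math.AC", "cs.CV", "cs.AI", "cs.SY", "math.GR", "cs.CE",
          "cs.PL", "cs.IT", "cs.DS", "cs.NE", "math.ST"]

id2label = dict(enumerate(LABELS))                          # 0 -> "math.AC", ...
label2id = {label: idx for idx, label in id2label.items()}  # inverse mapping

print(id2label[8], label2id["math.ST"])
```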

## Tokenize the Dataset

In [8]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [9]:
# Helper function for preprocessing.
def preprocess_function(examples):
    return tokenizer(
        examples["text"], 
        truncation=True, 
    )
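
With `truncation=True` and no explicit `max_length`, the tokenizer cuts each sequence to the model's maximum input length, which is 512 tokens for `bert-base-uncased`, so only the beginning of a long document reaches the model. The sketch below illustrates the idea on plain lists of token IDs. `truncate_ids` is a hypothetical helper written for this illustration, not the tokenizer's actual implementation.

```python
CLS_ID, SEP_ID = 101, 102  # [CLS] and [SEP] IDs in the bert-base-uncased vocabulary
MAX_LENGTH = 512           # maximum input length of bert-base-uncased

def truncate_ids(token_ids, max_length=MAX_LENGTH):
    """Keep the leading tokens so that [CLS] + tokens + [SEP] fits in max_length."""
    kept = token_ids[: max_length - 2]  # reserve two slots for [CLS] and [SEP]
    return [CLS_ID] + kept + [SEP_ID]

long_doc = list(range(1000, 3000))  # stand-in for 2,000 word-piece IDs
short_doc = [2023, 2003, 2460]      # stand-in for a three-token text

print(len(truncate_ids(long_doc)), len(truncate_ids(short_doc)))
```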

In [10]:
tokenized_train = train_dataset.map(
    preprocess_function, 
    batched=True,
    batch_size=BATCH_SIZE,
    num_proc=NUM_PROCS
)


In [11]:
tokenized_valid = valid_dataset.map(
    preprocess_function, 
    batched=True,
    batch_size=BATCH_SIZE,
    num_proc=NUM_PROCS
)


In [12]:
tokenized_test = test_dataset.map(
    preprocess_function, 
    batched=True,
    batch_size=BATCH_SIZE,
    num_proc=NUM_PROCS
)


In [13]:
# Initialize data collator.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
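
`DataCollatorWithPadding` pads each batch to the length of its longest sequence at collation time, rather than padding the whole dataset to one fixed length up front. The following is a simplified pure-Python illustration of that idea, not the library's code. `pad_batch` is a hypothetical helper, and the pad ID of 0 is the `[PAD]` ID in the `bert-base-uncased` vocabulary.

```python
PAD_ID = 0  # [PAD] token ID in the bert-base-uncased vocabulary

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one in the batch and build attention masks."""
    longest = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}

padded = pad_batch([[101, 2023, 102], [101, 2023, 2003, 2460, 102]])
print(padded["input_ids"])
print(padded["attention_mask"])
```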

## Sample Tokenization Example

In [14]:
tokenized_sample = preprocess_function(train_dataset[0])

In [15]:
print(tokenized_sample)
print(f"Length of tokenized IDs: {len(tokenized_sample.input_ids)}")
print(f"Length of attention mask: {len(tokenized_sample.attention_mask)}")

{'input_ids': [101, 27570, 4942, 5302, 8566, 8017, 20446, 3989, 3081, 1037, 2512, 1011, 19490, 6028, 12098, 9048, 2615, 1024, 28769, 1012, 6021, 17788, 2509, 2615, 2487, 1031, 20116, 1012, 16233, 1033, 2184, 13292, 2355, 9152, 2615, 20934, 2818, 8428, 4063, 30125, 17866, 26908, 1526, 2281, 2340, 1010, 2355, 10061, 1996, 2817, 1997, 22863, 23207, 4818, 20600, 3471, 2007, 1037, 4942, 5302, 8566, 8017, 7863, 2038, 6296, 2172, 3086, 1999, 3522, 2086, 1012, 2107, 3471, 2024, 2590, 1999, 2119, 3399, 1998, 3218, 2138, 2037, 7863, 4972, 2024, 2200, 2236, 1012, 11381, 2582, 8377, 2005, 2116, 4942, 5302, 8566, 8017, 20446, 3989, 3471, 26077, 2015, 2091, 2000, 4531, 2488, 13792, 2005, 23569, 27605, 6774, 1037, 23370, 1997, 2068, 2124, 2004, 1996, 4800, 4179, 2906, 5331, 1012, 1999, 2023, 2147, 2057, 2556, 2019, 9896, 2005, 23569, 27605, 6774, 1996, 4800, 4179, 2906, 23370, 3005, 11302, 24840, 2058, 1996, 11302, 1997, 1996, 2190, 3025, 9896, 1006, 2029, 2001, 2445, 2011, 4372, 2063, 1998, 16577, 1

## Evaluation Metrics

In [16]:
accuracy = evaluate.load('accuracy')

In [17]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
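
`compute_metrics` receives the raw logits, one row per example and one column per class, so the `argmax` over `axis=1` picks the highest-scoring class before accuracy is computed. A small NumPy-only sketch with made-up logits shows the same computation without the `evaluate` dependency:

```python
import numpy as np

# Made-up logits for 4 examples and 3 classes (illustrative values only).
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.4, 0.9],
    [1.2, 1.1, 0.0],
])
labels = np.array([0, 1, 1, 0])

predictions = np.argmax(logits, axis=1)                 # highest-scoring class per row
accuracy_value = float((predictions == labels).mean())  # fraction of correct predictions

print(predictions, accuracy_value)
```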

## Model

In [18]:
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL, 
    num_labels=11,
    id2label=id2label, 
    label2id=label2id,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [19]:
# Total parameters and trainable parameters.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,} total parameters.")
total_trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad)
print(f"{total_trainable_params:,} training parameters.")

109,490,699 total parameters.
109,490,699 training parameters.
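
The total can be reproduced by hand from the architecture. The sketch below assumes the standard `bert-base-uncased` configuration (vocabulary of 30,522, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 token types) plus an 11-way linear classifier. These values come from the published model configuration, not from this notebook.

```python
VOCAB, HIDDEN, LAYERS, FFN, POSITIONS, TYPES, NUM_LABELS = 30522, 768, 12, 3072, 512, 2, 11

def linear(n_in, n_out):
    """Parameters of a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position, and token-type embeddings followed by one LayerNorm.
embeddings = (VOCAB + POSITIONS + TYPES) * HIDDEN + layer_norm
encoder_layer = (
    4 * linear(HIDDEN, HIDDEN)  # query, key, value, and attention output projections
    + layer_norm
    + linear(HIDDEN, FFN)       # feed-forward expansion
    + linear(FFN, HIDDEN)       # feed-forward projection
    + layer_norm
)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)

total = embeddings + LAYERS * encoder_layer + pooler + classifier
print(f"{total:,} total, of which {classifier:,} belong to the classifier head")
```

Under these assumptions the classifier head is the only part that is not loaded from the pretrained checkpoint, which is consistent with the `classifier.weight` and `classifier.bias` warning printed when the model was created.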


## Training Arguments

In [20]:
training_args = TrainingArguments(
    output_dir=OUT_DIR,
    learning_rate=LR,
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    num_train_epochs=EPOCHS,
    weight_decay=0.01,
    evaluation_strategy="epoch",  # renamed to `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    save_total_limit=3,
    report_to='tensorboard',
    fp16=True
)

## Training

In [21]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_valid,
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.


In [22]:
history = trainer.train()

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


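| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.8219        | 0.498572        | 0.8436   |
| 2     | 0.3947        | 0.407233        | 0.8716   |
| 3     | 0.2837        | 0.411557        | 0.874    |
| 4     | 0.2117        | 0.436728        | 0.8704   |
| 5     | 0.1696        | 0.46771         | 0.8636   |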
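
Training loss keeps falling while validation loss bottoms out at epoch 2, which suggests the model starts to overfit after that point. With `load_best_model_at_end=True` and no `metric_for_best_model` set, `Trainer` selects the checkpoint by validation loss by default. The sketch below only re-reads the logged numbers to show which epoch each criterion would pick:

```python
# (epoch, training loss, validation loss, accuracy) copied from the log above.
log = [
    (1, 0.8219, 0.498572, 0.8436),
    (2, 0.3947, 0.407233, 0.8716),
    (3, 0.2837, 0.411557, 0.8740),
    (4, 0.2117, 0.436728, 0.8704),
    (5, 0.1696, 0.467710, 0.8636),
]

best_by_loss = min(log, key=lambda row: row[2])[0]      # lowest validation loss
best_by_accuracy = max(log, key=lambda row: row[3])[0]  # highest validation accuracy

print(best_by_loss, best_by_accuracy)
```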


## Evaluate

In [23]:
trainer.evaluate(tokenized_test)



{'eval_loss': 0.38107171654701233,
 'eval_accuracy': 0.8796,
 'eval_runtime': 14.9454,
 'eval_samples_per_second': 167.276,
 'eval_steps_per_second': 2.676,
 'epoch': 5.0}

## Inference

In [24]:
print(history.global_step)

2220
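
The step count follows from the dataset size and the effective batch size. Assuming no gradient accumulation and that the last partial batch is kept, 28,388 training examples over 5 epochs give 2,220 steps when the per-device batch of 32 runs on two devices, and 4,440 steps on a single device. The device counts are assumptions, since the notebook does not state the hardware.

```python
import math

def total_steps(num_examples, per_device_batch, epochs, num_devices=1):
    """Optimizer steps for a run with no gradient accumulation, keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_examples / (per_device_batch * num_devices))
    return steps_per_epoch * epochs

print(total_steps(28388, 32, 5, num_devices=2))
print(total_steps(28388, 32, 5, num_devices=1))
```

A checkpoint directory is named after the global step at which it was saved, so the final checkpoint name depends on how many devices the run used.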


In [2]:
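# Load a saved checkpoint. The step number is hard-coded: point it at a checkpoint
# that exists in the output directory (the run above ended at global step 2220).
# model = AutoModelForSequenceClassification.from_pretrained(f"{OUT_DIR}/checkpoint-{history.global_step}")
model = AutoModelForSequenceClassification.from_pretrained("arxiv_bert/checkpoint-4440")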

In [3]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
classify = pipeline(task='text-classification', model=model, tokenizer=tokenizer)
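
For a single-label classifier like this one, the `text-classification` pipeline turns the model's logits into probabilities with a softmax and reports the top class together with its probability as `score`. A NumPy sketch of that post-processing step, using made-up logits and only three of the labels for brevity:

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)  # subtract the max to avoid overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

labels = ["math.AC", "cs.CV", "cs.AI"]  # subset of the labels, for illustration
logits = np.array([4.0, 0.5, -1.0])     # made-up model outputs

probs = softmax(logits)
top = int(np.argmax(probs))
result = [{"label": labels[top], "score": float(probs[top])}]

print(result)
```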

In [4]:
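all_files = glob.glob('inference_data/*')
for file_name in all_files:
    # Read the whole file; the context manager closes the handle afterwards.
    with open(file_name) as file:
        content = file.read()
    print(content)
    result = classify(content)
    print('PRED: ', result)
    # The ground-truth label is the part of the file name after the last underscore.
    print('GT: ', file_name.split('_')[-1].split('.txt')[0])
    print('\n')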

For finitely generated modules M and N over a commutative Noetherian local ring R, we give various sufficient criteria for detecting freeness of M or N via vanishing of some Ext modules ExtiR(M,N) and finiteness of certain homological dimension of HomR(M,N). Some of our results provide partial progress towards answering a question of Ghosh-Takahashi and also generalize their main results in many ways, for instance, by reducing the number of vanishing. As some applications, we provide affirmative answers to two questions raised by Tony Se on n-semidualizing modules. In particular, we establish that for normal domains which satisfy Serre's condition (S3) and are locally Gorenstein in co-dimension two, the class of 1-semidualizing modules form a subgroup of the divisor class group. These two groups coincide when, in addition, the ring is locally regular in co-dimension two.
PRED:  [{'label': 'math.AC', 'score': 0.9966996312141418}]
GT:  math.ac


In this paper we produce the first known f