AI Evaluation
arXiv:2602.16042
Patent Pending
AI-CARE
Carbon-Aware Reporting Evaluation for ML Models
A reporting-centric evaluation framework I developed that treats energy use and carbon emissions as first-class evaluation quantities. It enables transparent, reproducible comparison of ML models under fixed experimental conditions, without modifying their architectures.
Key Contributions
- Carbon–Accuracy Tradeoff Curves (CATC) for visual benchmarking
- Scalar Carbon-Aware Score (SCAS) for single-metric model ranking
- Benchmarks MLP, CNN, Transformer, MLP-Mixer, MobileNetV2, ResNet-18
- Reproducible CSV outputs + publication-quality figures
Python · PyTorch · CodeCarbon · Matplotlib · Apache 2.0
Experimental Setup
- Adam · LR 1e-3 · Batch 64 · 10 epochs
- Grid carbon intensity: 400 gCO₂/kWh (fixed)
- CPU-only · Deterministic · Reproducible
- MNIST → Fashion-MNIST → CIFAR-10 → CIFAR-100 → ImageNet-100
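As a rough illustration of the reporting idea, the sketch below converts measured energy to emissions at the fixed 400 gCO₂/kWh intensity and extracts the non-dominated points that a carbon–accuracy tradeoff curve would highlight. The `Run` record, the function names, and the example values are assumptions made for illustration. They are not the AI-CARE API or reported results, and the SCAS formula is not reproduced here.

```python
from dataclasses import dataclass

# Fixed grid carbon intensity from the experimental setup (gCO2 per kWh).
GRID_INTENSITY_G_PER_KWH = 400.0


@dataclass(frozen=True)
class Run:
    """Hypothetical record for one trained model (not the framework's schema)."""
    model: str
    accuracy: float    # test accuracy in [0, 1]
    energy_kwh: float  # measured training energy in kWh


def carbon_g(energy_kwh: float, intensity: float = GRID_INTENSITY_G_PER_KWH) -> float:
    """Convert energy (kWh) to emissions (gCO2) at a fixed grid intensity."""
    return energy_kwh * intensity


def pareto_front(runs: list[Run]) -> list[Run]:
    """Keep runs whose accuracy strictly exceeds every run with equal or lower carbon."""
    # Sort by carbon ascending; break ties by putting the more accurate run first.
    ordered = sorted(runs, key=lambda r: (carbon_g(r.energy_kwh), -r.accuracy))
    front: list[Run] = []
    best_accuracy = float("-inf")
    for run in ordered:
        if run.accuracy > best_accuracy:
            front.append(run)
            best_accuracy = run.accuracy
    return front


if __name__ == "__main__":
    # Placeholder values for illustration only, not measured results.
    runs = [
        Run("model_a", 0.90, 0.010),
        Run("model_b", 0.92, 0.020),
        Run("model_c", 0.91, 0.030),
    ]
    for run in pareto_front(runs):
        print(run.model, run.accuracy, carbon_g(run.energy_kwh))
```

Because the intensity is held constant, carbon is a linear rescaling of energy, so model rankings by carbon and by energy coincide. The fixed value keeps comparisons reproducible across machines and regions.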
LLM Research
arXiv:2602.16042
AI-CARE LLM
Token vs Energy Billing Mismatch
Extends my AI-CARE framework with a pipeline that studies the mismatch between token-based LLM pricing and actual energy footprint. It runs a fixed 30-prompt suite against NRP-hosted models for controlled comparison.
Highlights
- Token–energy correlation, variance & rank-shift analysis
- 4 NRP models: Qwen3-small, GPT-OSS, Qwen3, MiniMax-M2
- TikZ/PGFPlots figures for manuscript-ready output
Python · NRP Nautilus · OpenAI API · TikZ
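The token–energy correlation and rank-shift analyses can be sketched roughly as follows, assuming per-model token counts and energy readings have already been collected. The function names, the units, and the tie-breaking rule are illustrative assumptions, not the pipeline's actual code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def ranks(values):
    """Rank 1 = smallest value; ties are broken by input order (a simplification)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def rank_shift(tokens_by_model, energy_by_model):
    """Energy rank minus token rank for each model.

    A positive shift means the model ranks as more expensive by energy
    than its token count alone would suggest.
    """
    names = list(tokens_by_model)
    token_rank = ranks([tokens_by_model[m] for m in names])
    energy_rank = ranks([energy_by_model[m] for m in names])
    return {m: e - t for m, t, e in zip(names, token_rank, energy_rank)}


if __name__ == "__main__":
    # Placeholder numbers for illustration only, not measurements.
    tokens = {"model_a": 100, "model_b": 200, "model_c": 300}
    energy = {"model_a": 5.0, "model_b": 2.0, "model_c": 9.0}
    print(pearson(list(tokens.values()), list(energy.values())))
    print(rank_shift(tokens, energy))
```

A correlation near 1 with zero rank shifts would mean token billing tracks energy well. Nonzero shifts mark the models for which token-based pricing misstates relative energy cost.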
LLM Agents
NeurIPS Extension
Retrieval-Augmented Reflexion
RAR — Verbal RL with Episodic Memory
My extension of the Reflexion framework. It uses Maximum Marginal Relevance (MMR) to retrieve semantically similar yet diverse past trajectories from an episodic store, which serve as contrastive context for richer, failure-aware agent reflections.
Evaluated Strategies
- Baselines: Simple, ReAct, CoT+GT, Reflexion, ExpeL
- RAR (mine): semantic sim + error-class + MMR diversification
- HotPotQA (100×5), ALFWorld (134×10), HumanEval Hard (50)
Python · Kubernetes · FAISS/MMR · NRP Nautilus
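For readers unfamiliar with MMR, here is a minimal sketch of the standard greedy selection step over embedding vectors. It assumes trajectories have already been embedded, uses plain Python lists in place of a FAISS index, and leaves out the error-class signal. The names `mmr_select` and `lam` are illustrative and not taken from the RAR codebase.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(query, candidates, k=3, lam=0.7):
    """Greedy MMR: return indices of up to k candidates.

    Each step picks the candidate maximizing
        lam * sim(query, c) - (1 - lam) * max(sim(c, s) for s already selected)
    so lam=1.0 is pure relevance and smaller lam favors diversity.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    query = [1.0, 0.0]
    store = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
    print(mmr_select(query, store, k=2, lam=1.0))  # relevance only
    print(mmr_select(query, store, k=2, lam=0.3))  # diversity-weighted
```

With a low `lam`, the second pick moves away from near-duplicates of the first, which is what lets the retrieved trajectories act as contrasting rather than redundant context.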
Vision-Language
Video Action Recognition
FlowCLIP
Optical-Flow Encodings for CLIP Video Action Recognition
Investigates how different optical-flow encodings augment CLIP for video action recognition, comparing flow-based motion representations against RGB-only baselines to study what temporal motion cues add to vision-language transfer.
Python · PyTorch · CLIP · Video
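As one example of what a flow encoding can look like, the sketch below maps a per-pixel flow vector to an RGB color using the common direction-to-hue, magnitude-to-brightness convention, which yields an image-like input an RGB encoder can consume. The description above does not say which encodings FlowCLIP compares, so this is an assumed illustration. `flow_to_rgb`, `encode_flow_field`, and `max_mag` are hypothetical names.

```python
import colorsys
import math


def flow_to_rgb(dx, dy, max_mag=10.0):
    """Encode one flow vector as an RGB triple with components in [0, 1].

    Direction sets the hue, magnitude (clipped at max_mag) sets the
    brightness, and saturation is fixed at 1.
    """
    angle = math.atan2(dy, dx)                # in [-pi, pi]
    hue = (angle + math.pi) / (2 * math.pi)   # normalized to [0, 1]
    value = min(math.hypot(dx, dy) / max_mag, 1.0)
    return colorsys.hsv_to_rgb(hue, 1.0, value)


def encode_flow_field(flow, max_mag=10.0):
    """Apply flow_to_rgb to a 2-D grid of (dx, dy) pairs given as nested lists."""
    return [[flow_to_rgb(dx, dy, max_mag) for dx, dy in row] for row in flow]


if __name__ == "__main__":
    # A tiny 1x3 flow field with made-up displacements.
    field = [[(0.0, 0.0), (3.0, 4.0), (-10.0, 0.0)]]
    for rgb in encode_flow_field(field)[0]:
        print(rgb)
```

Static pixels map to black and fast motion saturates at full brightness, so the choice of `max_mag` controls how much of the motion range stays distinguishable.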
Multimodal AI
Early Detection
Multimodal Precursor Detection
Fusing Heterogeneous Signals for Early Event Detection
A multimodal deep-learning pipeline that fuses heterogeneous signals (e.g., audio, text, and sensor streams) to detect precursor events early, enabling timely alerts and downstream decision support.
Highlights
- Multimodal fusion across audio, text, and sensor streams
- Early-warning detection enabling timely alerts & decision support
- Deep-learning pipeline with modular signal encoders
Python · PyTorch · Multimodal Fusion