Device to Cloud Hackathon in Mexico City
Recently, CTO and VP of Engineering Jonathan Khoo had the opportunity to participate in a Microsoft-led event in Mexico City focused on device-to-cloud AI (aka "hybrid AI"). His role was to design and lead a hands-on lab demonstrating how cloud and on-device AI can work together seamlessly.
The session followed excellent presentations from Azim Verjee (Global Black Belt North America, Surface devices) on Surface and NPUs, and from Beth Pan (Windows Partner & Developer Experiences) on the Windows Copilot Runtime.
Photo courtesy of Azim Verjee
The lab showcased how the same AI task can be executed both in the cloud and locally, using the same ONNX model in each scenario. For this, we built a chatbot that processes user input, classifies the intent using AI, and responds as though it were an employee in the appropriate department. The classification ran against the same ONNX model file in the cloud and on the device, while the chat response was generated either by a large language model in the cloud (GPT-4o-mini) or by a small model on the device (Phi Silica).
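To make the flow concrete, here is a minimal runnable sketch of that routing logic. Everything in it (department labels, function names, the keyword-based stand-ins) is illustrative rather than the lab's actual code; the real lab swaps the stubs for the ONNX classifier and the cloud/local language models:

```python
# Illustrative sketch of the chatbot flow, not the lab's actual code:
# classify the user's intent, then reply in the voice of that department.

DEPARTMENTS = ["billing", "technical support", "sales", "returns"]

def classify_intent(text: str) -> str:
    """Stand-in for the zero-shot ONNX classifier shown in later snippets."""
    lowered = text.lower()
    for dept in DEPARTMENTS:
        if dept.split()[0] in lowered:
            return dept
    return "technical support"  # default bucket

def generate_reply(department: str, text: str) -> str:
    """Stand-in for GPT-4o-mini (cloud) or Phi Silica (on-device)."""
    return f"[{department}] Thanks for reaching out! We'll look into: {text!r}"

if __name__ == "__main__":
    message = "There is a problem with my billing statement"
    print(generate_reply(classify_intent(message), message))
```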
We began by discussing ONNX models and the ONNX Runtime, and how to use Jupyter notebooks and Python to perform inference with Hugging Face models: specifically, the fine-tuned DeBERTa model by Moritz Laurer and its optimized ONNX version, which we used for the zero-shot classification task.
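As a taste of that step, zero-shot classification with the Hugging Face transformers pipeline takes only a few lines. The model ID below is one of Moritz Laurer's public NLI checkpoints and may differ from the exact one used in the lab:

```python
# Zero-shot intent classification with a fine-tuned DeBERTa NLI model.
# The model ID is an assumption; see the lab repo for the exact checkpoint.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli",
)

result = classifier(
    "I was double-charged on my last invoice",
    candidate_labels=["billing", "technical support", "sales", "returns"],
)
print(result["labels"][0], round(result["scores"][0], 3))  # top label and its score
```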
The lab walked through converting a Hugging Face model to ONNX, running it on-device with the QNN Execution Provider to tap into NPU acceleration, and then deploying the pre-optimized version to Azure Machine Learning. We compared performance and results across three environments: the original HF model, the ONNX model on-device, and the ONNX model deployed in Azure.
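As a rough sketch of the on-device step, the export can be done with Optimum's CLI and the model loaded through ONNX Runtime with the QNN Execution Provider (the onnxruntime-qnn package on Windows on Arm). The backend path and model ID below are assumptions; check the lab repo for the exact values:

```python
# Export the Hugging Face model to ONNX first (run once, from a shell):
#   optimum-cli export onnx --model MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli onnx_model/
#
# Note: QNN's HTP (NPU) backend generally expects a quantized (QDQ) model;
# unsupported ops fall back to the CPU execution provider.
import onnxruntime as ort
from transformers import AutoTokenizer

# Ask for the NPU via the QNN Execution Provider, with CPU as fallback.
session = ort.InferenceSession(
    "onnx_model/model.onnx",
    providers=["QNNExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}, {}],
)

tokenizer = AutoTokenizer.from_pretrained("MoritzLaurer/DeBERTa-v3-base-mnli-fever-anli")

# Zero-shot via NLI: pair the user input (premise) with a hypothesis per label.
premise = "I was double-charged on my last invoice"
hypothesis = "This example is about billing."
encoded = tokenizer(premise, hypothesis, return_tensors="np")

# Feed only the inputs the exported graph declares, as int64 tensors.
graph_inputs = {i.name for i in session.get_inputs()}
ort_inputs = {k: v.astype("int64") for k, v in encoded.items() if k in graph_inputs}

logits = session.run(None, ort_inputs)[0]
# Which index means "entailment" depends on the model's id2label config.
print(logits)
```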
Next, we built a sample WinUI3 application that used evaluators—logic that determines whether to use cloud or local AI based on real-world criteria like cost, device capability, connectivity, or privacy. In our lab, we focused on two: connectivity (falling back to local if the Azure endpoint was unavailable or slow) and privacy (keeping data local if the input contained sensitive patterns like account numbers). Similar to the Python code, our Windows app also used the QNN Execution Provider to harness the NPU.
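The lab app itself is C# and WinUI3, but the evaluator logic is small enough to render here in Python. The endpoint URL, timeout, and the account-number regex are all illustrative placeholders:

```python
import re
import requests

AZURE_ENDPOINT = "https://example-endpoint.inference.ml.azure.com/score"  # placeholder
SENSITIVE = re.compile(r"\b\d{8,16}\b")  # crude stand-in for account-number patterns

def should_use_cloud(user_input: str, timeout_s: float = 2.0) -> bool:
    """Return True to route to Azure, False to stay on-device."""
    # Privacy evaluator: keep inputs with sensitive-looking patterns local.
    if SENSITIVE.search(user_input):
        return False
    # Connectivity evaluator: fall back to local if the endpoint is unreachable
    # or slow to respond (any HTTP status counts as "reachable" here).
    try:
        requests.head(AZURE_ENDPOINT, timeout=timeout_s)
    except requests.RequestException:
        return False
    return True

print(should_use_cloud("My account number is 123456789"))  # False: stays local
print(should_use_cloud("How do I reset my password?"))     # depends on connectivity
```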
We concluded the lab by testing the application live, showing how it intelligently shifted between cloud and local pipelines. It was exciting to show just how straightforward it can be to repurpose existing AI models for hybrid use cases—and to take advantage of modern NPU performance at the edge.
The full source code, Jupyter notebook, and slides are available at github.com/jonathankhootek/HelpChatLab_public. Make sure to check the README for setup instructions.
P.S. The Microsoft Mexico office is a stunner—full of artisanal crafts and massive murals. If you get the chance, ask for a tour of the outdoor gardens.
P.P.S. Grab a pastor taco (or three) at El Califa next door. You're welcome.