AI Breakdown
The podcast where we use AI to break down recent AI papers and provide simplified explanations of intricate AI topics for educational purposes. The content presented here is generated automatically using LLM and text-to-speech technologies. While every effort is made to ensure accuracy, any misrepresentations or inaccuracies are unintentional and reflect the limits of this evolving technology. We value your feedback to help us enhance the podcast and provide you with the best possible learning experience.
Episodes

Sunday May 07, 2023
In this episode we discuss Shakes on a Plane: Unsupervised Depth Estimation from Unstabilized Photography
by Authors:
- Ilya Chugunov
- Yuxuan Zhang
- Felix Heide
Affiliation:
- Princeton University

The paper presents a method for recovering high-quality scene depth from long-burst sequences captured by mobile burst-photography pipelines. The researchers show that natural hand tremor alone provides enough parallax to recover scene depth. They introduce a test-time optimization approach that jointly estimates scene depth and camera motion by fitting a neural RGB-D representation to the long-burst data. The method uses a plane-plus-depth model, trained end-to-end, and performs coarse-to-fine refinement by controlling which multi-resolution volume features the network can access at each stage of training. The results demonstrate geometrically accurate depth reconstructions with no additional hardware and no separate data pre-processing or pose-estimation steps.
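The coarse-to-fine idea can be pictured as a schedule that gradually unlocks finer resolution levels as optimization progresses. The sketch below is purely illustrative: the function name, number of levels, and linear schedule are our invention, not the paper's implementation.

```python
import numpy as np

# Toy sketch of coarse-to-fine feature gating: early in training only the
# coarsest multi-resolution volume features are visible to the network;
# finer levels are unlocked linearly as training progresses.

def active_features(step, total_steps, levels=4):
    """Return a 0/1 mask over resolution levels; coarse levels open first."""
    frac = step / total_steps
    unlocked = 1 + int(frac * (levels - 1))   # at least the coarsest level
    return np.array([1 if l < unlocked else 0 for l in range(levels)])
```

During optimization, the mask would multiply (or select) the per-level feature grids, so gradients initially shape only the coarse geometry before fine detail is fit.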

Saturday May 06, 2023
In this episode we discuss Visual Programming: Compositional visual reasoning without training
by Authors:
- Tanmay Gupta
- Aniruddha Kembhavi
Affiliation:
- PRIOR @ Allen Institute for AI

The paper introduces VISPROG, a neuro-symbolic approach to solving complex visual tasks from natural language instructions. The system generates Python-like modular programs that are executed to produce both the solution and a comprehensive rationale. The approach avoids task-specific training, relying instead on the in-context learning ability of large language models. The paper demonstrates the flexibility of VISPROG on four diverse tasks, including image editing and factual knowledge object tagging, and shows its potential for expanding AI systems to perform complex tasks.
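The execution model can be illustrated with a tiny interpreter: an LLM emits a line-per-step program, and each step is dispatched to a vision module, with intermediate results named so they form the rationale. Everything below (module names, program syntax, stand-in functions) is a hypothetical sketch, not VISPROG's released code.

```python
# Hypothetical VISPROG-style sketch: a line-per-step program (as an LLM
# might emit it) is executed by dispatching each step to a named module.

def crop(image, box):                 # stand-in "vision modules" that just
    return f"crop({image},{box})"     # tag strings, for illustration

def classify(region):
    return "dog" if "dog" in region else "unknown"

MODULES = {"CROP": crop, "CLASSIFY": classify}

def run_program(program: str, env: dict) -> dict:
    """Execute 'OUT = MODULE(arg1, arg2, ...)' lines against named results."""
    for line in program.strip().splitlines():
        out, call = line.split("=", 1)
        name, args = call.split("(", 1)
        args = [env[a.strip()] for a in args.rstrip(")").split(",") if a.strip()]
        env[out.strip()] = MODULES[name.strip()](*args)
    return env                        # env doubles as the step-by-step rationale

# A program an LLM might emit for "what animal is in the box?"
program = """
REGION = CROP(IMAGE, BOX)
LABEL = CLASSIFY(REGION)
"""
result = run_program(program, {"IMAGE": "photo_with_dog", "BOX": "b0"})
```

Because every intermediate (`REGION`, `LABEL`) is kept in the environment, the trace of module calls serves as the interpretable rationale the paper describes.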

Saturday May 06, 2023
In this episode we discuss OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation by Tong Wu, Jiarui Zhang, Xiao Fu, Yuxin Wang, Jiawei Ren, Liang Pan, Wayne Wu, Lei Yang, Jiaqi Wang, Chen Qian, Dahua Lin, Ziwei Liu. The paper proposes OmniObject3D, a large-vocabulary 3D object dataset containing 6,000 high-quality real-scanned objects across 190 daily categories with rich annotations. The dataset aims to facilitate the development of 3D perception, reconstruction, and generation in the real world and is evaluated on four benchmarks: robust 3D perception, novel-view synthesis, neural surface reconstruction, and 3D object generation. Extensive studies on these benchmarks reveal new observations, challenges, and opportunities for future research in realistic 3D vision.

Saturday May 06, 2023
In this episode we discuss What Can Human Sketches Do for Object Detection?
by Authors:
- Pinaki Nath Chowdhury
- Ayan Kumar Bhunia
- Aneeshan Sain
- Subhadeep Koley
- Tao Xiang
- Yi-Zhe Song
Affiliation:
- SketchX, CVSSP, University of Surrey, United Kingdom

The paper proposes a new object detection framework that uses sketches to detect objects. It is the first attempt to harness the expressiveness of sketches for object detection, with instance-aware and part-aware detection capabilities. The model works without knowing object categories beforehand and without requiring bounding boxes or class labels. The framework combines an existing sketch-based image retrieval (SBIR) model with the generalization ability of CLIP to build highly generalizable sketch and photo encoders that can be adapted for object detection. The proposed framework outperforms both supervised and weakly-supervised object detectors on zero-shot setups on standard object detection datasets such as PASCAL-VOC and MS-COCO.

Saturday May 06, 2023
In this episode we discuss Efficient Multimodal Fusion via Interactive Prompting
by Authors:
- Yaowei Li
- Ruijie Quan
- Linchao Zhu
- Yi Yang
Affiliations:
- Yaowei Li: ReLER, AAII, University of Technology Sydney
- Ruijie Quan, Linchao Zhu, Yi Yang: CCAI, Zhejiang University
Contact information:
- Yaowei Li: yaowei.li@uts.edu.au
- Ruijie Quan, Linchao Zhu, Yi Yang: {quanruijie, zhulinchao, yangyics}@zju.edu.cn

The paper proposes PMF, an efficient and flexible method for fusing unimodally pre-trained transformers. PMF disentangles vanilla prompts into three types that learn different optimization objectives for multimodal learning, and adds prompt vectors only on the deep layers of the unimodal transformers, significantly reducing training memory usage. Experimental results show that the method achieves performance comparable to several other multimodal fine-tuning methods with less than 3% trainable parameters and up to 66% savings in training memory usage.
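The "prompts only on deep layers" idea can be sketched numerically: prompt tokens are prepended from some fusion layer onward while the backbone stays frozen, so the trainable fraction is tiny. The layer counts, dimensions, and parameter estimate below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Illustrative sketch (not the paper's code): learnable prompt vectors are
# attached only to the deep layers of a frozen transformer backbone, so the
# trainable parameter count stays small relative to the backbone.

n_layers, d_model, n_prompts = 12, 768, 4
fusion_start = 8                                     # prompts on layers 8..11 only

frozen_params = n_layers * (4 * d_model * d_model)   # rough per-layer weight count
prompts = {l: np.zeros((n_prompts, d_model))         # learnable prompt tokens
           for l in range(fusion_start, n_layers)}

def forward_layer(l, tokens):
    if l in prompts:                  # prepend prompt tokens on deep layers only
        tokens = np.concatenate([prompts[l], tokens], axis=0)
    return tokens                     # attention/MLP omitted in this sketch

trainable = sum(p.size for p in prompts.values())
ratio = trainable / (trainable + frozen_params)
```

With these toy numbers the trainable fraction is well under the paper's reported 3%, and shallow layers run without any prompt tokens, which is also where the memory savings come from (no gradients flow into the early backbone).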

Saturday May 06, 2023
In this episode we discuss Query-Dependent Video Representation for Moment Retrieval and Highlight Detection
by Authors: WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo.
Affiliation:
- WonJun Moon, Sangeek Hyun, and Jae-Pil Heo: Sungkyunkwan University.
- SangUk Park and Dongchan Park: Pyler

The paper presents Query-Dependent DETR (QD-DETR), a detection transformer tailored for video moment retrieval and highlight detection (MR/HD). The authors identify a key shortcoming of existing transformer-based models: they fail to fully exploit the information in a given query. To address this, QD-DETR introduces cross-attention layers that explicitly inject query context into the video representation, and trains on negative video-query pairs to encourage precise accordance between queries and videos. QD-DETR outperforms state-of-the-art methods on several datasets.
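The cross-attention injection can be sketched as video clip features attending over text-query tokens, so every clip representation becomes query-conditioned before retrieval. This is a minimal single-head sketch under our own toy dimensions, not the released QD-DETR code.

```python
import numpy as np

# Minimal cross-attention sketch: video clip features (queries) attend to
# text tokens (keys/values), injecting query context into each clip.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(video, text):
    """video: (T, d) clip features; text: (L, d) query-token features."""
    d = video.shape[-1]
    attn = softmax(video @ text.T / np.sqrt(d))   # (T, L) clip-to-word weights
    return video + attn @ text                    # residual query injection

rng = np.random.default_rng(0)
video, text = rng.normal(size=(8, 16)), rng.normal(size=(5, 16))
fused = cross_attend(video, text)
```

Training on negative video-query pairs would then push these fused representations to score low when the text does not describe the video, which is the "precise accordance" objective the summary mentions.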

Saturday May 06, 2023
In this episode we discuss Sketch2Saliency: Learning to Detect Salient Objects from Human Drawings
by Authors:
- Ayan Kumar Bhunia
- Subhadeep Koley
- Amandeep Kumar
- Aneeshan Sain
- Pinaki Nath Chowdhury
- Tao Xiang
- Yi-Zhe Song
Affiliations:
- SketchX, CVSSP, University of Surrey, United Kingdom
- iFlyTek-Surrey Joint Research Centre on Artificial Intelligence

The paper examines the saliency of human sketches and proposes using sketches as weak labels for detecting salient objects in images. The method uses a photo-to-sketch generation model with a 2D attention mechanism to generate sequential sketch coordinates for a given photo. The attention maps accumulated across time steps give rise to salient regions, and experiments show that the sketch-based saliency detection model performs competitively with the state-of-the-art.

Saturday May 06, 2023
In this episode we discuss Defending Against Patch-based Backdoor Attacks on Self-Supervised Learning
by Authors:
- Ajinkya Tejankar
- Maziar Sanjabi
- Qifan Wang
- Sinong Wang
- Hamed Firooz
- Hamed Pirsiavash
- Liang Tan
Affiliations:
- University of California, Davis (Ajinkya Tejankar, Hamed Pirsiavash)
- Meta AI (Maziar Sanjabi, Qifan Wang, Sinong Wang, Hamed Firooz, Liang Tan)

The paper examines a vulnerability of self-supervised learning to backdoor attacks delivered through patch-based data poisoning. To defend against such attacks, the authors propose a three-step pipeline: train a model on the poisoned data, run a defense algorithm called PatchSearch to remove poisoned samples from the training set, and retrain on the cleaned-up training set. The results show that PatchSearch is an effective defense, outperforming baselines and state-of-the-art defense approaches. The code is available online.

Saturday May 06, 2023
In this episode we discuss Spatio-Temporal Pixel-Level Contrastive Learning-based Source-Free Domain Adaptation for Video Semantic Segmentation
by Authors:
- Shao-Yuan Lo
- Poojan Oza
- Sumanth Chennupati
- Alejandro Galindo
- Vishal M. Patel
Affiliations:
- Shao-Yuan Lo: Johns Hopkins University
- Poojan Oza: Amazon
- Sumanth Chennupati: Amazon
- Alejandro Galindo: Amazon
- Vishal M. Patel: Johns Hopkins University

The paper addresses unsupervised domain adaptation (UDA) for semantic segmentation, which transfers labeled source knowledge to an unlabeled target domain by accessing both source and target data. In practice, access to the source data is often restricted or infeasible, making standard UDA impractical. Recent works have therefore explored Source-Free Domain Adaptation (SFDA), but current SFDA approaches use only image-level information, which is sub-optimal for video applications. The paper proposes Spatio-Temporal Pixel-Level (STPL) contrastive learning, which exploits spatio-temporal information to better compensate for the absence of source data, achieving state-of-the-art performance on Video Semantic Segmentation (VSS) benchmarks compared with existing UDA and SFDA approaches.

Saturday May 06, 2023
In this episode we discuss Ego-Body Pose Estimation via Ego-Head Pose Estimation
by Authors:
- Jiaman Li
- C. Karen Liu
- Jiajun Wu
Affiliation:
- Stanford University
Contact:
- {jiamanli,karenliu,jiajunwu}@cs.stanford.edu

The paper proposes EgoEgo, a new method for estimating 3D human motion from egocentric video that addresses two challenges: the user's body is largely unobserved by the front-facing camera, and collecting large-scale, high-quality datasets of paired egocentric videos and 3D human motions requires accurate motion-capture devices. EgoEgo decomposes the problem into two stages connected by head motion as an intermediate representation: it first estimates accurate head motion using SLAM combined with a learning approach, then generates multiple plausible full-body motions via conditional diffusion. This decomposition removes the need for paired egocentric-video and human-motion datasets, allowing large-scale datasets of each kind to be leveraged separately. EgoEgo performs significantly better than current state-of-the-art methods on both a synthetic dataset developed by the authors and real data.
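The two-stage decomposition can be shown as a pipeline skeleton in which head motion is the only interface between the stages. The function bodies below are placeholders for illustration only, not the authors' SLAM or diffusion models.

```python
# Structural sketch of the two-stage decomposition described above;
# the function bodies are stubs, not the authors' models.

def estimate_head_motion(ego_frames):
    """Stage 1 (stub): SLAM + learned refinement -> per-frame head poses."""
    return [f"head_pose_{t}" for t in range(len(ego_frames))]

def sample_full_body_motions(head_motion, n_samples=3):
    """Stage 2 (stub): conditional diffusion -> multiple plausible
    full-body motions, all consistent with the same head trajectory."""
    return [{"head": head_motion, "body": f"hypothesis_{i}"}
            for i in range(n_samples)]

# Head motion is the only interface between the stages, which is why the
# two models can be trained on separate (unpaired) datasets.
head = estimate_head_motion(ego_frames=[0, 1, 2, 3])
candidates = sample_full_body_motions(head)
```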

Leverage AI to learn AI
Welcome to the AI Breakdown podcast, where we leverage the power of artificial intelligence to break down recent AI papers and provide simplified explanations of intricate AI topics for educational purposes. We're delighted to have you join us on this exciting journey into the world of artificial intelligence. Our goal is to make complex AI concepts accessible to everyone, and we achieve this by utilizing advanced AI technologies.
Hosts and Ownership: AI Breakdown is under the ownership and management of Megan Maghami and Ramin (Ray) Mehran. Although Megan and Ray lend their voices to the podcast, the content and audio are produced through automated means. Prior to publication, they carefully review the episodes created by AI. They leverage advanced AI technologies, including cutting-edge Large Language Models (LLM) and Text-to-Speech (TTS) systems, to generate captivating episodes. By harnessing these ingenious tools, they deliver enlightening explanations and in-depth analyses on various AI subjects.
Enhancing Your Learning Experience: Your feedback and engagement are crucial to us as we strive to enhance the podcast and provide you with the best possible learning experience. We encourage you to share your thoughts, suggestions, and questions related to our episodes. Together, we can build a vibrant community of AI enthusiasts, learners, and experts, fostering collaboration and knowledge sharing.
Technical Details and Episode Archives: For those interested in the technical aspects behind our AI-generated content, we will provide further insights in upcoming blog posts. Additionally, we will regularly update the blog with published episodes of the AI Breakdown podcast, ensuring convenient access to all our educational resources.



