Data Science at Home
Episodes

Monday Nov 25, 2024
Humans vs. Bots: Are You Talking to a Machine Right Now? (Ep. 273)
In this episode of Data Science at Home, host Francesco Gadaleta dives deep into the evolving world of AI-generated content detection with experts Souradip Chakraborty, Ph.D. student at the University of Maryland, and Amrit Singh Bedi, CS faculty at the University of Central Florida.
Together, they explore the growing importance of distinguishing human-written from AI-generated text, discussing real-world examples from social media to news. How reliable are current detection tools like DetectGPT? What are the ethical and technical challenges ahead as AI continues to advance? And is the balance between innovation and regulation tipping in the right direction?
Tune in for insights on the future of AI text detection and the broader implications for media, academia, and policy.
Chapters
00:00 - Intro
00:23 - Guests: Souradip Chakraborty and Amrit Singh Bedi
01:25 - Distinguishing AI-Generated Text
04:33 - Research on Safety and Alignment of Generative Models
06:01 - Tools to Detect AI-Generated Text
11:28 - Watermarking
18:27 - Challenges in Detecting Large AI-Generated Documents
23:34 - Number of Tokens
26:22 - Adversarial Attacks
29:01 - True Positives and False Positives of Detectors
31:01 - Limits of the Technology
41:01 - Future of AI Detection Techniques
46:04 - Closing Thoughts
Subscribe to our new YouTube channel https://www.youtube.com/@DataScienceatHome

Saturday Jan 14, 2023
Accelerating Perception Development with Synthetic Data (Ep. 214)
In this episode I speak with Kevin McNamara, founder and CEO of Parallel Domain, about a very effective method for generating synthetic data that is currently in production at Parallel Domain.
Enjoy the show!
References
Parallel Domain Synthetic Data Improves Cyclist Detection (blog post):
https://paralleldomain.com/parallel-domain-synthetic-data-improves-cyclist-detection/
Beating the State of the Art in Object Tracking with Synthetic Data:
https://paralleldomain.com/beating-the-state-of-the-art-in-object-tracking-with-synthetic-data/
Parallel Domain Open Synthetic Dataset:
https://paralleldomain.com/open-datasets/bicycle-detection
How Toyota Research Institute Trains Better Computer Vision Models with PD Synthetic Data (interview):
https://www.youtube.com/watch?v=QIYttoVxf2w
Career Opportunities:
https://paralleldomain.com/careers

Saturday Aug 29, 2020
Testing in machine learning: generating tests and data (Ep. 117)
In this episode I speak with Adam Leon Smith, CTO at DragonFly and an expert in testing strategies for software and machine learning.
On September 15th there will be a live@Manning Rust conference. In one Rust-filled day you will attend many talks about what's special about Rust, building high-performance web services and video games, WebAssembly, and much more. If you want to meet the tribe, tune in on September 15th to the live@Manning Rust conference.

Sunday Jul 26, 2020
GPT-3 cannot code (and never will) (Ep. 114)
The hype around GPT-3 is alarming and paints a worrying picture of how widely artificial intelligence is misunderstood. In response to comments claiming GPT-3 will take developers' jobs, in this episode I share some personal opinions about the state of AI in generating source code (and GPT-3 in particular).
If you have comments about this episode or just want to chat, come join us on the official Discord channel.
This episode is supported by Amethix Technologies.
Amethix is a consulting firm focused on data science, machine learning, and artificial intelligence. They work to maximize the impact of the world’s leading corporations, startups, and nonprofits, so those organizations can create a better future for everyone they serve.

Monday Nov 18, 2019
How to improve the stability of training a GAN (Ep. 88)
Generative Adversarial Networks (GANs) are very powerful tools for generating data. However, training a GAN is not easy. More specifically, GANs suffer from three major issues: instability of the training procedure, mode collapse, and vanishing gradients.
In this episode I explain not only the most challenging issues one encounters while designing and training Generative Adversarial Networks, but also some methods and architectures to mitigate them. In addition, I describe three specific strategies that researchers are considering to improve the accuracy and reliability of GANs.
The most tedious issues of GANs
Convergence to equilibrium
A typical GAN is formed by at least two networks: a generator G and a discriminator D. The generator's task is to generate samples from random noise. In turn, the discriminator has to learn to distinguish fake samples from real ones. While it is theoretically possible that the generator and the discriminator converge to a Nash equilibrium (at which both networks are in their optimal state), reaching such an equilibrium is not easy.
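The adversarial game between G and D can be sketched on toy 1-D data. The following is a minimal, hypothetical numpy-only illustration (not code from the episode): real data is a Gaussian centred at 3, the generator is linear in the noise, the discriminator is logistic regression, and both take alternating gradient steps on the standard GAN objective.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

w, b = 0.1, 0.0   # discriminator D(x) = sigmoid(w*x + b)
u, c = 1.0, 0.0   # generator G(z) = u*z + c
lr = 0.05

for step in range(800):
    x = rng.normal(3.0, 1.0, 64)   # real batch
    z = rng.normal(0.0, 1.0, 64)   # noise batch
    g = u * z + c                  # fake batch

    dx = sigmoid(w * x + b)        # D's scores on real samples
    dg = sigmoid(w * g + b)        # D's scores on fake samples

    # Gradient ascent for D on  log D(x) + log(1 - D(G(z)))
    w += lr * (np.mean((1 - dx) * x) - np.mean(dg * g))
    b += lr * (np.mean(1 - dx) - np.mean(dg))

    # Gradient ascent for G on the non-saturating objective  log D(G(z))
    dg = sigmoid(w * g + b)
    u += lr * np.mean((1 - dg) * w * z)
    c += lr * np.mean((1 - dg) * w)

fake_mean = np.mean(u * rng.normal(0.0, 1.0, 10000) + c)
print(f"fake mean after training: {fake_mean:.2f}")  # should drift from 0 towards the real mean of 3
```

Even in this tiny example the two players chase each other rather than descending a single loss surface, which is exactly why convergence to the equilibrium is fragile in practice.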
Vanishing gradients
Moreover, a very accurate discriminator would push its loss towards lower and lower values. This, in turn, might cause the generator's gradients to vanish and the entire network to stop learning completely.
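This saturation falls straight out of the sigmoid algebra. Writing the discriminator's score on a fake sample as D = σ(a), the original generator loss log(1 − D) has gradient −σ(a) with respect to the logit a, which vanishes precisely when a confident discriminator rejects fakes (a ≪ 0); the common non-saturating alternative −log D keeps a gradient of −(1 − σ(a)) ≈ −1 there. A quick numeric check (a toy sketch, not the episode's code):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def saturating_grad(a):
    # d/da of the generator loss log(1 - D(G(z))), with D = sigmoid(a): equals -sigmoid(a)
    return -sigmoid(a)

def non_saturating_grad(a):
    # d/da of the alternative generator loss -log D(G(z)): equals -(1 - sigmoid(a))
    return -(1.0 - sigmoid(a))

# A well-trained discriminator assigns fakes a very negative logit:
for a in (-8.0, -4.0, 0.0):
    print(f"a={a:5.1f}  saturating={saturating_grad(a):+.5f}  non-saturating={non_saturating_grad(a):+.5f}")
```

At a = −8 the saturating gradient is about −0.0003 (learning stalls), while the non-saturating one stays close to −1.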
Mode collapse
Another phenomenon that is easy to observe when dealing with GANs is mode collapse: the inability of the model to generate diverse samples. This, in turn, leads to generated data that are more and more similar to one another. Hence, the entire generated dataset ends up concentrated around a particular statistical value.
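Collapse is easy to spot numerically: a healthy generator spreads its noise over the data distribution, while a collapsed one maps almost all noise to the same value, so sample diversity (e.g. the standard deviation of a generated batch) shrinks towards zero. A toy illustration with two hypothetical generators (not from the episode):

```python
import numpy as np

rng = np.random.default_rng(42)
z = rng.normal(size=1000)             # latent noise

healthy_samples = 3.0 + z             # diversity preserved: std close to 1
collapsed_samples = 3.0 + 0.01 * z    # every sample piles up near 3: std close to 0.01

print(f"healthy std:   {healthy_samples.std():.3f}")
print(f"collapsed std: {collapsed_samples.std():.3f}")
```

Tracking a diversity statistic like this across training batches is one cheap way to notice collapse early, long before inspecting the samples by eye.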
The solution
Researchers have considered several approaches to overcome these issues, experimenting with architectural changes, different loss functions, and game theory.
Listen to the full episode to learn more about the most effective strategies for building reliable and robust GANs. Don't forget to join the conversation on our new Discord channel. See you there!
Monday Nov 04, 2019
[RB] How to generate very large images with GANs (Ep. 85)
Join the discussion on our Discord server
In this episode I explain how a research group from the University of Lübeck tamed the curse of dimensionality to generate large medical images with GANs. The problem is not as trivial as it seems: many researchers have failed to generate large images with GANs before. One interesting application of this approach is in medicine, for the generation of CT and X-ray images.
Enjoy the show!
References
Multi-scale GANs for Memory-efficient Generation of High Resolution Medical Images https://arxiv.org/abs/1907.01376