Activity Feed

AI & ML interests

None defined yet.

Recent Activity

theo-michel updated a Space 1 day ago
Mistral-AI-Game-Jam/shyguys_2
theo-michel updated a Space 2 days ago
Mistral-AI-Game-Jam/shyguys_2
fracapuano authored a paper 4 months ago
Robot Learning: A Tutorial

MikeDoes posted an update about 10 hours ago

Are you sure the open-source model you just downloaded is safe?

A recent paper on "Privacy Backdoors" reports a new vulnerability in which pre-trained models can be poisoned before they are ever fine-tuned. This is a serious challenge for everyone building on open-source AI.

Instead of just pointing out problems, we believe in finding better solutions. To understand this threat, the researchers needed to test their attack on realistic data structures. They needed a dataset that could effectively simulate a high-stakes privacy attack, and we're proud that our Ai4Privacy dataset was used to provide this crucial benchmark. The paper reports that for our complex dataset, the privacy leakage on a non-poisoned model was almost zero. After the backdoor attack, that number reportedly jumped to 87%.

The Ai4Privacy dataset provided a realistic benchmark for their research. Our dataset, composed of synthetic identities, helped them demonstrate how a poisoned model could dramatically amplify privacy leakage.

This is why we champion open source: it enables the community to identify these issues and develop better, safer solutions together.
Kudos to the authors Yuxin Wen, Leo Marchyok, Sanghyun Hong, Jonas Geiping, Tom Goldstein, and Nicholas Carlini of the University of Maryland and Google DeepMind.

🔗 Read the research to understand this new challenge: https://arxiv.org/pdf/2404.01231

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

MikeDoes posted an update 2 days ago

A single lock on a door isn't enough. Real security is about layers.

The same is true for AI privacy. A new paper, "Whispered Tuning", offers a fantastic layered solution that aims to fortify LLMs against privacy infringements.

We're proud that the first, essential layer, a high-precision PII redaction model, was built on the foundation of the Ai4Privacy/pii-65k dataset.

Our dataset provided the necessary training material for their initial anonymization step, which then enabled them to develop further innovations like differential privacy fine-tuning and output filtering. This is a win-win: our data helps create a solid base, and researchers build powerful, multi-stage privacy architectures on top of it.
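
For intuition, here's a minimal sketch of what that first redaction layer can look like, assuming any token-classification checkpoint fine-tuned for PII detection (the model id below is a placeholder, not the model from the paper):

```python
from transformers import pipeline

# Placeholder checkpoint: substitute any PII token-classification model.
ner = pipeline(
    "token-classification",
    model="your-org/pii-ner-model",  # hypothetical id, not the paper's model
    aggregation_strategy="simple",
)

def redact(text: str) -> str:
    """Layer 1: replace each detected PII span with its entity label."""
    # Replace from the end of the string so earlier offsets stay valid.
    spans = sorted(ner(text), key=lambda s: s["start"], reverse=True)
    for s in spans:
        text = text[: s["start"]] + f"[{s['entity_group']}]" + text[s["end"]:]
    return text

def filter_output(text: str) -> str:
    """Final layer (sketch): re-check the model's answer so nothing that
    still looks like PII is returned to the user."""
    return redact(text)

print(redact("Contact Jane Doe at jane.doe@example.com"))
```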

Together, we're making AI safer.

🔗 Read the full paper to see how a strong foundation enables a complete privacy solution: https://www.scirp.org/journal/paperinformation?paperid=130659

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

MikeDoes posted an update 7 days ago

How can we teach a robot to understand the nuances of privacy in elderly care? It starts with teaching it to recognize sensitive data.

A new conceptual paper introduces "Privacy Agents": AI designed to safeguard contextual integrity in care settings. To demonstrate that their innovative concept is feasible, the researchers from TU Wien needed to prove an AI could identify PII in a real-world transcript.

We're proud that the tool they used for this proof-of-concept was fine-tuned on the Ai4Privacy/pii-masking-200k dataset.

This is a perfect win-win: brilliant researchers are designing the future of privacy-aware robotics, and our open-source data helps provide the foundational tools to show it's possible. This is how conceptual breakthroughs become practical solutions.

🔗 Check out their forward-thinking paper on the future of privacy in Human-Robot Interaction: http://hirschmanner.com/publication/privacy-hri-2024/privacy-hri-2024.pdf

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

MikeDoes posted an update 9 days ago

What happens when an LLM "forgets" your data? A new paper reports it might not be gone for good.

The "Janus Interface" paper details a new attack that could recover forgotten PII through fine-tuning APIs. This is a solution-oriented paper because it highlights a problem that needs fixing.

Testing such a high-stakes attack requires equally high-stakes data. The Ai4Privacy 300k dataset was a key part of their evaluation, providing a testbed for extracting sensitive Social Security Numbers. Our dataset, with its synthetic structured SSN data, helped the researchers at Indiana University, Stanford & CISPA, and others demonstrate that their attack works on more than just emails. It could affect highly sensitive personal identifiers.

We're excited to see our open-source dataset used in such cutting-edge security research. It's a win for the community when researchers can use our resources to stress-test the safety of modern AI systems. This work is a direct and explicit call for stronger protections on fine-tuning interfaces.

🔗 This is why open data for security research is so important. Check out the full paper: https://arxiv.org/pdf/2310.15469

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

MikeDoes posted an update 15 days ago

Why choose between performance, privacy, and transparency when you can have all three?

We're highlighting a solution-oriented paper that introduces PRvL, an open-source toolkit for PII redaction. The interesting part: the researchers used the AI4Privacy-300K and AI4Privacy-500K datasets to train and benchmark their suite of models.

This is the power of open-source collaboration. We provide the comprehensive data foundation, and the community builds better solutions on top of it. It's a win for every organization when this research results in a powerful, free, and self-hostable tool that helps keep their data safe.

Big cheers to Leon Garza, Anantaa Kotal, Aritran Piplai, Lavanya Elluri, Prajit D., and Aman Chadha for pulling this off.

🔗 Read the full paper to see their data-driven results and access the PRvL toolkit: https://arxiv.org/pdf/2508.05545

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update 16 days ago

How do you prove your new, specialized AI model is a better solution? You test it against the best.

That's why we were excited to see the new AdminBERT paper from researchers at Nantes Université and others. To show the strength of their new model for French administrative texts, they compared it to the state-of-the-art generalist model, NERmemBERT.

The direct connection to our work is clear: NERmemBERT was trained on a combination of datasets, including the Pii-masking-200k dataset by Ai4Privacy.

This is a perfect win-win for the open-source community. Our foundational dataset helps create a strong, general-purpose benchmark, which in turn helps researchers prove the value of their specialized work. This is how we all get better.

🔗 Great work by Thomas Sebbag, Solen Quiniou, Nicolas Stucky, and Emmanuel Morin on tackling a challenging domain! Check out their paper: https://aclanthology.org/2025.coling-main.27.pdf

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update 21 days ago

The future of AI privacy isn't just in the cloud; it's on your device. But how do we build and validate these tools?

A new paper on "Rescriber" explores this with a tool that uses smaller LLMs for on-device anonymization. Building and validating such tools requires a strong data foundation. We're excited to see that the researchers used the Ai4Privacy open dataset to create their performance benchmarks.

This is our mission in action: providing the open-source data that helps innovators build and test better solutions that will give users more control over their privacy. It's a win for the community when our data helps prove the feasibility of on-device AI for data minimization, with reported user perceptions on par with state-of-the-art cloud models.

Shoutout to Jijie Zhou, Eryue Xu, Yaoyao Wu, and Tianshi Li on this one!

🔗 Check out the research to see how on-device AI, powered by solid data, is changing the game: https://dl.acm.org/doi/pdf/10.1145/3706598.3713701

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update 23 days ago

Building powerful multilingual AI shouldn't mean sacrificing user privacy.

We're highlighting a solution-oriented report from researchers Sahana Naganandh, Vaibhav V, and Thenmozhi M at Vellore Institute of Technology that investigates this exact challenge. The direct connection to our mission is clear: the paper showcases the PII43K dataset as a privacy-preserving alternative to high-risk, raw multilingual data.

The report notes that our dataset, with its structured anonymization, is a "useful option for privacy-centric AI applications." It's always a delight when academic research independently validates our data-first approach to solving real-world privacy problems.

This is how we build a safer AI future together.

🔗 Read the full report here to learn more: https://assets.cureusjournals.com/artifacts/upload/technical_report/pdf/3689/20250724-59151-93w9ar.pdf

🚀 Stay updated on the latest in privacy-preserving AI. Follow us on LinkedIn: https://www.linkedin.com/company/ai4privacy/posts/

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update 29 days ago

We can't build more private AI if we can't measure privacy intelligence.

That's why we're highlighting the Priv-IQ benchmark, a new, solution-oriented framework for evaluating LLMs on eight key privacy competencies, from visual privacy to knowledge of privacy law. The direct connection to our work is clear: the researchers relied on samples from the Ai4Privacy dataset to build out questions for Privacy Risk Assessment and Multilingual Entity Recognition.

This is the power of open-source collaboration. We provide the data building blocks, and researchers construct powerful new evaluation tools on top of them. It's a win-win for the entire ecosystem when we can all benefit from transparent, data-driven benchmarks that help push for better, safer AI.

Kudos to Sakib Shahriar and Rozita A. Dara for this important contribution. Read the paper to see the results: https://www.proquest.com/docview/3170854914?pq-origsite=gscholar&fromopenview=true&sourcetype=Scholarly%20Journals

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update about 1 month ago

Traditional data leak prevention is failing. A new paper offers a solution-oriented approach inspired by evolution.

The paper introduces a genetic-algorithm-driven method for detecting data leaks. To prove its effectiveness, the researchers Anatoliy Sachenko, Petro V., Oleg Savenko, Viktor Ostroverkhov, Bogdan Maslyyak from Casimir Pulaski Radom University and others needed a real-world, complex PII dataset. We're proud that the AI4Privacy PII 300k dataset was used as a key benchmark for their experiments.
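
To make the evolutionary idea concrete, here's a deliberately tiny sketch of the loop shape (toy data and keyword detectors of our own invention; the paper's method is far richer):

```python
import random

# Toy labeled leak samples and a toy detector vocabulary.
LEAK_SAMPLES = ["ssn 123-45-6789", "card 4111-1111-1111-1111", "email a@b.com"]
VOCAB = ["ssn", "card", "email", "hello", "invoice", "weather"]

def fitness(genome):
    # Fraction of leak samples caught by at least one keyword in the genome.
    return sum(any(kw in s for kw in genome) for s in LEAK_SAMPLES) / len(LEAK_SAMPLES)

population = [random.sample(VOCAB, 2) for _ in range(20)]
for generation in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:10]                         # selection
    children = []
    for _ in range(10):
        a, b = random.sample(parents, 2)
        child = [random.choice(a), random.choice(b)]  # crossover
        if random.random() < 0.2:                     # mutation
            child[random.randrange(len(child))] = random.choice(VOCAB)
        children.append(child)
    population = parents + children

print(fitness(population[0]), population[0])
```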

This is the power of open-source collaboration. We provide complex, real-world data challenges, and brilliant researchers develop and share better solutions to solve them. It's a win for every organization when this research helps pave the way for more adaptive and intelligent Data Loss Prevention systems.

🔗 Read the full paper to see the data and learn how genetic algorithms are making a difference in cybersecurity: https://ceur-ws.org/Vol-4005/paper19.pdf

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update about 1 month ago

Anonymizing a prompt is half the battle. Reliably de-anonymizing the response is the other.

To build a truly reliable privacy pipeline, you have to test it. A new Master's thesis does just that, and our data was there for every step.

We're excited to showcase this work on handling confidential data in LLM prompts from Nedim Karavdic at Mälardalen University. To build their PII anonymization pipeline, they first trained a custom NER model. We're proud that the Ai4Privacy pii-masking-200k dataset was used as the foundational training data for this critical first step.

But it didn't stop there. The research also used our dataset to create the parallel data needed to train and test the generative "Seek" models for de-anonymization. It's a win-win when our open-source data not only helps build the proposed "better solution" but also helps prove why it's better by enabling a rigorous, data-driven comparison.
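
As a mental model, the anonymize-prompt-de-anonymize round trip looks something like the sketch below. Both steps are stubbed with toy logic; the thesis trains a NER model for step 1 and generative "Seek" models for step 3.

```python
import re

def anonymize(text: str):
    """Step 1 (toy stub): swap each detected entity for a numbered
    placeholder and remember the mapping. The thesis uses a NER model
    trained on pii-masking-200k instead of this crude name regex."""
    mapping = {}
    def repl(match):
        key = f"[PERSON_{len(mapping)}]"
        mapping[key] = match.group(0)
        return key
    masked = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", repl, text)
    return masked, mapping

def deanonymize(text: str, mapping: dict) -> str:
    """Step 3 (toy stub): restore original values in the LLM's response."""
    for key, value in mapping.items():
        text = text.replace(key, value)
    return text

masked, mapping = anonymize("Write a reference letter for Jane Doe.")
# Pretend this is the LLM's answer to the masked prompt.
response = f"Dear committee, {next(iter(mapping))} is an excellent candidate."
print(deanonymize(response, mapping))
```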

🔗 Check out the full thesis for a great deep-dive into building a practical, end-to-end privacy solution: https://www.diva-portal.org/smash/get/diva2:1980696/FULLTEXT01.pdf

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update about 1 month ago

The tools we use to audit AI for privacy might be easier to fool than we think.

We're highlighting a critical paper that introduces "PoisonM," a novel attack that could make membership inference (MI) tests unreliable. The direct connection to our work is explicit: the researchers, Neal M., Atul Prakash, Amrita Roy Chowdhury, Ashish Hooda, Kassem Fawaz, Somesh Jha, Zhuohang Li, and Brad Malin, used the AI4Privacy dataset as the "canary" dataset in their experiments to test the effectiveness of their attack on realistic, sensitive information.

This is the power of a healthy open-source ecosystem. We provide the foundational data that helps researchers pressure-test our collective assumptions about AI safety. It's a win for everyone when this leads to a more honest conversation about what our tools can and can't do, pushing us all to create better solutions.

🔗 Read the full paper to understand the fundamental flaws in current MI testing: https://arxiv.org/pdf/2506.06003

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update about 1 month ago

What if an AI agent could be tricked into stealing your data, just by reading a tool's description? A new paper reports it's possible.

The "Attractive Metadata Attack" paper details this stealthy new threat. To measure the real-world impact of their attack, the researchers needed a source of sensitive data for the agent to leak. We're proud that the AI4Privacy corpus was used to create the synthetic user profiles containing standardized PII for their experiments.

This is a perfect win-win. Our open-source data helped researchers Kanghua Mo, 龙昱丞, and Zhihao Li from Guangzhou University and The Hong Kong Polytechnic University not just to demonstrate a new attack, but also to quantify its potential for harm. This data-driven evidence is what pushes the community to build better, execution-level defenses for AI agents.

🔗 Check out their paper to see how easily an agent's trust in tool metadata could be exploited: https://arxiv.org/pdf/2508.02110

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update about 1 month ago

How do you protect your prompts without breaking them? You need a smart sanitizer. A new system called Prϵϵmpt shows how.

The first, critical step in their solution is a high-performance Named Entity Recognition (NER) model to find the sensitive data. We're proud to see that these researchers, Amrita Roy Chowdhury, David Glukhov, Divyam Anshumaan, Prasad Chalasani, Nicolas Papernot, Somesh Jha, and Mihir Bellare from the University of Michigan, University of Toronto, University of Wisconsin-Madison, University of California, San Diego - Rady School of Management, and Langroid Incorporated, fine-tuned their NER model on 10 high-risk categories from the AI4Privacy dataset.

This is a perfect win-win. Our open-source data helps provide the foundation for the critical detection engine, which in turn enables the community to build and test better solutions like Prϵϵmpt's innovative use of encryption and Differential Privacy.
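
A much-simplified sketch of the sanitize step: detect sensitive spans, then swap them for same-format dummies so downstream logic still parses. Prϵϵmpt itself uses format-preserving encryption and differential privacy rather than this random-substitution stub:

```python
import random
import re

def sanitize(prompt: str):
    """Swap SSN-shaped identifiers for same-format dummies, remembering
    the mapping so the answer can be rewritten back afterwards."""
    mapping = {}
    def repl(match):
        original = match.group(0)
        # Keep the format, randomize the digits.
        fake = "".join(random.choice("0123456789") if c.isdigit() else c
                       for c in original)
        mapping[fake] = original
        return fake
    sanitized = re.sub(r"\b\d{3}-\d{2}-\d{4}\b", repl, prompt)
    return sanitized, mapping

safe_prompt, mapping = sanitize("My SSN is 123-45-6789; summarize my file.")
print(safe_prompt)  # same shape, different digits
```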

🔗 Check out their paper for a deep dive into a formally private, high-utility prompt sanitizer: https://arxiv.org/pdf/2504.05147

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

MikeDoes posted an update about 2 months ago

Making LLMs fast with KV-cache sharing is great. A new paper reports it's also a huge privacy risk.

That's why we're excited to see the "SafeKV" paper from researchers at the University of Connecticut, Peking University, and others. Their solution-oriented framework selectively shares non-sensitive data while isolating PII. To validate the "Safe" part of their system, they needed a robust, multilingual privacy benchmark.

We're proud that the Ai4Privacy pii-masking dataset was used for this critical privacy evaluation.
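
In spirit, selective sharing reduces to a gate in front of the shared cache, as in this toy sketch (the contains_pii stand-in and cache layout are our own simplifications, not SafeKV's design):

```python
import hashlib

shared_cache = {}  # prefix hash -> (stand-in for) cached KV entries

def contains_pii(text: str) -> bool:
    # Stand-in detector; SafeKV's detection is far more sophisticated.
    return "@" in text or any(ch.isdigit() for ch in text)

def lookup_or_admit(prefix: str, kv_entries):
    key = hashlib.sha256(prefix.encode()).hexdigest()
    if key in shared_cache:
        return shared_cache[key]          # cross-user reuse: the fast path
    if not contains_pii(prefix):
        shared_cache[key] = kv_entries    # only non-sensitive prefixes shared
    return kv_entries                     # PII-bearing prefixes stay private
```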

This is a perfect win-win. Our open-source data enables researchers to build and validate more effective security solutions for core AI infrastructure. Their work, in turn, helps make the entire LLM ecosystem safer, showing that performance and privacy don't have to be mutually exclusive.

Kudos to Kexin Chu, Zecheng Lin, Dawei Xiang, 沈子旭, Jianchang Su, Cheng Chu, Yiwei Yang, Wenhui Zhang, Wenfei Wu, and Wei Zhang on this beautiful work.

🔗 Check out their paper to see the future of secure, high-performance LLM inference: https://arxiv.org/pdf/2508.08438

#OpenSource #DataPrivacy #LLM #Anonymization #AIsecurity #HuggingFace #Ai4Privacy #Worldslargestopensourceprivacymaskingdataset

nouamanetazi posted an update 3 months ago

After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 24.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8x H100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
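
If you want to sanity-check numbers like these on your own cluster, a rough all-reduce probe is only a few lines. This sketch assumes a torchrun launch and follows the nccl-tests "busbw" convention; it won't reproduce the playbook's exact figures:

```python
# Launch with e.g.: torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)
world = dist.get_world_size()

x = torch.randn(64 * 1024 * 1024, device="cuda")  # 256 MB of fp32

for _ in range(5):  # warmup
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

# Ring all-reduce moves 2*(n-1)/n times the buffer size per rank.
busbw = x.numel() * 4 * 2 * (world - 1) / world / dt
if dist.get_rank() == 0:
    print(f"all-reduce busbw ~= {busbw / 1e9:.1f} GB/s over {world} GPUs")
dist.destroy_process_group()
```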

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

๐“๐ก๐ž ๐’๐ฆ๐จ๐ฅ ๐“๐ซ๐š๐ข๐ง๐ข๐ง๐  ๐๐ฅ๐š๐ฒ๐›๐จ๐จ๐ค: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team

Tonic posted an update 5 months ago

COMPUTER CONTROL IS ON-DEVICE!

🏡🤖 78% of EU smart-home owners DON'T trust cloud voice assistants.

So we killed the cloud.

Meet Exté: a palm-sized Android device that sees, hears & speaks your language - 100% offline, 0% data sent anywhere.

🔓 We submitted our technologies for consideration to the Liquid AI hackathon.

📊 Dataset: 79k UI-action pairs on Hugging Face (largest Android-control corpus ever): Tonic/android-operator-episodes

⚡ Model: 98% task accuracy, 678 MB compressed, fits on existing Android devices! Tonic/l-android-control

🛤️ Experiment Tracker: check out the training on our TrackioApp Tonic/l-android-control

🎮 Live Model Demo: Upload an Android screenshot and instructions to see the model in action! Tonic/l-operator-demo

Built in a garage, funded by pre-orders, no VC. Now we're scaling to 1,000 installer units.

We're giving 50 limited-edition prototypes to investors, installers & researchers who want to co-design the sovereign smart home.

👇 Drop "EUSKERA" in the comments if you want an invite, tag a friend who still thinks Alexa is "convenient," and smash ♥️ if AI should belong to people - not servers.