This is a very long post, so take a breath and get ready for one of Daniele’s usual “spiegone” (Italian for a long-winded explanation).
I have been part of Mozilla Italia since 2013, contributing first to Firefox OS development and events/promotion (for example, I localized the development e-book), then to WebExtensions promotion, and to many other things as a coder on various portals and projects.
I mention this to show that this is not my first rodeo (for the people who don’t know my involvement in Mozilla in detail).
After a call with Ruben Martin (now Mozilla Community Strategist, who is following Common Voice), I started a document about Mozilla Italia’s ideas and needs for getting more people to contribute to Common Voice (CV for the rest of the article).
This document was shared in the Mozilla Italia monthly calls and also on Telegram to get reviews and gather feedback, leading to the final version that exists now (in English).
By we/us in this article I mean the other volunteers (and me in some cases) who worked on the same tasks: Saverio, Simone, Giovanni, Stefania, Damiano, Pasquale, Enrico, Carmelo and many others.
This document is our roadmap for the various things we can do to get more contributors and improve the promotion of CV and of our community.
Let’s take a step back to a year ago, July 2018, when we unlocked Italian on the CV website. It was a difficult task: we gathered and reviewed 11000 sentences from various public domain sources (excluding Wikipedia for various legal reasons). We also defined our rules, such as a maximum of 120 characters, and a team worked on reviewing the grammar rules and on finding this many varied sources to improve the entropy.
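To give an idea of how a rule like the 120-character limit can be checked automatically, here is a minimal sketch in Python. It is only an illustration and not the script we actually used: the function names are hypothetical, and the only rule taken from this post is the maximum length (the grammar rules are not modeled).

```python
MAX_CHARS = 120  # the maximum sentence length mentioned in this post


def follows_rules(sentence: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the sentence is non-empty and within the length limit."""
    text = sentence.strip()
    return 0 < len(text) <= max_chars


def filter_sentences(sentences):
    """Keep only the sentences that pass the length rule, stripped of outer spaces."""
    return [s.strip() for s in sentences if follows_rules(s)]
```

A file with one sentence per line could be read into a list and passed to `filter_sentences` before the human review of grammar and sources.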
In August 2018 we ran a giveaway with prizes (we had a lot of Mozilla swag) based on the number of sentences recorded at a hacker event in Italy, to get the first 10 hours of audio and also feedback on our work.
What we learned led us to change some grammar rules and also some sentence rules, which improved the overall quality, because when the Italian was badly written the whole project got bad feedback/promotion, with people who then did not help with the promotion itself.
In the meantime, to help the many other languages that don’t have a community like ours that can do this human work, Mozilla created a new tool: the Sentence Collector. As this tool is open source, we added support for Italian using our rules from the first batch of sentences we reviewed manually.
Then we discovered that Sentence Collector is not suitable for us. It is very helpful for scaling to other languages because everyone can upload new sentences; they are then checked against the rules (if available for that language) and can be approved with 3 approvals (or rejected with 3 rejections) by other people who can log in to the tool.
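The approval flow, as I described it here, can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions, not the real code of the Sentence Collector: the function name and the `"approve"`/`"reject"` labels are hypothetical, and the threshold of 3 is the one stated in this post.

```python
from collections import Counter

VOTES_NEEDED = 3  # threshold as described in this post


def review_status(votes, needed: int = VOTES_NEEDED) -> str:
    """Return 'approved', 'rejected' or 'pending' for a list of votes.

    Each vote is the string 'approve' or 'reject' (hypothetical labels).
    """
    counts = Counter(votes)
    if counts["approve"] >= needed:
        return "approved"
    if counts["reject"] >= needed:
        return "rejected"
    return "pending"
```

Note that nothing in this flow records who voted or what the reviewers know about Italian.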
Compared to our first batch, this workflow created more issues:
- We cannot create a review team with skills in Italian grammar rules and a better knowledge of Italian
- We cannot create a sources team, because everyone can upload sentences and it is not possible to remove them easily, and we also have rules about which sources to use (for example, only content after 1920/30, because Italian changed a lot in that period and can be very different)
- We cannot know who the approvers or the uploaders are, or reach them to send them the rules that we created
- The sentences need to be approved by clicking a button in a view of 5 sentences. This requires at least 3 people to review them and is very time-consuming compared to the previous solution, which was a file that we could also edit (and use with tools like GitHub to track changes or run a spellcheck)
This tool basically broke the trust in the sentence quality that had required a lot of work from us. So after a few tests we chose to abandon it and focus on other things, even though there are people we don’t know who are uploading and approving sentences for Italian.
I discussed the issues this tool has for us with Ruben a few times, but there is no way to change it (there are no contributors working on it), also because they need this workflow to scale to other languages. At the same time Mozilla wants to create a separate community for CV instead of making it part of the local one, in our case Mozilla Italia. This split also explains why there is another tool and this approach, but it creates issues with reviewing and with the project brand that for Mozilla are not a real problem right now.
Another project was also created to scale more easily and get millions of sentences instead of thousands: the Wikipedia scraper. This is another open source tool; we added the Italian rules in July 2019 and tested the scraping, which takes random sentences from the Wikipedia dump of a specific language.
This enabled us to get high-quality reviewed sentences, but this tool and the various tools that generate the word blacklist (which we added for Italian in October 2019) still have various bugs as of today.
These tools don’t have enough contributors to fix the various issues (I helped on the blacklist generator), including the fact that sentences with issues cannot currently be removed; only new ones can be added.
Basically, removing sentences already in the system requires manual selection. It is not possible to remove them in bulk right now, and even if it were, it would be quite useless until we fix the tools that generate the sentence list. Also, Common Voice now lets people report sentences, but it is not possible to see the reports, and having contributors under the legal age of 18 is not recommended because the sentences are not validated for that age.
Until November we had only 10 hours of recordings in 5 months, far too few for a useful dataset. So, fast forward to the document, we chose these points to “attack” to get more contributors:
- National promotion
- Generate the DeepSpeech model for voice recognition
- Promote the model
- Understand the contributors audience
- Reaching universities/institute help
As a team we are still working on the various aspects, and the document includes the various updates, the tickets opened and also the blockers we have.
So far, we generated the model in October, working on the scripts for Italian with the first release of the Italian dataset from Common Voice (40 hours) and a few datasets we found. We are also working on other aspects, like a better Italian text corpus and defining new audio datasets.
At the same time the model as it is now is not very good, so we are working to improve it until the recognition is better, or at least good enough to be used.
The big problem we have right now is that the Italian audio+text and text-only datasets come from the academic world, and they are not released with a license. They just require a mention of the academic paper, and we don’t know how to deal with that for a public domain model, so I started a public discussion on Mozilla Discourse to get feedback about how to use them and what we have to do. So far we haven’t received any feedback in this discussion that helps us move forward with this project.
National promotion, Promote the model and Universities/Institutions
In November 2019 Mozilla added Italian to the languages it promotes on the about:home page of Firefox. This was the most powerful promotion we got, even if we are now stalling. We got 50 hours of new recordings in less than a month and now have 100+ hours, but we need to do more on reviewing them.
When we reached out to Italian journalists to promote the project, they weren’t interested in the project itself but in the privacy concerns around voice recognition tools, so we worked on something else. Instead, when we reached out to universities, they were more interested in DeepSpeech and in using the Italian dataset for their own purposes.
So we stopped the promotion until the model was ready, to have something to show at events and to people, and to gather more contributors and volunteers for the Mozilla Italia community. In this way we got some employees of companies that work with machine learning, and also university students, interested in helping us as contributors to the model itself. We also found people who gave us their servers to generate the model!
One example of what these new people brought was a Telegram bot that compares a recording against our model and Google Speech. We also got more feedback on the promotion, and people interested in contributing to other Mozilla areas like localization.
So the next step is to promote the model better, and we have a few ideas (check the document). This way, when we start talking again with journalists/universities/institutions, we will have something real to show that works, along with the opportunities that a free and open alternative can create.
In October I also started a forum thread with updates (in Italian) about what we are doing and have done in Common Voice and DeepSpeech, to give all the volunteers and also newcomers a recap of everything.
On the national promotion, we also saw that newcomers always have the same questions, which the CV website doesn’t answer, so we are working on several Italian videos that explain:
- The origins of sentences (people want to know that)
- Why contribute, not just for a dataset but real usage
- How to do the best audio recording
- How to use it
We already have a few video drafts, but we are still working on them; maybe we will publish them next month.
Also, our Italian Telegram groups have grown a lot, with new people and more discussions (check @mozitabot on Telegram), and we are getting feedback to move the community itself forward.
For reference, before we started working on CV (2018) the community wasn’t growing, but after a year the situation is completely different. We also have new people on our monthly calls, which are now public on YouTube.
Understand the contributors audience and Recognition
These 2 points are not moving forward because right now CV doesn’t let us understand who the contributors are or reach them.
For recognition, instead, as I have swag, when I meet people I give them t-shirts or stickers, but this doesn’t scale because our volunteers are spread across the whole country.
We want a single community that leads CV and the DeepSpeech model for Italian, because we started it, we are the quality guarantee of the project, and it revives our community, letting us grow and also do new things.
Compared to the start, we still have some problems, as already written, with sentence quality, which creates issues for the project, and with the various tools, which have no coders helping on them (the sad part is that Mozilla is not promoting them to recruit contributors).
Having a community that leads the various tasks, delegates, and organizes events to gather feedback from other Italians helped us understand what we have to do to help our language. With a decentralized community this wouldn’t have been possible, including all the lessons we learned (and, again, having something to play with: the model). I have to say that this leadership also drew on the experience behind my free book, which I wanted to apply at a higher level than before, just to test new things.
At the same time, Mozilla’s promotion let us scale with new people and recordings, something we could not achieve alone. We need more people for the various tasks, but our forum and Telegram were very important for getting everything done.
The large influx of new people started in September with the model, with promotion on Italian subreddits on Reddit. At the same time, official Mozilla portals like CV don’t let us add a custom link on the home page to the model itself (one of our suggestions that you can find in the roadmap).
Personally, I think that creating another community without any reference to the local one (if it exists) is against the Mozilla manifesto, because it is not inclusive of everyone and the manifesto states:
We are committed to an internet that catalyzes collaboration among diverse communities working together for the common good.
Individuals must have the ability to shape the internet and their own experiences on it.
The effectiveness of the internet as a public resource depends upon interoperability (protocols, data formats, content), innovation and decentralized participation worldwide.
This article was inspired by a request from Ruben himself to explain what we did, how, and what we achieved.
If you want to discuss it, let’s do it on Mozilla’s Discourse.