In January 2023, hackers leaked the source code for parts of Yandex's services. It was one of the biggest leaks in the company's history: the archives total almost 45 GB.
According to rumors, the leak was organized by a disgruntled employee. It does not contain user data, but it did reveal details about how Yandex services work.
Yandex Search, Alice, Taxi, Mail, and Disk were all affected by the leak.
In addition, the published archives contained the code for two Yandex analytics systems: Metrica and Crypta. Wired journalists decided to study the code with experts and find out how Yandex collects and uses information.
Yandex collects a huge amount of data to display ads
Yandex services collect large amounts of data about people. This data can be used to infer users' interests when it is “matched and analyzed” against all the other information the company holds. This is according to a study by Kaileigh McCrea, a privacy engineer at cybersecurity company Confiant.
Based on the timestamps included in the data, the code was last changed in July 2022. It is mostly written in Russian and contains racist slurs. Yandex stated that these words have no effect on how its services operate, but that they are “deeply offensive and completely unacceptable.”
McCrea analyzed the code of Metrica and Crypta. Yandex Metrica is an analogue of Google Analytics that lets site owners review traffic statistics and user behavior. Data from Yandex Metrica is passed to Crypta, a service for selecting personalized ads.
This technology allows advertisers to reach their target audience as accurately as possible. To find out whether a user belongs to a given segment, Crypta examines their behavior on the Internet.
Yandex
The company claims that Crypta analyzes about 300 factors using various machine learning methods.
All of Yandex's applications and services, of which there appear to be more than 90, pass data to Crypta in one form or another to build advertising segments.
Kaileigh McCrea, privacy engineer at Confiant
Some data is transmitted when people use Yandex services, for example when they share their location to see where they are on a map.
Other information is collected automatically. The company can obtain location, search history, home and work addresses, music listening and viewing history, email data, and more.
Metrica's source code showed that the service can collect geolocation data, including altitude, direction, and speed. Metrica also records the names of the Wi-Fi networks that people connect to.
Yandex sorts users into segments, and there are countless of them
All the data that Metrica collects is passed to Crypta. It is then bound to identifiers, which are additionally hashed.
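The idea of binding data to a hashed identifier can be illustrated with a minimal sketch. The article does not say which hash function or salting scheme Yandex uses, so the choice of SHA-256, the `SALT` value, and the function names below are all assumptions made for illustration.

```python
import hashlib

# Hypothetical salt: the article does not say whether identifiers are salted.
SALT = b"example-salt"


def hash_identifier(raw_id: str) -> str:
    """Return a hex digest that stands in for the raw identifier."""
    return hashlib.sha256(SALT + raw_id.encode("utf-8")).hexdigest()


def bind_event(raw_id: str, event: dict) -> dict:
    """Attach an event to the hashed identifier instead of the raw one."""
    return {"id": hash_identifier(raw_id), **event}


# Illustrative event; the field names are invented for this example.
record = bind_event("cookie-12345", {"kind": "page_view", "url": "https://example.com"})
```

Hashing is deterministic, so the same identifier always produces the same digest. Records therefore remain linkable to one another even though the raw identifier is no longer stored.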
For Crypta, a user is not a specific person with a first and last name, but a set of identifiers. Why a set? Each device and browser that a person uses to go online has its own identifier: a cookie, which sites use to recognize the user and, for example, avoid asking for a password on every visit. Applications have their own identifiers too: if an application (for example, Maps or Navigator) sends data to a Yandex server, the information tied to its identifier also reaches Crypta.
Yandex
Crypta works out when different IDs belong to the same person. It then sorts people into segments on various topics, so that everyone in a segment can be shown the same ad.
Crypta analyzes a person's behavior on the Internet and “calculates the probability” that they belong to one segment or another.
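What “calculating the probability” of segment membership could look like is sketched below. The article says only that Crypta uses about 300 factors and various machine learning methods; the logistic model, the three feature names, the weights, and the 0.5 threshold are all hypothetical stand-ins, not anything found in the leaked code.

```python
import math

# Hypothetical weights for an invented "travelers" segment.
WEIGHTS = {"searched_hotels": 1.5, "visited_airport": 2.0, "opened_maps_abroad": 1.0}
BIAS = -3.0


def segment_probability(features: dict) -> float:
    """Turn behavioral signals into a probability with a logistic function."""
    score = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-score))


def in_segment(features: dict, threshold: float = 0.5) -> bool:
    """Place the user in the segment when the probability clears the threshold."""
    return segment_probability(features) >= threshold


active = {"searched_hotels": 1.0, "visited_airport": 1.0, "opened_maps_abroad": 1.0}
print(in_segment(active))  # all three signals present
print(in_segment({}))      # no signals at all
```

The point of the sketch is that membership is a score rather than a certainty: a user with every signal lands in the segment, while a user with none does not.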
The amount of data that Yandex receives through Yandex Metrica is so huge that it is simply impossible to imagine. It is enough to build any grouping or segment of the audience.
Grigory Bakunov, former director of technology dissemination at Yandex
The segments that Crypta creates are very specific, and they show how revealing data about our online lives becomes when it is collected in one place. Among them are groups of people who use Yandex Stations, movie lovers grouped by genre, and segments of laptop users who searched for the Radisson hotel on the map.
An example of segments in Crypta.
The “smokers” group tracks people who buy smoking-related products, such as e-cigarettes. The “dachniks” segment uses location data to find people who have dachas. The “travelers” segment also uses geolocation to find people who travel, split into international and domestic trips. Part of the code pulled data from the Mail app and looked for “hotels” and “boarding passes” fields.
Yandex can combine identifiers into “families” if their IP addresses “intersect.” The data about a “family” includes the number of people in it and their sex and age.
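One way to picture grouping identifiers by “intersecting” IP addresses is the sketch below. The article does not describe the actual rule, so this is an assumption: two identifiers are linked if they were ever seen on the same IP address, and links are followed transitively using a union-find structure. The identifiers and addresses are invented.

```python
def group_into_families(ips_by_id: dict) -> list:
    """Group identifiers whose sets of IP addresses share at least one address."""
    parent = {ident: ident for ident in ips_by_id}

    def find(x):
        # Follow parent links to the group's root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen = {}  # IP address -> first identifier observed on it
    for ident, ips in ips_by_id.items():
        for ip in ips:
            if ip in first_seen:
                # Shared address: merge the two identifiers' groups.
                parent[find(ident)] = find(first_seen[ip])
            else:
                first_seen[ip] = ident

    families = {}
    for ident in ips_by_id:
        families.setdefault(find(ident), set()).add(ident)
    return list(families.values())


# Invented observations using documentation-reserved IP ranges.
observed = {
    "phone-a": {"203.0.113.7", "198.51.100.2"},
    "laptop-b": {"203.0.113.7"},
    "tablet-c": {"192.0.2.44"},
}
families = group_into_families(observed)
```

Here `phone-a` and `laptop-b` share an address and end up in one group, while `tablet-c` stays on its own. A real system would need extra safeguards, since public Wi-Fi or carrier networks put unrelated people behind the same address.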
Yandex services make it possible to predict whether people have children. For example, people may order taxis with child seats. According to Yandex data protection director Ivan Cherevko, this may indicate that the user will be interested in content for parents.
One element in the Crypta code shows how all this data can be combined. There is a user interface that acts like someone's profile. It shows a person's marital status, projected income, whether they have children, and three interests drawn from general topics such as home appliances, food, clothing, and leisure.
Cherevko said it was “Yandex's internal tool” where employees can see how Crypta's algorithms classify them, and that they can only access their own information.
Collecting this amount of data is standard practice for Internet companies
Google Analytics interface, similar to Yandex Metrica.
McCrea noted that some of this information “doesn’t seem out of the ordinary” for online advertising. Ivan Cherevko is of the same opinion.
He added that grouping users by interest is “standard industry practice.” Collecting this information makes it possible to show people specific ads: “gardening goods to users who are interested in dachas, and auto parts to those who visit gas stations.” According to Cherevko, all data at Yandex is anonymized.
In Crypta, a user is represented as a set of identifiers, and the system cannot link them to a real-world person. Such a set is only probabilistic.
Ivan Cherevko, Yandex Data Protection Director
According to Cherevko, Crypta does not have access to users' emails. The hotel and boarding pass fields found in the Mail code were an experiment. Crypta received only anonymized information from Mail, and this method has not been used since 2019. Cherevko also said that Yandex deletes the user geolocation data collected by Metrica after 14 days.
How Yandex actually collects information is unknown
Screenshot of the forum where the archive was posted.
45 GB of source code covers many Yandex services. The main programming languages used are Python, C++ and YQL.
The leak contains only code, not a full repository with its change history. This means one can only infer what the code does; it is not possible to establish exactly which parts of it are deployed or currently in use.
Cherevko claims that the “code snippets” are outdated, and that part of the source code was “never actually used” by Yandex.
Also, according to a Yandex representative, the company uses user data only to create new services and improve existing ones. It never sells data or discloses it to third parties without the user's consent.
Source: Iphones RU
