added query

Brushed up the bullet points, added a the query to tag the flags, added explanation to the query
filecoin-project · May 3, 2023 · baaed2a · baaed2a
1 parent 8efafbc
commit baaed2a
Showing 1 changed file with 163 additions and 3 deletions.
diff --git a/quality-tracking-metrics.md b/quality-tracking-metrics.md
@@ -1,5 +1,165 @@
 The current metric of measuring quality data onboarding / potential abuse consists of:
-
-- Data stored in deals where the CID is shared across client DataCap applications (n-1*size of CID) - If two clients use the same dataset, if preparation is diff., then CID diff.
-- Data stored in deals where the CID has been stored more than 2 times with the same SP ID (n-1*size of CID)
+- Data stored in deals where the CID is shared across client DataCap applications 
+    - $(n-1)*\text{size of CID}$, where n is the number of duplicates
+    - If two clients use the same dataset, but preparation is different, then we are counting them as different CIDs.
+    - This likely means that the data is being duplicated many times and the client is storing it with the same SP
+- Data stored in deals where the CID has been stored more than 2 times with the same SP ID
+    - $(n-1)*\text{size of CID}$, where n is the number of duplicates
+    - This likely means that the same data is being shared and stored by different clients
 - All data stored by a client address where the client address has used between 10% and 90% of their DataCap allocation, has not made any deals in the last 6 weeks, and has stored everything in unique CIDs
+    - This likely means that this not being duplicated at all and therefore not being distributed enough
+
+The following is the a query specifying how we will tag deals under these three flags and its explanation:
+
+```
+CREATE OR REPLACE FUNCTION epoch_to_timestamp(epoch INTEGER)
+    RETURNS INTEGER AS $$
+BEGIN
+    RETURN epoch * 30 + 1598306400;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION tag_over_replicated_deals()
+    RETURNS VOID AS $$
+BEGIN
+    INSERT INTO deal_tags (deal_id, cid_overreplicated, sector_start, piece_size)
+    SELECT
+        cs.deal_id,
+        cs.row_number > 1 AS cid_overreplicated,
+        TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)),
+        cs.piece_size
+    FROM (
+             SELECT
+                 *,
+                 ROW_NUMBER() OVER (
+                     PARTITION BY piece_cid, provider
+                     ORDER BY sector_start_epoch
+                     ) AS row_number
+             FROM
+                 current_state
+             WHERE
+                     verified_deal = true
+               AND sector_start_epoch > 0
+         ) AS cs
+    ON CONFLICT (deal_id)
+        DO UPDATE SET cid_overreplicated = EXCLUDED.cid_overreplicated,
+                      sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size;
+END;
+$$ LANGUAGE plpgsql;
+
+CREATE OR REPLACE FUNCTION tag_cid_sharing_deals()
+    RETURNS VOID AS $$
+BEGIN
+    WITH owner_count AS (
+        SELECT
+            piece_cid,
+            COUNT(DISTINCT owner) AS owner_count
+        FROM (
+                 SELECT
+                     current_state.piece_cid,
+                     COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner
+                 FROM
+                     current_state
+                         LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address
+                 WHERE
+                         verified_deal = true
+                   AND sector_start_epoch > 0
+             ) AS distinct_owners
+        GROUP BY piece_cid
+    )
+    INSERT INTO deal_tags (deal_id, cid_shared, sector_start, piece_size)
+    SELECT
+        cs.deal_id,
+        (cs.row_number > 1 AND oc.owner_count > 1) AS cid_shared,
+        TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)),
+        cs.piece_size
+    FROM (
+             SELECT
+                 cs_with_owner.*,
+                 ROW_NUMBER() OVER (
+                     PARTITION BY piece_cid, owner
+                     ORDER BY sector_start_epoch
+                     ) AS row_number
+             FROM (
+                      SELECT
+                          current_state.*,
+                          COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner
+                      FROM
+                          current_state
+                              LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address
+                      WHERE
+                              verified_deal = true
+                        AND sector_start_epoch > 0
+                  ) AS cs_with_owner
+         ) AS cs
+             JOIN owner_count oc ON cs.piece_cid = oc.piece_cid
+    ON CONFLICT (deal_id)
+        DO UPDATE SET cid_shared = EXCLUDED.cid_shared,
+                      sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size;
+END;
+$$ LANGUAGE plpgsql;
+
+
+CREATE OR REPLACE FUNCTION tag_cid_unique_deals()
+    RETURNS VOID AS $$
+BEGIN
+    WITH eligible_clients AS (
+        SELECT
+            COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client,
+            SUM(cs.piece_size) AS total_storage,
+            MAX(cs.sector_start_epoch) AS max_sector_start_epoch
+        FROM
+            current_state cs
+                LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address
+        WHERE
+                cs.verified_deal = true
+          AND cs.sector_start_epoch > 0
+        GROUP BY
+            grouped_client
+        HAVING
+                SUM(cs.piece_size) > (1::BIGINT * 1024 * 1024 * 1024 * 1024)
+           AND (TO_TIMESTAMP(epoch_to_timestamp(MAX(cs.sector_start_epoch))) + interval '6 weeks') <= NOW()
+    ),
+         unique_deals AS (
+             SELECT
+                 cs.*,
+                 COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client,
+                 ROW_NUMBER() OVER (
+                     PARTITION BY COALESCE(gcm.unique_id::TEXT, cs.client), cs.piece_cid
+                     ORDER BY cs.sector_start_epoch
+                     ) AS row_number
+             FROM
+                 current_state cs
+                     LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address
+             WHERE
+                     cs.verified_deal = true
+               AND cs.sector_start_epoch > 0
+         )
+    INSERT INTO deal_tags (deal_id, cid_unique, sector_start, piece_size)
+    SELECT
+        ud.deal_id,
+        (ud.row_number = 1 AND ec.grouped_client IS NOT NULL) AS cid_unique,
+        TO_TIMESTAMP(epoch_to_timestamp(ud.sector_start_epoch)),
+        ud.piece_size
+    FROM
+        unique_deals ud
+            LEFT JOIN eligible_clients ec ON ud.grouped_client = ec.grouped_client
+    ON CONFLICT (deal_id)
+        DO UPDATE SET
+                      cid_unique = EXCLUDED.cid_unique,
+                      sector_start = EXCLUDED.sector_start,
+                      piece_size = EXCLUDED.piece_size;
+END;
+$$ LANGUAGE plpgsql;
+```
+This code defines four PostgreSQL functions using PL/pgSQL, a procedural language for PostgreSQL. These functions are used to tag and process deal data related to file storage.
+
+epoch_to_timestamp(epoch INTEGER): This function takes an integer epoch as input and returns an integer. It converts the given epoch value to a timestamp by multiplying it by 30 and then adding the constant 1598306400. The returned value represents Unix time.
+
+tag_over_replicated_deals(): This function tags deals that are over-replicated. It does so by inserting or updating deal tags in the deal_tags table based on the calculated cid_overreplicated value, which indicates whether the deal is over-replicated or not. The function also sets sector_start and piece_size columns for the corresponding deal.
+
+tag_cid_sharing_deals(): This function tags deals that share content identifiers (CIDs) across multiple clients. It calculates the cid_shared value, which indicates whether the deal shares a CID with other deals, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal.
+
+tag_cid_unique_deals(): This function tags deals with unique content identifiers (CIDs) for clients that have stored more than 1 TiB of data and have not added any new storage within the last 6 weeks. It calculates the cid_unique value, which indicates whether the deal has a unique CID or not, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal.
+
+In summary, these four functions are designed to process and tag deal data in a PostgreSQL database, which can be useful for tracking and analyzing data storage deals.