Skip to content

Commit

Permalink
added query
Browse files Browse the repository at this point in the history
Brushed up the bullet points, added a the query to tag the flags, added explanation to the query
  • Loading branch information
panges2 authored May 3, 2023
1 parent 8efafbc commit baaed2a
Showing 1 changed file with 163 additions and 3 deletions.
166 changes: 163 additions & 3 deletions quality-tracking-metrics.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,165 @@
The current metric of measuring quality data onboarding / potential abuse consists of:

- Data stored in deals where the CID is shared across client DataCap applications (n-1*size of CID) - If two clients use the same dataset, if preparation is diff., then CID diff.
- Data stored in deals where the CID has been stored more than 2 times with the same SP ID (n-1*size of CID)
- Data stored in deals where the CID is shared across client DataCap applications
- $(n-1)*\text{size of CID}$, where n is the number of duplicates
- If two clients use the same dataset, but preparation is different, then we are counting them as different CIDs.
- This likely means that the data is being duplicated many times and the client is storing it with the same SP
- Data stored in deals where the CID has been stored more than 2 times with the same SP ID
- $(n-1)*\text{size of CID}$, where n is the number of duplicates
- This likely means that the same data is being shared and stored by different clients
- All data stored by a client address where the client address has used between 10% and 90% of their DataCap allocation, has not made any deals in the last 6 weeks, and has stored everything in unique CIDs
- This likely means that this not being duplicated at all and therefore not being distributed enough

The following is the a query specifying how we will tag deals under these three flags and its explanation:

```
CREATE OR REPLACE FUNCTION epoch_to_timestamp(epoch INTEGER)
RETURNS INTEGER AS $$
BEGIN
RETURN epoch * 30 + 1598306400;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION tag_over_replicated_deals()
RETURNS VOID AS $$
BEGIN
INSERT INTO deal_tags (deal_id, cid_overreplicated, sector_start, piece_size)
SELECT
cs.deal_id,
cs.row_number > 1 AS cid_overreplicated,
TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)),
cs.piece_size
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY piece_cid, provider
ORDER BY sector_start_epoch
) AS row_number
FROM
current_state
WHERE
verified_deal = true
AND sector_start_epoch > 0
) AS cs
ON CONFLICT (deal_id)
DO UPDATE SET cid_overreplicated = EXCLUDED.cid_overreplicated,
sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION tag_cid_sharing_deals()
RETURNS VOID AS $$
BEGIN
WITH owner_count AS (
SELECT
piece_cid,
COUNT(DISTINCT owner) AS owner_count
FROM (
SELECT
current_state.piece_cid,
COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner
FROM
current_state
LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address
WHERE
verified_deal = true
AND sector_start_epoch > 0
) AS distinct_owners
GROUP BY piece_cid
)
INSERT INTO deal_tags (deal_id, cid_shared, sector_start, piece_size)
SELECT
cs.deal_id,
(cs.row_number > 1 AND oc.owner_count > 1) AS cid_shared,
TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)),
cs.piece_size
FROM (
SELECT
cs_with_owner.*,
ROW_NUMBER() OVER (
PARTITION BY piece_cid, owner
ORDER BY sector_start_epoch
) AS row_number
FROM (
SELECT
current_state.*,
COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner
FROM
current_state
LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address
WHERE
verified_deal = true
AND sector_start_epoch > 0
) AS cs_with_owner
) AS cs
JOIN owner_count oc ON cs.piece_cid = oc.piece_cid
ON CONFLICT (deal_id)
DO UPDATE SET cid_shared = EXCLUDED.cid_shared,
sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size;
END;
$$ LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION tag_cid_unique_deals()
RETURNS VOID AS $$
BEGIN
WITH eligible_clients AS (
SELECT
COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client,
SUM(cs.piece_size) AS total_storage,
MAX(cs.sector_start_epoch) AS max_sector_start_epoch
FROM
current_state cs
LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address
WHERE
cs.verified_deal = true
AND cs.sector_start_epoch > 0
GROUP BY
grouped_client
HAVING
SUM(cs.piece_size) > (1::BIGINT * 1024 * 1024 * 1024 * 1024)
AND (TO_TIMESTAMP(epoch_to_timestamp(MAX(cs.sector_start_epoch))) + interval '6 weeks') <= NOW()
),
unique_deals AS (
SELECT
cs.*,
COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client,
ROW_NUMBER() OVER (
PARTITION BY COALESCE(gcm.unique_id::TEXT, cs.client), cs.piece_cid
ORDER BY cs.sector_start_epoch
) AS row_number
FROM
current_state cs
LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address
WHERE
cs.verified_deal = true
AND cs.sector_start_epoch > 0
)
INSERT INTO deal_tags (deal_id, cid_unique, sector_start, piece_size)
SELECT
ud.deal_id,
(ud.row_number = 1 AND ec.grouped_client IS NOT NULL) AS cid_unique,
TO_TIMESTAMP(epoch_to_timestamp(ud.sector_start_epoch)),
ud.piece_size
FROM
unique_deals ud
LEFT JOIN eligible_clients ec ON ud.grouped_client = ec.grouped_client
ON CONFLICT (deal_id)
DO UPDATE SET
cid_unique = EXCLUDED.cid_unique,
sector_start = EXCLUDED.sector_start,
piece_size = EXCLUDED.piece_size;
END;
$$ LANGUAGE plpgsql;
```
This code defines four PostgreSQL functions using PL/pgSQL, a procedural language for PostgreSQL. These functions are used to tag and process deal data related to file storage.

epoch_to_timestamp(epoch INTEGER): This function takes an integer epoch as input and returns an integer. It converts the given epoch value to a timestamp by multiplying it by 30 and then adding the constant 1598306400. The returned value represents Unix time.

tag_over_replicated_deals(): This function tags deals that are over-replicated. It does so by inserting or updating deal tags in the deal_tags table based on the calculated cid_overreplicated value, which indicates whether the deal is over-replicated or not. The function also sets sector_start and piece_size columns for the corresponding deal.

tag_cid_sharing_deals(): This function tags deals that share content identifiers (CIDs) across multiple clients. It calculates the cid_shared value, which indicates whether the deal shares a CID with other deals, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal.

tag_cid_unique_deals(): This function tags deals with unique content identifiers (CIDs) for clients that have stored more than 1 TiB of data and have not added any new storage within the last 6 weeks. It calculates the cid_unique value, which indicates whether the deal has a unique CID or not, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal.

In summary, these four functions are designed to process and tag deal data in a PostgreSQL database, which can be useful for tracking and analyzing data storage deals.

0 comments on commit baaed2a

Please sign in to comment.