-
Notifications
You must be signed in to change notification settings - Fork 58
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Brushed up the bullet points, added a the query to tag the flags, added explanation to the query
- Loading branch information
Showing
1 changed file
with
163 additions
and
3 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,5 +1,165 @@ | ||
The current metric of measuring quality data onboarding / potential abuse consists of: | ||
|
||
- Data stored in deals where the CID is shared across client DataCap applications (n-1*size of CID) - If two clients use the same dataset, if preparation is diff., then CID diff. | ||
- Data stored in deals where the CID has been stored more than 2 times with the same SP ID (n-1*size of CID) | ||
- Data stored in deals where the CID is shared across client DataCap applications | ||
- $(n-1)*\text{size of CID}$, where n is the number of duplicates | ||
- If two clients use the same dataset, but preparation is different, then we are counting them as different CIDs. | ||
- This likely means that the data is being duplicated many times and the client is storing it with the same SP | ||
- Data stored in deals where the CID has been stored more than 2 times with the same SP ID | ||
- $(n-1)*\text{size of CID}$, where n is the number of duplicates | ||
- This likely means that the same data is being shared and stored by different clients | ||
- All data stored by a client address where the client address has used between 10% and 90% of their DataCap allocation, has not made any deals in the last 6 weeks, and has stored everything in unique CIDs | ||
- This likely means that this not being duplicated at all and therefore not being distributed enough | ||
|
||
The following is the a query specifying how we will tag deals under these three flags and its explanation: | ||
|
||
``` | ||
CREATE OR REPLACE FUNCTION epoch_to_timestamp(epoch INTEGER) | ||
RETURNS INTEGER AS $$ | ||
BEGIN | ||
RETURN epoch * 30 + 1598306400; | ||
END; | ||
$$ LANGUAGE plpgsql; | ||
CREATE OR REPLACE FUNCTION tag_over_replicated_deals() | ||
RETURNS VOID AS $$ | ||
BEGIN | ||
INSERT INTO deal_tags (deal_id, cid_overreplicated, sector_start, piece_size) | ||
SELECT | ||
cs.deal_id, | ||
cs.row_number > 1 AS cid_overreplicated, | ||
TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)), | ||
cs.piece_size | ||
FROM ( | ||
SELECT | ||
*, | ||
ROW_NUMBER() OVER ( | ||
PARTITION BY piece_cid, provider | ||
ORDER BY sector_start_epoch | ||
) AS row_number | ||
FROM | ||
current_state | ||
WHERE | ||
verified_deal = true | ||
AND sector_start_epoch > 0 | ||
) AS cs | ||
ON CONFLICT (deal_id) | ||
DO UPDATE SET cid_overreplicated = EXCLUDED.cid_overreplicated, | ||
sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size; | ||
END; | ||
$$ LANGUAGE plpgsql; | ||
CREATE OR REPLACE FUNCTION tag_cid_sharing_deals() | ||
RETURNS VOID AS $$ | ||
BEGIN | ||
WITH owner_count AS ( | ||
SELECT | ||
piece_cid, | ||
COUNT(DISTINCT owner) AS owner_count | ||
FROM ( | ||
SELECT | ||
current_state.piece_cid, | ||
COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner | ||
FROM | ||
current_state | ||
LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address | ||
WHERE | ||
verified_deal = true | ||
AND sector_start_epoch > 0 | ||
) AS distinct_owners | ||
GROUP BY piece_cid | ||
) | ||
INSERT INTO deal_tags (deal_id, cid_shared, sector_start, piece_size) | ||
SELECT | ||
cs.deal_id, | ||
(cs.row_number > 1 AND oc.owner_count > 1) AS cid_shared, | ||
TO_TIMESTAMP(epoch_to_timestamp(cs.sector_start_epoch)), | ||
cs.piece_size | ||
FROM ( | ||
SELECT | ||
cs_with_owner.*, | ||
ROW_NUMBER() OVER ( | ||
PARTITION BY piece_cid, owner | ||
ORDER BY sector_start_epoch | ||
) AS row_number | ||
FROM ( | ||
SELECT | ||
current_state.*, | ||
COALESCE(github_client_mapping.unique_id::TEXT, current_state.client) AS owner | ||
FROM | ||
current_state | ||
LEFT JOIN github_client_mapping ON current_state.client = github_client_mapping.client_address | ||
WHERE | ||
verified_deal = true | ||
AND sector_start_epoch > 0 | ||
) AS cs_with_owner | ||
) AS cs | ||
JOIN owner_count oc ON cs.piece_cid = oc.piece_cid | ||
ON CONFLICT (deal_id) | ||
DO UPDATE SET cid_shared = EXCLUDED.cid_shared, | ||
sector_start = EXCLUDED.sector_start, piece_size = EXCLUDED.piece_size; | ||
END; | ||
$$ LANGUAGE plpgsql; | ||
CREATE OR REPLACE FUNCTION tag_cid_unique_deals() | ||
RETURNS VOID AS $$ | ||
BEGIN | ||
WITH eligible_clients AS ( | ||
SELECT | ||
COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client, | ||
SUM(cs.piece_size) AS total_storage, | ||
MAX(cs.sector_start_epoch) AS max_sector_start_epoch | ||
FROM | ||
current_state cs | ||
LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address | ||
WHERE | ||
cs.verified_deal = true | ||
AND cs.sector_start_epoch > 0 | ||
GROUP BY | ||
grouped_client | ||
HAVING | ||
SUM(cs.piece_size) > (1::BIGINT * 1024 * 1024 * 1024 * 1024) | ||
AND (TO_TIMESTAMP(epoch_to_timestamp(MAX(cs.sector_start_epoch))) + interval '6 weeks') <= NOW() | ||
), | ||
unique_deals AS ( | ||
SELECT | ||
cs.*, | ||
COALESCE(gcm.unique_id::TEXT, cs.client) AS grouped_client, | ||
ROW_NUMBER() OVER ( | ||
PARTITION BY COALESCE(gcm.unique_id::TEXT, cs.client), cs.piece_cid | ||
ORDER BY cs.sector_start_epoch | ||
) AS row_number | ||
FROM | ||
current_state cs | ||
LEFT JOIN github_client_mapping gcm ON cs.client = gcm.client_address | ||
WHERE | ||
cs.verified_deal = true | ||
AND cs.sector_start_epoch > 0 | ||
) | ||
INSERT INTO deal_tags (deal_id, cid_unique, sector_start, piece_size) | ||
SELECT | ||
ud.deal_id, | ||
(ud.row_number = 1 AND ec.grouped_client IS NOT NULL) AS cid_unique, | ||
TO_TIMESTAMP(epoch_to_timestamp(ud.sector_start_epoch)), | ||
ud.piece_size | ||
FROM | ||
unique_deals ud | ||
LEFT JOIN eligible_clients ec ON ud.grouped_client = ec.grouped_client | ||
ON CONFLICT (deal_id) | ||
DO UPDATE SET | ||
cid_unique = EXCLUDED.cid_unique, | ||
sector_start = EXCLUDED.sector_start, | ||
piece_size = EXCLUDED.piece_size; | ||
END; | ||
$$ LANGUAGE plpgsql; | ||
``` | ||
This code defines four PostgreSQL functions using PL/pgSQL, a procedural language for PostgreSQL. These functions are used to tag and process deal data related to file storage. | ||
|
||
epoch_to_timestamp(epoch INTEGER): This function takes an integer epoch as input and returns an integer. It converts the given epoch value to a timestamp by multiplying it by 30 and then adding the constant 1598306400. The returned value represents Unix time. | ||
|
||
tag_over_replicated_deals(): This function tags deals that are over-replicated. It does so by inserting or updating deal tags in the deal_tags table based on the calculated cid_overreplicated value, which indicates whether the deal is over-replicated or not. The function also sets sector_start and piece_size columns for the corresponding deal. | ||
|
||
tag_cid_sharing_deals(): This function tags deals that share content identifiers (CIDs) across multiple clients. It calculates the cid_shared value, which indicates whether the deal shares a CID with other deals, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal. | ||
|
||
tag_cid_unique_deals(): This function tags deals with unique content identifiers (CIDs) for clients that have stored more than 1 TiB of data and have not added any new storage within the last 6 weeks. It calculates the cid_unique value, which indicates whether the deal has a unique CID or not, and inserts or updates the deal_tags table with this value. The function also sets sector_start and piece_size columns for the corresponding deal. | ||
|
||
In summary, these four functions are designed to process and tag deal data in a PostgreSQL database, which can be useful for tracking and analyzing data storage deals. |