OCT 2nd 2012 - Talked with hollin
Hollin brought up a message framework tool named STORM that would likely allow us to solve the durability issue.
I was attempting to stream body as it came in into the riak db - which is what i think killed it.
Essentially There is a gap between 200 OK and where the filesystem has the one and only copy before it get's written the the slow log.
The idea that we discussed is around how we are saying "It's OK" to the mobile device when we havn't actually delivered the package - their is a gap in time before the package is off the filesystem localy and into what i am calling the slow-log (the database that can be used for evaluating the finished/unfinished requests for analytics or understading if the mobile's network or /server is healthy)
One possible solution is to use the traditional erlang approach of 3 nodes for 1 node and stream the file over to the 2 nodes for disk write as it is coming in - and then have the 'head' node commit to slow log.
This would cover the gap. The Storm project is about doing real time computating in a distributed way.
As the brits say - 'mind the gap'. It is important part of this problem not to tell a mobile device '200' when we cannot ensure that eventually the file is delivered.
Also Chappa-ai Upload is closer to a reverse CDN than a proxy.
Oct 1st 2012 - Recap --
file-system-store is the branch that is running on production.
the riak system has crashed do to the way that i am using it I think that is is suffering from the high replay of writes over during requests. Secondary Indexing seems to be failing on the final write.
I have mailed the riak list. I think that the durability of riak is the right thing - the memory/filesystem likely needs to be tuned.
My intuition is that similar issues would come up in other large datastores - only in different ways and at different times.
Memcache is another possible solution that i know but it is not durable and since this is not a caching problem, but a replay problem it seems unsuitable.
Couchdb has basically the same properties of Riak.
Mongodb has basically the same properties as Couchdb with alternative implementations of query and replication - It could be a good slow-log for very large logs.
It is my feeling that Riak could be tuned to this problem and that perhaps setting the n_val to 1 would turn it into a single node correctly for this purpose.
It may be the case that no datastore is appropriate for this problem - since the data need not be propigated or avalible to wide network during processing - a keyvalue store has little use except in later analysis.
The problem of replay is complex and durability is not really a large concern over time - only for a span of time whe the request has not been replayed - It has not been considered how long the storage system should maintain any history of traffic.
the current file-system-store approach is using pstore as an index , using the filesystem as the key-value bucket, and then processing based on the index.
process_write_to_server is still a manual process
My last thought is that the filesystem locally could be a fine bucket for incomming requests - and that the problem of logging and tracing completed requests could then be folded into a seperate datastore - for queriablity and traceing of data - something relatively simple like a SQL store could be used - it would be a slow record of the requests that have passed through local chappa-ai node.
This would turn the scaling problem into traditional distributed nodes and put limit on request processing into server read/write disk-IO: which is a simpler problem and generally known.
[DEVICE 1..N] ==> [INTERNET] ==> [CHAPPA-AI X..Y] => Essentially two input cases - get and post (cond ((get? request) (slow-log (response (chappa-ai-passthrough request,device) device), request, device) ((post? request) (slow-log (response (chappa-ai-replay request,device) device) , request, device)) )
September 19th 2012 - Recap --
My initial exploration began with em-proxy. At some point - it became clear early on that it would be difficult to customize it and the documentation was very poor - the biggest concern that I had with it was lack of the ability to declare dynamically the request destination and to fullfull the requirment of 'return 200' on POST immediately and then replay the request as needed.
It would be nice to use a simple existing proxy solution but unfortunalty the 'return 200 on POST' requirement turns the problem towards a larger unknown.
The condition of the Zebra server at the time of our original design meeting was that the client was being forced to implement reply/retry code within it's internals.
The original design meeting was centered around creating a device that could answer the following questions:
- Did we use/request HTTP Keep Alive for HTTP? (yes)
- What was the throughput for a device to chappa-ai and from chappa-ai to server-endpoint.
- What is the Average Time to upload a post? (Basic stats - as simple as possible to compute)
- What is the average time to upload a video? (100MB)
- How many times do 'request' connections need to be retried to the server-endpoint before success?
- How many successful bytes in a failed connection?
- How many successful connections?
Server Design Notes:
IOS will background connections at 10 minutes by closing.
Current Client implementation will : - use 6 conncurent connections - restart connections every 3 minutes if incomplete. - re-run error connections every 5 seconds - russians never give up.
GET requests should pass through immediatley and syncronously for stable app
All requests should return immediate success 200, if they are malformed we are returning 422.
A Zebra Transaction is a sequence of K POST Requests
A Transaction may be identified by it's 'qrcode' parameter
K POST's with time DELTA-1 must all occur successfully
Implementation Notes:
Riak worked out really well - Secondary Indexs are nice to use. Needs 1GB machine to run operating system + space for DB
Ruby encoding issues are cropping up here - forced_encoding seems to still be transmitting broken unicode to server.
Base64 encoded to get around Ruby
currently the software is using deviceSerial as bucket and the qrcode (base64encoded) as a key. it stores a base64 encoded version of the origianl HTTP POST body and sets 'requested?' to 0
current issues:
raw body has to be base64 encoding as ruby seems to be writing \u0234 characters in a ASCII_8BIT stream even if i ask it not to.
HTTP keep-alive and millisecond request measurement is broken - Thin::Stat is doing it correct hovever if you use rackup to boot the server
[email protected] blazingpair password
this host has single node of riak running
- upstart export
- http://michaelvanrooijen.com/articles/2011/06/08-managing-and-monitoring-your-ruby-application-with-foreman-and-upstart/
rvmsudo foreman export upstart /etc/init -a chappa-ai -u bc -p 5000 rvmsudo foreman export upstart /etc/init -a chappa-ai-80 -u bc -p 80
- as root
service chappa-ai-80 stop start service chappa-ai stop start
service riak stop start
- set up tunnel to riak to use recon on 9100
ssh -L 9100:0.0.0.0:8098 [email protected] ssh -L 9100:0.0.0.0:8098 [email protected]
then locally acess rekon to see your buckets
- == Riak setup
for secondary indexes
{storage_backend, riak_kv_eleveldb_backend }