3/9/13

up-to-date profiles in nar-nar..

nar-nar shows profiles, and people review them, but the people need to click a link to see the newest stuff in the profile. It would be nice if the tool showed the newest version of the profile.

getting this information requires a "profile key". My profile key is "~01f894483a6d5ee6c1". Using this key, one can go to https://www.odesk.com/users/~01f894483a6d5ee6c1 to see my profile. One can also get that information in json format by going to http://www.odesk.com/api/profiles/v1/providers/~01f894483a6d5ee6c1.json.
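
a tiny sketch of how the two URLs above are built from a profile key.. the function name `profileUrls` is made up for illustration; only the URL patterns come from the examples above:

```javascript
// Build the human-readable and JSON URLs for a profile key.
// profileUrls is a hypothetical helper name; the URL patterns are the
// ones quoted in the paragraph above.
function profileUrls(profileKey) {
    return {
        html: 'https://www.odesk.com/users/' + profileKey,
        json: 'http://www.odesk.com/api/profiles/v1/providers/' + profileKey + '.json'
    };
}

var urls = profileUrls('~01f894483a6d5ee6c1');
console.log(urls.html);
console.log(urls.json);
```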

it used to be that nar-nar wasn't receiving the profile keys, and they're not easy to get just from a user's username. But now it is receiving them..

so.. here's what I want to happen:
- I want all the records in nar-nar to have a profileKey.. new ones should get it the way they're getting it.. for old ones, I think I'll need to run some SQL queries on the main database
- when accounts are loaded for review, I want them to update to the latest and greatest information based on calls to the http://www.odesk.com/api/profiles/v1/providers API call.. assuming this is fast enough (I'll be making 10 such API calls for each batch..)

- there's also an issue where I'm ignoring errors inserting stuff into the database. I do this because that error usually means that the item was already there; however, it may mean an actual error, in which case I want to know about it.. and it may be that there were some accounts that I failed to insert into nar-nar and don't know about because they had an error..
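
to make the "10 API calls for each batch" idea concrete, here's a rough sketch that fires all the lookups for a batch at once and waits for the last one to come back.. `refreshBatch` and `fetchProfile` are hypothetical names (`fetchProfile` stands in for whatever actually calls the providers API), and the record fields are assumptions:

```javascript
// Refresh every record in a batch in parallel.
// fetchProfile(profileKey, callback) is a hypothetical helper that would
// call the providers API and pass (err, profile) to its callback.
function refreshBatch(records, fetchProfile, done) {
    var remaining = records.length;
    var results = new Array(records.length);
    var failed = false;

    if (remaining === 0) return done(null, results);

    records.forEach(function (record, i) {
        fetchProfile(record.profileKey, function (err, profile) {
            if (failed) return;             // an earlier call already reported an error
            if (err) {
                failed = true;
                return done(err);
            }
            results[i] = {
                _id: record._id,
                profileKey: record.profileKey,
                profile: profile            // the freshly fetched data
            };
            remaining -= 1;
            if (remaining === 0) done(null, results);
        });
    });
}

// Stand-in fetcher so the sketch runs without touching the network.
function fakeFetch(profileKey, callback) {
    callback(null, { fetchedFor: profileKey });
}

refreshBatch([{ _id: 'alice', profileKey: '~aaa' }], fakeFetch, function (err, fresh) {
    if (err) throw err;
    console.log(fresh);
});
```

since the calls run side by side, the batch should take roughly as long as the slowest single call rather than the sum of all ten.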

so, where to begin.. I feel like the place to start is making sure that all new stuff is processed correctly, and then try and repair the old stuff..

this involves two things for new stuff:
- get the profileKey
- check for errors when inserting into the database

checking for errors when inserting stuff is easier when using an "upsert", which either updates the item, or inserts it. I think this would make sense if the _id's are usernames.. though currently they are not.. they're a concatenation of _id and other information which may change..
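
a minimal sketch of what the upsert could look like once the _id is the username.. this assumes a mongojs-style collection whose `update(query, update, options, callback)` accepts `{ upsert: true }`; `upsertAccount` and the `username` field are names I'm making up here:

```javascript
// Insert the account if its username is not in the collection yet,
// otherwise update the existing record in place.
// "collection" is assumed to be a mongojs-style collection object.
function upsertAccount(collection, account, callback) {
    collection.update(
        { _id: account.username },   // the username doubles as the _id
        { $set: account },           // overwrite the stored fields
        { upsert: true },            // create the record when it is missing
        callback                     // any error reported here is a real error
    );
}
```

with this shape there is no "already exists" error to ignore, so anything the callback reports is worth looking at.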

I think it would be easier to handle this if the _id's were usernames. That would just make more sense. Hm.. how to convert them.. I think I need to delete records, and insert new ones with the new _id. One issue is that the system is live, meaning people are using it, so I don't want to change _id's of records that people are processing, since they'll get errors when they submit those results..

I think we want a script that works as follows:
- find items with old _id which are not grabbed by anyone, delete and readd them (telling me if it fails)
- keep running this until all the items use the new _id

of course, I should probably start by making the insertion script use the new _id..

ok, let's have a temporary flag that says whether an item is processed.. we'll call it "new_id".
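
roughly, the two pieces of logic the conversion script needs.. `grabbedBy` and `username` are hypothetical field names (I'm not showing the real schema here); `new_id` is the temporary flag just mentioned:

```javascript
// True for records that still use the old _id and that nobody is
// currently reviewing. "grabbedBy" is a hypothetical field name.
function needsMigration(record) {
    return !record.new_id && !record.grabbedBy;
}

// Return a copy of the record keyed by username and flagged as converted.
// The original object is left untouched so it can still be printed or
// saved somewhere before anything is deleted.
function withUsernameId(record) {
    var copy = {};
    Object.keys(record).forEach(function (key) {
        copy[key] = record[key];
    });
    copy._id = record.username;
    copy.new_id = true;
    return copy;
}
```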

hm.. let's see when the cron job is scheduled next.. actually, I know it's every hour, and right now it's 7:24, so I should be fine.. let's see if there's a new csv ready.. yup, there's a new one (added at 7:03), so we'll need to remember to mark it as unread after processing it with my test script..

ok, script seems to work.. let's upload it to nar-nar, so all the incoming stuff is processed correctly..

ok, I put some stuff in to effectively only insert items if they don't exist already.. this seems silly.. I feel like another approach would be to check for the error that happens in insert when the item already exists, and ignore just that error, but that seems less stable, since I'd probably have to check for some text in the error which might change in the future.. let's do a quick check to see if mongodb has an option to pass to insert to suppress that error.. hm.. I don't see one

humpf, ok, well, I guess I'll use my hack.. but I need to test it..

hm.. my hack doesn't work.. it finds an object with the given _id, but the object doesn't meet my other criteria, so instead of updating it, it tries to insert it, which fails, because an object with the given _id already exists..

let's see what the error looks like for failing an insert.. it looks like this: "MongoError: E11000 duplicate key error index"

let's see if that "E11000" seems likely to stick around.. well, let's assume so..

ok, I think the error object has a "name" of "MongoError" and "code" of "11000".. let's verify that in a test script.. ok good, this seems to work:


    if (e.name == 'MongoError' && e.code == 11000) {
        ...
    }


..and doesn't seem too short-lived..
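
for reference, a sketch of how that check can be wrapped around an insert so that only the duplicate-key case is swallowed.. it assumes a mongojs-style `collection.insert(doc, callback)`; `insertIfNew` is a made-up name, and the `name`/`code` values are the ones verified above:

```javascript
// Insert a document, treating "already exists" as a non-error.
// Calls back with (null, true) when the document was inserted,
// (null, false) when it was already there, and (err) for anything else.
function insertIfNew(collection, doc, callback) {
    collection.insert(doc, function (err) {
        if (err && err.name == 'MongoError' && err.code == 11000) {
            return callback(null, false);   // duplicate key: the item was already there
        }
        if (err) return callback(err);      // a real error, so report it
        callback(null, true);
    });
}
```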

ok, done.. it is now adding the profileKey and using the username as the _id.. we'll need to test that on the server when the next batch comes along..

for now, let's fix the old items..

let's make a script that searches for records without "new_id" which are not currently grabbed by anyone, deletes them, and re-adds them with the new _id..

ugh.. despite my precautions, I managed to delete half the database.. long story short, mongojs's forEach does not wait for the first callback to finish before making the second callback, so the script was blocked on many pseudo-threads removing items, and I just thought it was hung up waiting to process the first item (which seemed to be taking too long), so I terminated the process..
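
what I actually wanted was one-at-a-time processing.. a minimal sketch of that, assuming the matching records have first been loaded into a plain array (`processInSeries` and `worker` are hypothetical names, not anything from mongojs):

```javascript
// Run worker(item, callback) on each item in order, starting the next
// item only after the previous one has called back. Stops at the first error.
function processInSeries(items, worker, done) {
    var index = 0;

    function next(err) {
        if (err) return done(err);
        if (index >= items.length) return done(null);
        var item = items[index];
        index += 1;
        worker(item, next);   // "next" runs only when this item is finished
    }

    next();
}

// Example: log each item before "processing" it.
processInSeries(['a', 'b', 'c'], function (item, callback) {
    console.log('processing', item);
    callback(null);
}, function (err) {
    if (err) throw err;
    console.log('all done');
});
```

killing a script like this partway through leaves at most one item half-processed, instead of many. (if the worker calls back synchronously, a very long array could run into the call stack limit; real database callbacks are asynchronous, so that shouldn't come up there.)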

..I think I need to ask mongoHQ for a backup.. hm.. apparently this costs money.. $250. sigh..

ok, e-mail sent to mongoHQ.. although my precautions did not include backing up the database (cry), I did at least print out items before deleting them, so many items can be recovered that way (though not all, because I ran the script a bit before adding even that precaution)..

ok, the items I printed have been added back.. and the script now seems to be correctly processing the remaining items.. though it's going to take a while.. fortunately the script doesn't need to run in one go. it will pick up where it left off..

ok, let's figure out how to get the profileKeys for the old items..

ok, I have access to the database.. just need to figure out where the data that I want is located.. I'd like a visual view of all the tables.. I had some program for that before I re-installed my operating system.. I wonder what it was called..

I think it was called "navicat".. hm.. not free, but a 30 day trial. I suppose I have another 30 days since I wiped my operating system..

great, I found the table I need.. now it has many items.. I think I'd like to get a dump of the last 2 months..

hm.. the most recent one is a day ago, so I guess I need to wait a day or two before it's caught up.. or.. let's see, how recently have I been getting profileKeys in the spreadsheets..

hm.. there's like a 5 hour gap.. drat.. I'll have to wait.. or maybe I should just set the items I have for now..

ok, let's gather all the spreadsheets I have into a giant table.. it could be useful..

humpf, there is a lot of effort to repair the past.. it's so much easier to just make things work better going forward, and in this case, I feel like people would be in support of that.. maybe I shouldn't be spending time trying to repair the past at all.. alas, I don't quite have the authority to make that call..

so, here's where things are at:
- I screwed up and deleted a lot of data. Hopefully mongoHQ will get back to me about that tomorrow
- I have a script updating a bunch of records in the most inefficient way possible, so it will take a while.. probably a couple more hours
- I have almost all the data I need to set all the profileKeys, but not quite.. hopefully I'll have it tomorrow

I'm waiting on so many things.. all of which are probably not worth much, since really this system just needs to work better going forward. I just want to cross this thing off my list, and I can't :(

update:

I slept a bit.. it finished updating the records to have usernames as keys..

now it's setting as many profileKeys as it can. It was doing that in a super inefficient way too, but I found a way to make it more efficient, involving mongodb's eval.. I could have done a similar thing with the script to delete/re-add items, but it would have been more dangerous, and I wanted to make more sure that I was alerted to any mishaps with that script.. the script to add profileKeys is a less dangerous script..

done.. some items remain undone, presumably because they were added during the 5-hour gap, so I'll need to set them later when the database updates..

let's see how many of them will actually be an issue.. hm.. it's a much smaller number, but non-zero..

the whole point of trying to update all the items to include a profileKey is so I don't need to write special-case code that does one thing if I have the profileKey, and another if I don't.. so I feel like having any items that violate this policy is as bad as having many items which violate it.

so I'll have to wait.. OR, I could download all the offending items and remove them from the database (it's not as if I haven't removed lots of other items accidentally), and add them back later when I add back the rest..

..before deleting anything, we'll naturally go ahead and back up the database..

done. note to self: the deleted items are in "_on_ice.txt"

so now, since everything has a profileKey, we can add code to take advantage of that..

ok, it works.. though now I feel like all the information stored in my records is essentially useless, since I fetch it all again before displaying it using the profileKey. I think I should just be storing the profileKey, and when people mark things as bad, store what it looked like when they marked it as bad..
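
a sketch of what that slimmer record might look like.. every name here (`markAsBad`, `review`, `snapshot`, `reviewer`) is hypothetical, just to illustrate "store the profileKey, plus a snapshot of what the reviewer saw":

```javascript
// Build the record to store when a reviewer marks a profile as bad.
// Only the profileKey is kept as permanent data; the profile as it looked
// at review time is frozen into a snapshot.
function markAsBad(record, currentProfile, reviewer, when) {
    return {
        _id: record._id,
        profileKey: record.profileKey,
        review: {
            verdict: 'bad',
            reviewer: reviewer,
            at: when,
            // deep copy, so later changes to currentProfile do not leak in
            snapshot: JSON.parse(JSON.stringify(currentProfile))
        }
    };
}
```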
